Prostate cancer

The cell-free DNA methylome captures distinctions between localized and metastatic prostate tumors

This work complies with all relevant ethical regulations. All patients provided informed written consent and all samples were obtained upon approval of the institutional ethics committee and Research Ethics Board at the University Health Network (UHN) and the University of British Columbia (UBC), with compliance with all relevant ethical regulations. CPC and Barrier samples were retrieved from the UHN GU Biobank (REB file numbers: 11-0024(CPC) and 13-7122(Barrier)). VPC cfDNA from plasma was retrieved from the Vancouver Prostate Centre (VPC), UBC (REB file number: H18-00944). The WCDT cfDNA samples were retrieved from the University of California San Francisco (UCSF). The OHS samples were retrieved from the Ontario Institute for Cancer Research (OICR).

Contact for reagent and resource sharing

Further information and requests for resources should be directed to and will be fulfilled by Lead Contact, Housheng Hansen He ([email protected]).

Experimental models and subject details

Method details

Cell-free DNA isolation from plasma

Peripheral blood was collected from cancer patients using EDTA anticoagulant tubes. Plasma samples were isolated from whole blood using the UHN Biobank centrifugation protocol and stored at the UHN Biobank. 500 µl–1 ml plasma aliquots retrieved from the Biobank were immediately stored at −80 °C for short-term use. The cfDNA was isolated from plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen) according to the manufacturer’s protocol and quantified by Qubit (Thermo Fisher Scientific) before use. Within each experimental batch, samples were randomized by disease status and performed blinded during cfMeDIP-seq wet-lab processing.

Cell-free methylated DNA immunoprecipitation and sequencing

To prepare cfMeDIP libraries for sequencing, the original cfMeDIP-seq protocol was used16 on 5 ng of input cfDNA per sample. First, the samples underwent library preparation using Kapa HyperPrep Kit (Kapa Biosystems) for end-repair and A-tailing, following the manufacturer’s instructions. Samples were then ligated to 0.181uM of NEBNext adaptor (NEBNext Multiplex Oligos for Illumina kit, New England Biolabs) by incubating at 20 °C for 20 mins. The DNA was then purified with AMPure XP beads (Beckman Coulter). The library was then digested using USER enzyme (New England Biolabs) and then purified with Qiagen MinElute PCR purification kit (MinElute columns).

The prepared libraries were then combined with 95 ng of filler DNA (λ phage), and then MeDIP was performed using the Diagenode MagMeDIP kit (C02010021) using a previously published protocol26. The filler DNA consists of a mixture of unmethylated and in vitro methylated λ amplicons of different CpG densities: 1 CpG site, 5 CpG sites, 10 CpG sites, 15 CpG sites, and 20 CpG sites; all similar in size to cfDNA. This filler DNA ensures a constant ratio of antibody to input DNA and minimizes non-specific binding by the antibody and prevents cfDNA loss due to binding to plasticware. Once the prepared library and filler DNA were combined, 0.3 ng of control methylated and 0.3 ng of control unmethylated Arabidopsis thaliana DNA and the buffers from the MagMeDIP kit were added, as per the manufacturer’s instructions. The mixture was heated to 95 °C for 10 min, then immediately placed on ice for 10 mins. Each sample was partitioned into two 0.2 ml PCR tubes: one for 10% input control (7.9 µl) and the other for the sample to be subjected to immunoprecipitation (79 µl). The included 5mC monoclonal antibody (C15200081) from the MagMeDIP kit was diluted to 1:15 before adding it to the immunoprecipitation sample. MagMeDIP magnetic beads were then washed 2× with prepared buffers from the kit and added to the sample before incubation at 4 °C for 17 h with rotation. The samples were purified using the Diagenode iPure Kit v2 (C03010015) and eluted in 50 µl of buffer C.

Quality control 1 (QC1) was performed by qPCR to detect recovery of the spiked-in methylated and unmethylated A. thaliana DNA. The recovery of methylated A. thaliana DNA should be >20%, unmethylated A. thaliana DNA should be <1% (relative to the input control and adjusted to input control being 10% of the overall sample), and the specificity of the reaction should be >99% (1−[recovery of spike-in unmethylated DNA/ recovery of spike-in methylated DNA] × 100) to proceed. The PCR cycle number for library amplification was determined by qPCR (QC2) and should be <15 cycles to proceed, and the samples were amplified using Kapa HiFi Hotstart Mastermix and NEBNext multiplex oligos, added to a final concentration of 0.3uM. The final libraries were amplified as follows: activation at 95 °C for 3 min, # cycles: 98 °C for 20 s, 65 °C for 15 s, and 72 °C for 30 s, and a final extension of 72 °C for 1 min. The amplified libraries were purified using MinElute columns, then size selected to remove adaptor dimers by either using 3% Nusieve GTG agarose gel and subsequent get cutting, or Pippin Prep (Sage Science) following the manufacturer’s instructions. All the final libraries were then checked at TapeStation (Agilent) for library concentration, correct sizing, then pooled with six other cfMeDIP samples with different NEBNext barcodes. The pool of seven samples (per lane) was sequenced at 150 bp paired-end on Illumina HiSeq X ten.

Publicly available data

The TCGA prostate adenocarcinoma (PRAD) 450 K methylation data (hg19 based) were downloaded from the TCGA Data Portal (, including 50 normal tissue and 489 primary tumor samples. Associated clinical data and normalized gene expression were also obtained. The CPC-GENE (Canadian Prostate Cancer Genome Network) 450 K methylation data from 286 patients with localized prostate adenocarcinoma, matching normalized gene expression and clinic information (hg19 based) were obtained from the previous publication22. Processed whole-genome bisulfite sequencing (WGBS, hg38-based), RNA-seq, and clinical data for 194 Asian patients with localized tumors and matched healthy tissue were obtained from CPGEA46, the Chinese Prostate Cancer Genome and Epigenome Atlas ( Processed WGBS and matched RNA-seq data (hg38-based) for 100 WCDT mCRPC (West Coast Dream Team, metastatic castration-resistant PCa) were obtained from the previous publication21. The WGBS data from WCDT and CPGEA cohorts were converted to methylation values for the same 300 bins used in cfMeDIP-seq data analysis. Methylation value was defined as the total number of methylated counts divided by the total number of (methylated and unmethylated) counts in this 300 bp region. The hg38 genome coordinates were converted to hg19 using liftOver (v1.10.0) R package with a chain file retrieved from the USCS genome browser (

Genes (CTAG1B, TSPY1, MAGEA3, and PAGE1) with promoter hypomethylation in PCa were obtained from the previous publication37. A total of 136 PCa driver genes were obtained from previous publications and DriverDBv3, consisting of 57 oncogenes and 79 tumor suppressor genes38,39. Bins within the promoter region (1 kb upstream of TSS) of these genes were compared in the cfMeDIP-seq data.

Quantification and statistical analysis

Statistical analyzes were performed using R statistical environment (v3.6.1) (R Core Team, 2019). All tests were two-sided unless otherwise specified. The type of test method used for statistical analysis was specified in the text where the results were described and details for the test were explained in the relevant figure legend and method section.

Sequencing data preprocessing

Human genome (hg19/ GRCh37) was downloaded from the University of California Santa Cruz (UCSC) genome browser ( The quality of raw reads was assessed using FastQC60 (v0.11.5) and MultiQC61 (v0.8). Trim Galore (v0.5.0, (“–phred33 –stringency 3 –length 20 -e 0.1”) was used to remove adapters and trim poor-quality sequencing reads. After trimming, the reads were aligned to the human reference genome using BWA62 (v0.7.15) with default parameters. SAMtools63 (v1.3.1) with default settings was used to convert Sam to Bam format, filter out duplicates, sort and index the files and provide mapping statistics for the output. For paired-end data, we filtered for properly paired alignments using SAMtools63 (“-h -f 2 -F 512”).

Fragment size analysis

The fragment sizes for each sample were calculated using the CollecInsertSizeMetrics function from Picard (v2.6.0) ( on the sorted bam files, setting the minimum percentage option to 0.5. The longer fragment ratio was defined as the proportion of the number of reads from 170 bp to 210 bp to the number of reads from100bp to 210 bp.

For fragmentation profiles, customized scripts from previous study36 ( were applied on sorted bam files to calculate fragment ratios in 5-Mb bins.

Calculation of DMRs

DMRs between metastatic (67 VPC cohort samples) and localized (30 CPC cohort samples) PCa were identified using DESeq264 (v1.24.0) while controlling for age differences. Before detecting DMRs, the count generated by MEDIPS65 (v1.34.0) was first converted into reads per kilobase per million mapped reads (RPKM) using the total number of reads as the library size. Only bins with higher than 5 RPKM in at least one sample across all PCa samples were retained (non-low coverage bins). The raw counts of these non-low coverage bins were used as input for DESeq264. Bins with Benjamini-Hochberg adjusted p-value <0.05 and absolute fold change greater than 2 were nominated as DMRs.

DMRs annotation and enrichment analysis

The genomic annotations of DMRs were obtained using the R packages annotatr66 (v1.12.1), TxDb.Hsapiens.UCSC.hg19.knownGene (v3.2.2) and (v3.10.0) from Bioconductor67,68. ChIP-seq data for important TFs were collected from the gene expression omnibus (GEO) (Supplementary Data 4). Customized annotation using the TF ChIP-seq data was performed using GenomicRanges69 (v1.38.0). To assess whether DMRs are enriched or depleted in the annotated regions, adjacent regions of DMRs were first merged using the ‘reduce’ function of GenomicRanges69. Association analysis was then performed by regioneR70 (v1.16.2) with a permutation test (1000 iterations). The 33,740 300 bp non-low bins were used as background regions. P value of 0.05 was used as a cutoff for significance.

Differential gene expression analysis between high and low GR-DMR methylation groups

The most noticeable hyper-DMR site (chr5:142782301-142782600, referred to as “GR-DMR”) in metastasis compared to localized plasma samples is in the promoter of GR (also known as NR3C1) gene. To investigate the transcriptional effect of this site, we used public datasets from the TCGA, CPGEA, CPC, and WCDT cohorts20,21,46,47. Considering that this site has low coverage in most of the samples from these cohorts, the top 10 samples with the highest and lowest methylation values were selected for comparison. Samples were grouped into GR-high and GR-low groups according to methylation levels on the GR site, and differentially expressed genes (DEGs) were determined using matched RNA-seq data. The DEGs were identified using DESeq2 (v1.24.0)64 with –FDR = 0.05, –log2FC = 1”.

For the 450 K DNA methylation array from TCGA PRAD and CPC cohorts22, the beta value of the CpG site overlapping with this DMR region was regarded as the methylation signal of this region. The schematics of the GR gene were plotted using the R package Gviz (v1.30.3).

Gene enrichment analysis

Gene enrichment analysis was performed using TCGAbiolinks71 (v2.14.0) with an FDR of 0.01 as cutoff. For DEGs from tissue RNA-seq data, the upregulated and downregulated genes were analyzed separately in the functions or pathways enrichment analysis. For DMRs from plasma cfMeDIP-seq data, genes with DMRs located within the 5Kb upstream of transcription start sites were used.

Predicting CNAs using plasma cfMeDIP data

To assess the ability of cfMeDIP-seq data on determining CNA events of the data, the CaSpER48 (v0.1.0) R package was used for analysis. Briefly, CaSpER first preforms data smooth on three different length scale. CNA detection was then performed by taking into account both the three-scale smoothed DNA methylation signals and whole-genome allelic shift profiles inferred from plasma cfMeDIP-seq data. On each scale: 1 CNA states (gain, loss, or neutral) were assigned using Hidden Markov Model (HMM). 2 The genome-wide allelic shift profiles were estimated using Gaussian mixture model to correct CNA predictions with relatively low evidence of methylation signals. In our analysis, localized samples from the CPC cohort were used as reference. The final consistent CNA calls were defined as CNA identified by at least six times from all pairwise scale comparisons. For the VPC cohort, inferred CNA states were compared to the gold standard calls from panel sequencing obtained previously11.

Machine learning for diagnostic classification

To evaluate the performance for diagnostic tumor classification based on plasma cfMeDIP-seq, we randomly selected an equal number of unique samples from the localized and mCRPC cohorts as training sets and use the remaining unique patients different from training sets as testing sets. For mCRPC samples from the Barrier and WCDT cohorts, 3 out of 14 and 19 out of 22 unique samples were used, respectively. Feature selection and model construction were performed in training sets, and the model performance was evaluated in testing sets. Briefly, to reduce the impact of technical factors, first, the methylated values were corrected and normalized using sva72 (v3.32.1) and DESeq264 (v1.24.0), considering sequencing batch as a confounder. Second, DMRs between localized and metastatic patients from the training set were identified using DESeq264 (v1.24.0) as described above. Third, the top 150 hyper-DMRs and hypo-DMRs were then selected by measuring information gain and used to build a classification model using randomForest (v4.6.14) ( Finally, the performance of the randomForest classifier was evaluated on the testing set. The AUROC (area under the receiver operating characteristics) curves were estimated using the probability from the random forest model and used for visualization. This procedure was repeated 50 times.

Performance assessment

We used several evaluation metrics to assess the classification performance of localized and metastatic PCa, including sensitivity (1), specificity (2), precision (3), accuracy (4), and F1 score (5). These indicators were also employed in the evaluation of CNA prediction by cfMeDIP-seq profiles.











TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative. Sensitivity (also known as Recall) indicates the fraction of positive patients that are correctly predicted. Specificity (also known as Selectivity) indicates the fraction of negative patients that are correctly predicted. Precision indicates the fraction of correctly identified positive patients to the total identified positive patients. Accuracy indicated the fraction of correctly identified patients to the total observed patients. F1 score is a comprehensive indicator calculated by combining precision and sensitivity, with a higher score representing better performance. AUROC curve was calculated using the R package ROCR (v1.07)73.

Survival analysis

Kaplan–Meier plots were created using survival (v3.1.8) and survminer (v0.4.6), in which p value of survival between two groups was calculated using a log-rank test (cutoff p value = 0.05). For overall DMR analysis, we first calculated a hyper:hypo DMR ratio by dividing the mean methylated values of all hyper-DMRs by the mean methylated value of all hypo-DMRs. mCRPC samples from the VPC cohort were then split into high and low according to the median ratio (0.9786). For the fragment analysis, the fragment value was defined as the fragment ratio of the number of reads from 170 bp to 210 bp to the number reads from 100 bp to 210 bp. Patients were classified into shorter or longer fragment groups based on the median value of the fragment ratio. For GR-DMR-related analysis, we divided the patients into high GR-methylated and low GR-methylated groups according to the median value of the GR-DMR methylation. A similar analysis was performed for mRNA expression between patients with high and low GR gene expression.

The overall survival (OS) and PFS survival were used for the metastatic samples from the VPC cohort, and the OS also was used for the metastatic samples from the WCDT cohort. The biochemical recurrence-free survival was used for the localized samples from CPC and CPGEA cohorts. The clinic endpoints of recurrence-free survival were used for the 498 TCGA samples.

Repeat region analysis

To analyze repeat regions, we used a peak strategy to summarize cfMeDIP-seq data signal. We first filtered bam files using samtools (v1.3.1) to obtain high-quality primary alignments (-F 1804). Next, we used the MACS74 (v2.2.5) “callpeak” function to generate narrowPeak on all samples with –SPMR parameter to generate a normalized pileup file. Pileup files from the peak calling step were converted to bigWig files using the ucsctools (v378). Peak files from all samples were merged to create a peak catalog. Mean signal intensities were summarized for each of the intervals in the peak catalog using bwtool (v1.0) from sample bigWig files.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.