Benchmarking pipelines for subclonal deconvolution of bulk tumour sequencing data

Somatic point mutation calling suggests Mutect2 is best for subclonal variants

We applied four somatic variant callers (Mutect211, Strelka212, VarScan213 and Lancet14) to a highly mutated tumour dataset (S3R3) at 100% tumour purity across all sequencing depths. Runtimes differed substantially between methods, with Strelka2 being fastest and Mutect2 taking longest (Supplemental Fig. 1). Performance was assessed on both unfiltered and filtered (an additional step attempting to triage mutation calls based on the probability of being a technical artifact, non-somatic or a sequencing error) call sets. Precision-recall curves, using mutation probabilities or scores in thresholding, show low recall rates for all callers, as expected due to the low cellular frequency of many subclones (Fig. 3a). Nonetheless, most mutations with a variant allele frequency (VAF) ≥0.1 were detected by most callers at >100x coverage; the notable exception being VarScan2 (Fig. 3b). Strelka2_filtered achieved the highest F1 score (which conveys the balance between precision and recall) of 0.526–0.738, followed closely by Mutect2_filtered (0.484–0.723), while VarScan2 scored lowest of the filtered call sets (0.246–0.319). The high precisions (0.882–0.896) achieved by Strelka2_filtered pertained to calls within target exon regions, but precision dropped substantially (0.360–0.660) when expanding to calls across the whole genome, suggesting a high false positive call rate in lower coverage regions. The other callers showed similar precisions across exons and the whole genome (Supplemental Tables 1 and 2). Two additional call sets consisting of the union and intersect of calls from the best achieving variant callers, Mutect2_filtered and Strelka2_filtered, were analysed to investigate the benefits of ensemble approaches. Union call sets had slightly improved F1 scores owing to small increases in recall with minimal loss of precision. Intersect call sets, however, had reduced F1 scores resulting from reduced recall (Fig. 3a, Supplemental Table 1). To investigate potential differences between WES and whole-genome sequencing (WGS), we created in silico WGS reads for sample S2R2_B_100x and compared variant calling performance from this with WES for the same sample. Of note, Strelka2_filtered dropped in both precision and recall for WGS compared to WES, whereas Mutect2_filtered maintained a similar performance (Supplemental Table 3).

Fig. 3: Performance of variant callers applied to sequencing data from tumour S3R3 with 100% purity at 30x, 60x, 100x and 250x coverage.
figure3

Each variant calling method was assessed across target exon regions, both pre- and post-filtering (denoted with _filtered suffix). S2_fil-M2_fil_union is the call set resulting from the union of results from Strelka2_filtered and Mutect2_filtered, with the intersect of results from those callers denoted by S2_fil-M2_fil_intersect. a Precisions–recall curves. b Recall according to true variant allele frequency (VAF). c Precision according to called variant allele frequency. Source data are provided as a Source data file.

Subclonal somatic copy number calls are most accurately quantified by FACETS

When used as input for subclonal deconvolution, somatic copy number alteration (CNA) calls must accurately denote regions of chromosomal gain or loss but also the allelic copy numbers. We therefore assessed CNA callers in three ways: (1) purity and ploidy estimates were assessed through direct comparison to true values; (2) heatmaps of predicted and true total and allelic copy numbers were generated to assess the relative performance of methods and the factors affecting them; (3) results from all the CNA callers were used with subclonal deconvolution methods, and the accuracy of the resulting CCF estimates were compared.

We applied four somatic allele-specific CNA callers (Sclust15, FACETS16, Sequenza17 and TITAN18) to samples from all nine simulated tumours at 100 and 75% purity and across sequencing depths of 30x–250x. We further included 50 and 25% purity samples at 250x for three of the tumours. Ploidy and purity estimates were largely accurate, particularly at ≥50% purity, with the exceptions that TITAN often incorrectly assessed purity and Sequenza frequently overestimated ploidy (Fig. 4, Supplemental Table 3). None of the callers were able to reliably detect whole-genome duplications, and instead estimated them as being diploid. This is to be expected as all approaches must use relative depth ratios across the genome to predict more focal CNAs, making whole-genome duplications almost undetectable. Sequencing depth was not found to affect ploidy or purity estimates (Fig. 4, Supplemental Table 3).

Fig. 4: Accuracy of purity and ploidy estimates from tested somatic copy number alteration (CNA) callers.
figure4

Dashed lines denote accurate ploidy and purity estimates, and dotted lines indicate ploidy estimates where the caller misclassed samples as tetraploid instead of diploid, or vice versa, but were otherwise accurate. Source data are provided as a Source data file.

Heatmaps of gains and losses along the genome for predicted and true copy numbers also indicated that sequencing depth had little effect on performance, with the majority of CNAs called at 250x also called at 30x (Supplemental Fig. 2). An exception was with Sequenza, which falsely called multiple loss of heterozygosity regions at 250x only, as a result of over-segmentation. Tumour purity also had a limited effect on results. All methods, and particularly Sclust, failed to detect CNAs present in low-frequency subclones, where the total copy number remains close to two, although FACETS was most sensitive. TITAN frequently falsely called large regions of single allele whole-genome gains or losses, particularly in the S1 samples, which have the lowest variant frequencies. None of the methods were able to detect subclonal whole-genome duplications with the exception of Sequenza which, when it did, identified them as clonal.

CCF estimation is most accurate at higher sequencing depth and following clustering

The performance of five subclonal deconvolution methods (PyClone19, PyClone-VI7, FastClone20, Ccube21 and Sclust15) was assessed on all tumours and sequencing depths, at 100 and 75% purity. We used the Mutect2_filtered variant calls combined with CNA calls, purity and ploidy estimates from all four CNA callers as inputs for subclonal deconvolution (Fig. 2). PyClone, PyClone-VI, FastClone and Ccube were run with CNA inputs from Sequenza, FACETS and TITAN, whereas Sclust was run using only its own CNA calls. PyClone-VI performs clustering using either binomial or beta-binomial distributions so we ran it with both separately. PyClone was unable to complete within 48 h (the time limit on our high-performance computing system) for most of the highly mutated S3 samples and these were, therefore, omitted. FastClone did not converge for 61 of the 216 runs so these were also not included. The performance of each pipeline was determined through comparison of true variant CCFs (calculated by summing the tumour proportions of clones that contained each variant) to both the estimated non-clustered variant CCFs and the CCF of clusters that the variants were subsequently assigned to, where applicable. This enabled inspection of whether accuracy is impacted primarily at VAF normalisation or during clustering.

To provide additional simplistic references upon which CCF estimation methods should improve, we developed baseline datasets via two approaches. For the non-clustered CCFs, “doubled VAFs” was implemented in which CCFs were estimated by simply doubling VAFs, dividing by true purity (which was generally accurately estimated by all CNA calling methods) and limiting to ≤1 i.e. assuming a purely heterozygous diploid genome. As a baseline for comparing the clustered CCFs against, we applied K-means clustering (n = 3) to the doubled VAFs.

The mean absolute differences (MADifs) between true and predicted CCFs are shown in Fig. 5 for both non-clustered and clustered CCF estimates with low values indicating better accuracy. In addition, mean absolute adjusted Rand index is calculated to indicate the accuracy of the variant clustering in each pipeline (Supplemental Table 5). When using non-clustered CCF estimates the best performing pipeline is FACETS with PyClone, though this only marginally improves upon the baseline of doubled VAFs and only at sequencing depths ≥100x (Fig. 5a). When using clustered CCF estimates, the MADif and adjusted Rand index metrics largely agreed, with the best performing pipeline being FACETS with PyClone-VI run with beta binomial used for clustering. The results do show that clustering CCFs improves estimates, and most pipelines improve upon the simplistic baseline, created by K-means clustering of doubled VAFs, at all sequencing depths but, again, only marginally (Fig. 5b, Supplemental Table 5).

Fig. 5: True CCF (x-axis) versus predicted CCF (y-axis) from subclonal deconvolution pipelines applied to tumours sequenced to different depths.
figure5

Plots are separated by predictions based on a non-clustered CCFs, for which ‘Doubled VAFs’ represents the simplistic baseline from which to assess performance; and b clustered CCFs, for which K-means provides the simplistic baseline. N = 18 samples at each depth. MADif: weighted mean absolute difference of true vs. predicted CCFs with the depths of colour corresponding to the value. CCF cancer cell fraction, VAF variant allele frequency. Source data are provided as a Source data file.

The performance metrics did not vary much between pipelines using different CCF estimation methods, with the exception of FastClone which performed worse than any of the others. In contrast, those using FACETS as CNA caller consistently showed improved accuracy and those using TITAN performed the worst. Both non-clustered and clustered CCF estimates increased in accuracy with increasing sequencing depth, whereas, varying the frequency of mutations (which increased from S1 to S3 tumours) did not show a consistent influence (Supplemental Fig. 3).

We next sought to investigate the effects of lower tumour purities on the top three performing tumours (S1R1, S2R1, S2R3), using samples at 100, 75, 50 and 25% purity. This showed a decrease in accuracy with decreasing purity, particularly at 25% (Supplemental Fig. 2), which likely partially results from the poorer purity estimates from CNA callers at that purity. To further assess the robustness of our findings, we ran the pipelines on additional WES datasets for samples with reduced numbers of clones and altered clone topologies, as well as the WGS datasets for the top (S1R1) and worst (S2R2) performing WES samples. In all investigations, the relative performance of pipelines was largely preserved, with FACETS with PyClone-VI_beta-binomial remaining the top performers (Supplemental Figs. 46).

Read more here: Source link