Comparison of sequencing data processing pipelines and application to underrepresented African human populations | BMC Bioinformatics

Literature survey

We reviewed the processing pipelines of 29 HTS studies, 23 of which focus on human populations and six on other mammals (listed in Table 1).

Table 1 List of studies included in the literature survey

We summarized the information for some processing steps in Table 2 (see Additional file 1 for more details): BAM processing (indel realignment and GATK’s BQSR), variant calling (GATK’s HaplotypeCaller (HC) and GenotypeGVCFs or GATK’s UnifiedGenotyper (UG)), and callset recalibration (GATK’s Variant Quality Score Recalibration (VQSR) or hard filtering).

Table 2 Overview of the steps in 29 HTS studies

“BQSR” is a step in the BAM processing pipeline, where base quality scores are recalibrated, to correct for biases due to the sequencing. It requires a set of known variants, for example dbSNP [40].

“Hard filtering” designates a callset filtering strategy where variants are kept or removed depending on user defined thresholds for variants’ annotations of interest. VQSR, on the other hand, is an approach that learns the features of “true” variants and gives a score to the remaining variants. It requires several datasets: a “truth resource” (used here: HapMap 3 and polymorphic sites from the Omni 2.5 M SNP array), a “training resource” (used here: 1000G) and a “known sites resource” (used here: dbSNP). The truth and the training resources are used to train the recalibration model which tries to characterize the relationship between the variants’ annotations and the probability that a variant is a true variant or an artefact. The known sites resource is used to stratify metrics (such as the transition to transversion ratio) between variants found in the known sites resource and new variants. The user then decides on a “tranche threshold”. For example, a tranche threshold of 99.9 means that 99.9% of the variants in the truth set will be included—and all of the variants which have a score as high as these 99.9% will pass the filter. For more background, see [9, 10, 41].

Among the 29 HTS studies, the pipelines are very diverse (Table 2 and Additional file 1). Of the 23 studies on human data, only four have the BQSR, HC + GenotypeGVCFs and VQSR steps; of these, three also ran the indel realignment step with GATK while the fourth did it with another software. Of the six studies on other mammals, two have the BQSR, HC + GenotypeGVCFs and VQSR steps. We observe that the majority of the included studies use GATK for at least one step (20 from 23 human studies, six from six studies with other mammals); however this is possibly an effect of our strategy for selecting studies. 13 of the 23 human studies use GATK for at least two steps. Of the human studies, we could determine with certainty that GATK was used for indel realignment in 11 studies; BQSR in 11 studies; HC for eight studies; VQSR for nine studies; and UG for six studies.

The majority of variants are identical for different BAM processing (for a given set of individuals)

We compared three BAM processing pipelines (pipelines 1, 2 and 3 in Fig. 1). “BP2019” is the output of the Best Practices workflow in 2019. The “BP2015” workflow includes an extra step, indel realignment, corresponding to the Best Practices in 2015. That step is redundant with the HC variant caller (because HC includes local remapping in regions where there seem to be variants) but was not removed from the Best Practices directly after the introduction of HC. The “3mask” workflow has a custom BQSR step (as in [3], and similar to what is done when working with organisms lacking reference datasets [34, 42, 43]). We performed SNP VQSR with a tranche threshold of 99.9 for each of these callsets, resulting in “BP2019vqsred”, “BP2015vqsred” and “3maskvqsred”.

Fig. 1
figure1

Three (plus one) BAM processing and variants calling pipelines. The dashed lines are relative to the comparisons mentioned in the text. Ind. = individuals

The three pipelines were applied to a set of 28 high coverage genomes (average genome depth, with duplicates, directly after mapping: 19.6X–74.6X, mean across individuals: 39.3X, Additional files 2, 3) [8, 21, 44]. The individuals represent five different ancestries, with a focus on Sub-Saharan African ancestries: the dataset includes six individuals with European background; four Yoruba individuals (western Africa); four Dinka individuals (eastern Africa); seven Khoe-San individuals (five Ju|’hoansi, two #Khomani), representing hunter-gatherers from southern Africa; and seven rainforest hunter-gatherers (two Biaka, five Mbuti) from central Africa.

We collected various metrics of the callsets using Picard’s CollectVariantCallingMetrics, before and after VQSR, for the entire callset and for each individual. Some of these metrics are reported in Table 3.

Table 3 Metrics in “BP2019”, “BP2015”, and “3mask”, at callset and individual level, before and after VQSR

Since we do not have a bona fide true callset to compare our results to, we decided to consider “BP2019” as the callset to compare the other callsets to, hence all comparisons are relative to “BP2019” (if not specified otherwise). Before VQSR, the callset for “BP2019” consists of 20,301,167 biallelic SNPs, 85,510 multiallelic SNPs, 2,599,873 simple indels, 737,325 complex indels (see [45] for a definition of complex indels), and 7,975,044 singletons (variants appearing only once in the whole sample—depending on the type, these are a subset of the SNPs or indels) (Table 3). 3.32% of the biallelic SNPs and 5.70% of simple indels are absent from dbSNP v.151. Similar counts are obtained in “BP2015” and “3mask”; the largest difference is for the number of multiallelic SNPs in “3mask”, which is increased by 0.2514% compared to “BP2019” (i.e. count in “3mask” = 1.002514 * count in “BP2019”). “BP2015” and “3mask” have higher counts than “BP2019” for all but the number of singletons, where “BP2015” has a count decreased by 0.0025% compared to “BP2019”. The difference between “3mask” and “BP2019” is larger than between “3mask” and “BP2015”, except for the number of complex indels that shows the reverse tendency.

After running VQSR for SNPs, respectively 3.36% of biallelic SNPs and 11.94% of multiallelic SNPs are filtered out in “BP2019”. The corresponding percentages for “BP2015” are 3.47% and 12.14%, and for “3mask” they are 3.55% and 12.38%. Thus, the fraction of filtered SNPs is larger in “BP2015” and in “3mask” than in “BP2019”. This reverses the tendency of more SNPs in “BP2015” and “3mask” before VQSR: after VQSR, there are less bi- and multiallelic SNPs and less singletons in “BP2015” and “3mask” than in “BP2019”. In fact, “BP2015” has 3.37% more filtered variants and “3mask” 5.71% (these large differences with “BP2019” are a combination of less variants to start with in “BP2019” and a smaller proportion of filtered out variants in “BP2019”). After SNP VQSR, “3mask” has 0.14% less biallelic SNPs, 0.25% less multiallelic SNPs, and 0.10% less SNP singletons than “BP2019”; the corresponding percentages for “BP2015” are 0.11%, 0.22% and 0.12% less than “BP2019”. The proportion of biallelic SNPs absent from dbSNP v.151 decreases to 3.13% in “BP2019”.

We also looked at individual metrics (Table 3 and Additional file 3). On average in “BP2019”, an individual has 4,443,858 (stdev: 438,889.45) biallelic SNPs (by “biallelic SNPs” we mean than an individual’s genotype is different from homozygous reference, at a position with one alternative allele in the callset); 32,358 (stdev: 3,943.62) multiallelic SNPs (same definition as above except for positions with two or more alternative alleles in the callset); 508,132 (stdev: 46,623.55) simple indels; 346,818 (stdev: 25,452.05) complex indels; and 284,823 (stdev: 69,889.22) singletons. The individual with the highest number of biallelic SNPs (4,916,206) is a Ju|’hoansi (SGDPJUH1) while the individual with the lowest number of biallelic SNPs (3,442,414) is a French sample (HGDPFRE4). Similarly, it is always a French sample that has the lowest counts for multiallelic SNPs, simple and complex indels and singletons. A Khoe-San individual has the highest counts for multiallelic SNPs and simple indels; a Biaka individual (rainforest hunter-gatherer) has the highest count for the number of singletons; and, surprisingly, a non-African, the 1000GCEU2 individual (European ancestry from Utah, 1000 Genomes dataset) has the highest count of complex indels.

Comparing “3mask” and “BP2015” to “BP2019” we observed similar patterns for averages per individual as for the entire callset: in general, higher counts in “3mask” and “BP2015” (except for the count of multiallelic SNPs and singletons in “BP2015”), the largest difference being an increase of 0.2103% for the average number of complex indels in “3mask”. The increase in variants per individual in “3mask” compared to “BP2019” is significant for the five types of variants considered (one-sided paired t-test, p-values: 2*10–12 for biallelic SNPs, 1*10–11 for multiallelic SNPs, 2*10–4 for singletons, 3*10–5 for simple indels and 2*10–7 for complex alleles). The increase in “BP2015” compared to “BP2019” is significant for three types of variants (one-sided paired t-test, p-values: 3*10–10 for biallelic SNPs, 4*10–5 for simple indels and 2*10–12 for complex alleles); for multiallelic SNPs and singletons there is no significant difference. The SNP VQSR filter removes more variants in “3mask” and “BP2015” than in “BP2019”: 0.5827% more filtered SNPs in “BP2015” and 2.8252% more in “3mask”. Consequently, after SNP VQSR the average number of bi- and multiallelic SNPs and singletons are highest in “BP2019”: “3mask” has 0.0534% less biallelic SNPs, 0.1559% less multiallelic SNPs, and 0.10% less singletons than “BP2019”. The corresponding percentages for “BP2015” are 0.0143%, 0.0886%, and 0.12% (less than “BP2019”). The decrease in variants in “3mask” and “BP2015” compared to “BP2019” is significant (one-sided paired t-test, p-values: respectively 2*10–10 and 5*10–4 for biallelic SNPs, 2*10–11 and 6*10–8 for multiallelic SNPs, 4*10–4 and 9*10–4 for singletons).

The similarities of counts for different features, for the entire callset and by individual, after three different ways of processing BAM files, suggest that the callsets are similar. We investigated this using GATK CombineVariants. Figure 2A and Additional file 4A show the partitioning of all variants (SNPs and indels). Before VQSR, the majority of the variants (99.82% of all variants combined) are identified by the three approaches. In particular, 99.94% of “BP2019” variants are in the intersection. The next largest fraction is variants found only in “3mask” (20,234 variants or 0.0845% of the combined variants). The pair of VCFs sharing most variants is “BP2015” and “BP2019”, followed by “3mask” and “BP2015”. The “3mask” approach results in the most private variants and “BP2015” the least. When summing all variants for each of the approaches—based on the CombineVariants output—we obtain higher counts than those reported in Table 3. This is due to complex variation, for example at the same position one of the VCFs has a SNP and the other has an indel. We verified the patterns described above by using GATK CombineVariants on the biallelic SNPs only (Fig. 2B, Additional file 4B). The same patterns are observed.

Fig. 2
figure2

Most variants are common to “BP2019”, “BP2015” and “3mask”. Venn diagrams of the variants in “BP2019”, “BP2015” and “3mask” before VQSR (A and B) and after VQSR (C and D). The diagrams are not to scale. The percentages in parenthesis represent the percentage of all variants combined which are in the intersection. A All variant sites, before VQSR. B Biallelic SNPs, before VQSR. C All variant sites, after VQSR. D Bbiallelic SNPs, after VQSR

Finally, we performed the same analysis after VQSR. The results for all variants are in Fig. 2C and Additional file 4C, and for biallelic SNPs in Fig. 2D and Additional file 4D. Similar tendencies are observed for all variants and for biallelic SNPs. Considering biallelic SNPs, 96.23% of all variants are retained in the three VCFs after VQSR; 3.29% are removed from the three VCFs. The remaining 0.49% are variants found in only one VCF or variants found in two or three VCFs that have different filtering status. After VQSR, “BP2019vqsred” has almost three times more private biallelic SNPs (33,088) than “3maskvqsred” and “BP2015vqsred” (respectively 12,879 and 12,124). The same pair of VCFs than before VQSR share the most variants: “BP2015” and “BP2019”.

The overlap between the three pipelines is larger when restricting the analysis to regions of the genome accessible to short-read sequencing (1000 Genomes accessibility mask) (Additional file 5). Before VQSR (Additional file 5A, B), the results are qualitatively similar to those presented in Fig. 2 (e.g. most private variants in “3mask”). After VQSR (Additional file 5C, D), the tendencies are different, with roughly three times as many private variants in “3maskvqsred” and “BP2015vqsred” than in “BP2019vqsred”.

We were interested in a possible effect of the population background (or ancestry) on the differences between the callsets. We plotted the difference in total number of SNPs by individual (kept after VQSR) between “3maskvqsred” and “BP2019vqsred”. Figure 3A shows the corresponding boxplots for each ancestry. The medians are similar in the five ancestries, and there is less variation in the Khoe-San and in the rainforest hunter-gatherers (RHG). When plotting according to dataset (Fig. 3B) the effect is much clearer. The difference between “3mask” and “BP2019” is smallest for the individuals from [44]-referred to as HGDP dataset- (average: + 0.004%), followed by the “1000 Genomes” two individuals (average: − 0.038%), and finally the individuals from the Simon Genome Diversity Project (SGDP) dataset (average: − 0.068%). Similar tendencies are observed for the difference in total number of indels by individual: the dataset impacts more the difference than the ancestry (Additional file 6). On the other hand, another metrics, the percentage of known variants (i.e. present in dbSNP v.151), seems to depend rather on the ancestry than on the dataset (Additional files 7, 8).

Fig. 3
figure3

Differences in number of SNPs per individual are explained by dataset rather than ancestry. Boxplots of the difference between the number of SNPs per individual in “3maskvqsred” and “BP2019vqsred”, in percentage of “BP2019vqsred” (a negative percentage indicates more variants in “BP2019vqsred”). A Individuals are grouped by ancestry. B Individuals are grouped by dataset

The callset is impacted by the number of individuals at the joint genotyping step

One specificity of the GATK Best Practices is that the BAM pre-processing and the initial variant calling (HC) is run by individual. Only the joint genotyping step (GenotypeGVCFs) and downstream analyses (for example VQSR) are performed for the entire cohort at the same time. We compared the variant counts for our 28 individuals, first when the joint genotyping is done only for these 28 individuals (“3mask”), second when joint genotyping is done in a larger cohort (179 individuals) and the 28 individuals are extracted (“3mask + 28”, see Fig. 1). Note that this analysis was not done with the GATK Best Practices (i.e. not with “BP2019”). Metrics are reported in Table 4.

Table 4 Metrics in “3mask” and “3mask + 28”, at callset and individual level, before and after VQSR

Before VQSR, there are more variants in the “3mask + 28” callset than in the “3mask” callset (+ 0.60% for SNPs and + 1.23% for indels). This is also observed at the individual level, though to a smaller extent (+ 0.09% for SNPs and + 0.54% for indels). For SNPs, the increase is larger for multiallelic SNPs -i.e. SNPs that have more than one non-reference allele in the subset of 28 individuals- (for example before VQSR for the entire callset: + 0.60% for bi- and + 9.70% for multi-allelic SNPs). For indels on the other hand, the increase is due solely to more complex indels—there is a decrease in the proportion of simple indels. After SNP VQSR, we observed less biallelic SNPs in “3mask + 28” than in “3mask” at the callset level (− 0.23%). The number of multiallelic SNPs remains higher in “3mask + 28” (+ 6.43%). At the individual level, both the number of bi- and of multiallelic SNPs remain higher in “3mask + 28” (respectively 0.12% and 3.57%).

In the same way that we compared the variants in “BP2019”, “BP2015” and “3mask”, we investigated whether similar sets of variants were found in “3mask” and “3mask + 28”. Before filtering, 98.66% of the combined variants are called in the two VCFs. 1.06% are called only in “3mask + 28” and about four time less (0.28%) are called only in “3mask”. After SNP VQSR, 94.73% of the combined variants pass in the two callsets and 2.84% fail in the two callset. 1.46% of variants were found with both approaches but have different filtering outcomes.

Thus, it appears that in general, the two approaches call the same variants; with slightly more variants when there are 179 individuals at the joint genotyping step rather than 28. In particular, there is an increase of multiallelic SNPs and complex indels. However, for biallelic SNPs the picture changes after SNP VQSR at the callset level, with slightly less biallelic SNPs in “3mask + 28” than in “3mask” at the callset level.

For the SNPs kept after VQSR, some of the multiallelic variants in “3mask + 28” are biallelic in “3mask” (from 2.1% for chromosome 6 to 18.9% for chromosome 1, average when summing across chromosomes: 6.5%). These variants have a higher missingness than SNPs in general (18.15% average missingness, versus 1.44% average missingness for SNPs kept after VQSR in “3mask”). We investigated these sites more closely to determine which allele is called in place of the third allele in “3mask” (the reference or the first alternate allele). We looked at the number of alleles as a proxy for that, but we did not check the genotypes at an individual level. 6.1% of the variants have complex patterns, e.g. different numbers of alleles genotyped in the two VCFs (annotation “AN”) or three alternate alleles in “3mask + 28”. Another 84.4% of the variants have the reference allele called in “3mask” (for at least one of the second alternate allele copies); and the remainder, 9.5%, have the alternate allele called in “3mask”. The mean number of copies of the second alternate allele (in “3mask + 28”) did not differ significantly between the sites where the reference respectively the alternate is called (1.15 respectively 1.14 alleles, Student’s t-test: p value 0.6126); nor did the mean number of genotyped alleles (46.69 respectively 46.27 alleles, Student’s t-test: p value 0.4106). We conclude that there is a bias towards the reference allele at these sites, but note that these sites have higher than average missingness and are likely difficult to sequence, map or call.

The same approach could be applied to indels, though it is more complicated as the indels that differ between the two callsets are often in complex regions (for example with several indels in a row).

Individual coverage might impact the number of variants

When possible, it is recommended to work with “high coverage” (or high depth) data. However, coverage can vary a lot between and within studies, which can potentially lead to biases. Here, we examined the correlation between individual coverage and number of SNPs (Fig. 4, Spearman’s rank correlation test, rho: 0.18, p value: 0.3576). We started by testing whether having an average depth above 30X (referred to as “> 30X”) or below or equal to 30X (referred to as “≤ 30X”) has an impact on the total number of SNPs after VQSR (for this we used “BP2019vqsred”). Five individuals, one from each of the populations, are in the “≤ 30X” category. The minimum coverage is ~ 19X. The difference in mean number of SNPs between the two groups is not significant (Wilcoxon rank-sum test, p value: 0.07112). As suggested by Fig. 4A, a confounding factor could be the population background: we know that the number of SNPs is greater in African than in non-African individuals. We performed the Wilcoxon rank-sum test in each population; the difference in mean number of SNPs between the two groups was not significant in any of the five populations.

Fig. 4
figure4

The total number of SNPs by individual is a function of coverage and ancestry. Total number of SNPs (bi- and multiallelic) per individual in “BP2019vqsred”. The y-axis starts at 3,400,000 SNPs. A Coloured by ancestry (the dots from a given ancestry are connected by lines). B Coloured by dataset

Another limitation with “BP2019vqsred” could be the sample size. We performed the same test in a larger dataset: the “3mask + vqsred” dataset—same processing as in “3mask” but over 100 individuals at the joint genotyping step (Fig. 5, Spearman’s rank correlation test, rho: 0.3887117, p value: 9.713e-08). In this dataset, there is a significant correlation between coverage and number of SNPs called. The difference in mean number of SNPs between the “> 30X” and the “≤ 30X” samples is also significant (Wilcoxon rank-sum test, p value: 0.000136). There are two differences between “BP2019vqsred” and “3mask + vqsred”: number of individuals and processing. To rule out that the significance in “3mask + vqsred” and the non-significance in “BP2019vqsred” is due solely to the difference in processing, we did the same test in “3maskvqsred” (same processing as “3mask + vqsred” for the same set of individuals as “BP2019vqsred”); here the test is not significant (p value: 0.08204), i.e. the same as for “BP2019vqsred”. Thus it is more likely that the lack of significance of the tests in “BP2019vqsred” and “3maskvqsred” is due to the small sample size.

Fig. 5
figure5

The total number of SNPs by individual in a larger dataset. Total number of SNPs (bi- and multiallelic) per individual in “3mask + vqsred”. The y-axis starts at 3,100,000 SNPs. Dots are coloured according to groups (the dots from a given group are connected by lines). The non-hunter-gatherer Sub-Saharan Africans are not shown as it is a very diverse group with respect to ancestry. RHG = Rainforest hunter-gatherers

Another factor which impacts the number of variants is the population background. In “3mask + vqsred”, the proportion of individuals of non-African ancestry is larger in the “≤ 30X” group (0.42) than in the “> 30X” group (0.11). In order to limit the effect of the population background, we performed a Wilcoxon rank-sum test between the “> 30X” and “≤ 30X” groups considering individuals of African ancestry only. The number of SNPs is significantly different (greater) in the individuals with a coverage above 30X (p value: 0.007375). We note that the Wilcoxon rank-sum test with different coverage thresholds, for example 20X or 40X, are also significant (p value of 0.03323 and 1.461e−05 respectively). When we use a different metrics, the proportion of the genome with a depth of at least 15X, we observe a similar relationship with average genome coverage, but no effect of the ancestry (Additional file 9).

Read more here: Source link