Understanding signatures of positive natural selection in human zinc transporter genes

Datasets and populations

We first compiled whole-genome sequencing data to analyze the patterns of variation in ZTGs on two geographical levels. Thus, we explored a worldwide dataset of 2,328 unrelated individuals representing 24 populations across Africa (AFR), Europe (EUR), East Asia (EAS), South Asia (SAS) and America (AMR), denoted as the 1000GP dataset (for details, see “Methods” section, Supplementary Table S1). We also gathered a South Asian dataset to further describe ZTG variation in a geographical region with well-recognized zinc deficiency in soil19 (for details, see Supplementary Table S2). After PCA and ADMIXTURE analyses, eight genetic homogenous population groups with sample sizes ≥ 30 from this latter dataset were selected to be analyzed for signatures of positive selection (Fig. 1, Supplementary Figs. S1, S2). Thus, the final curated South Asian dataset comprised a total of 1353 individuals (979 of them South Asian) and included four external reference populations (Yoruba in Ibadan, Nigeria (YRI); Utah residents with Northern and Western European ancestry (CEU); Han Chinese in Beijing, China (CHB); and individuals with Mexican Ancestry (MXL)), populations from Pakistan (PAK), Bangladesh (BEB) and Sri Lanka (STU), and five Indian groups: tribal populations speaking Austroasiatic languages (T-AA), Dravidian-speaking tribal populations (T-DR), Tibeto-Burman-speaking tribal populations (T-TB), non-tribal populations speaking Dravidian languages (nT-DR), and non-tribal populations speaking Indo-European languages (nT-IE) (Fig. 1, Supplementary Table S3).

Figure 1
figure 1

Population structure analysis of the South Asian dataset. (a) Principal Component Analysis (PCA) of the curated dataset with (left) and without (right) external reference populations. (b) ADMIXTURE analysis of the curated dataset, K = 6 (CV error = 0.4274). For populations from the 1000 GP belonging to the South Asian region, a subset of 15 samples is represented. Population group abbreviations: T-AA tribal populations speaking Austroasiatic languages, T-DR Dravidian-speaking tribal populations, T-TB Tibeto-Burman-speaking tribal populations, nT-DR non-tribal populations speaking Dravidian languages, nT-IE non-tribal populations speaking Indo-European languages, PAK Pakistan, BEB Bangladesh, STU Sri Lanka, YRI Yoruba in Ibadan, Nigeria, Africa, CEU Utah residents with Northern and Western European ancestry, CHB Han Chinese in Beijing, China, MXL individuals with Mexican ancestry from Los Angeles, California. The full names for all populations within groups are available in Supplementary Table S3.

Population differentiation in ZTGs

We investigated the patterns of genetic differentiation of the complete set of ZTGs, comparing each of the 24 worldwide populations and the five main geographical regions of the 1000GP dataset. For that, we computed SNP FST values for all pairs of populations and geographical regions in the 1000GP dataset and extracted both the highest FST value (Max FST) and the weighted average FST value (WA FST) per gene. Subsequently, a rank test was used to assess whether the mean WA FST (and mean Max FST) of the complete set of ZTGs differed from genome-wide expectations using 10,000 resamplings of 24 randomly matched genes (for details, see “Methods” section, Supplementary Figs. S3, S4). Notably, the ZTGs showed a consistent pattern of higher WA FST (and higher Max FST) than random genome-wide gene sets in all continental pairwise comparisons with Africa, except for East Asia (Fig. 2, Supplementary Fig. S5). Moreover, several individual populations within each geographical region clearly replicate the high genetic differentiation of ZTGs when compared with African populations (Fig. 2, Supplementary Fig. S5, Table S5). Similarly, a greater proportion of highly differentiated SNPs was found in ZTGs when compared to sets of randomly matched genes in several African versus non-African population comparisons. However, in the comparison of global geographical regions, a greater proportion of highly differentiated SNPs in ZTGs was only detected when comparing East Asia and Africa (Supplementary Fig. S6, Table S5). As for the South Asian dataset, ZTGs had a greater proportion of highly differentiated SNPs than random genes in two groups of Indian tribal populations (T-DR and T-TB) when compared with the YRI population. In contrast, analysis of Max FST and WA FST did not reveal any greater genetic differentiation in the complete set of ZTGs when the South Asian and YRI populations were compared (Supplementary Table S6).

Figure 2
figure 2

ZTGs present higher differentiation than randomly matched genes in African versus non-African comparisons. (a) For each FST comparison between geographical regions in the 1000GP dataset, p-values of the corresponding rank test when comparing the average WA FST value of the whole set of 24 ZTGs with that of 10,000 resamplings of 24 genome-wide matched genes. (b) For each FST population comparison in the 1000GP dataset, p-values of the corresponding rank test when comparing the average WA FST value of the whole set of 24 ZTGs with that of 10,000 resamplings of 24 genome-wide matched genes.

The potential contribution of each individual ZTG to the genetic differentiation in the whole gene set was then examined by (i) comparing the FST of each ZTG with the corresponding genome-wide distribution of FST gene values obtained from 5146 reference genes (for details, see “Methods” section, Supplementary Fig. S4), and (ii) for the 1000GP dataset, testing the genetic differentiation of the complete set of ZTGs with the same procedures as described above but removing one of the 24 ZTGs each time. For most African versus non-African comparisons in the 1000GP dataset, SLC30A9 and SLC39A5 appeared as clear outliers in the WA FST analysis, whereas SLC39A4, SLC39A11, and again SLC30A9 were the most consistent outliers when considering the Max FST per gene. SLC39A11 also appeared as a Max FST outlier in several non-African pairwise comparisons as well as when comparing the geographical regions South Asia and Europe (Supplementary Table S5). The removal of any individual ZTG in the 1000GP dataset did not affect the general trend of higher differentiation in ZTGs when comparing African with non-African populations (results not shown) but confirmed that SLC30A9 was clearly the gene contributing most to the observed pattern (for the effects of its removal, see Supplementary Fig. S7). In the South Asian dataset, SLC30A9 and SLC39A5 were replicated as the two main outliers in the WA FST analysis, while SLC30A9 and SLC39A11 were consistent outliers in several of the South Asian groups when considering the Max FST per gene (Supplementary Table S6). However, the Max FST signal for SLC39A4 was not replicated because the SNP responsible was filtered out during the merging of the South Asian dataset.

Other signals of positive selection in ZTGs

Additional signatures of positive selection were explored using the cross-population Extended Haplotype Homozygosity (XP-EHH) and the integrated Haplotype Score (iHS) statistics as well as the Tajima’s D test. As above, we first tested whether the distribution of the statistics computed across the whole set of 24 ZTGs differed from genome-wide randomly matched genes and then determined which individual ZTGs contributed most to the signals.

In the 1000GP dataset, we detected higher Max XP-EHH values in ZTGs than in randomly matched genes in three (CEU, GIH, and MXL) out of the four non-African populations analyzed when using YRI as the reference population. Moreover, the signature of higher Max XP-EHH values observed for the whole set of ZTGs in the GIH population was replicated when using CHB as the reference population. For CEU (and CHB) we also observed a higher proportion of SNPs with outlier XP-EHH values when using MXL as the reference (Supplementary Table S7). However, we did not detect higher Avg XP-EHH values in ZTGs in any of these 1000GP dataset populations. SLC30A9, SLC30A10, SLC39A8, and SLC39A11 were the ZTGs that most contributed to the Max XP-EHH signal (Supplementary Table S7). In agreement with the XP-EHH results in the GIH population, ZTGs had consistently higher proportions of SNPs with outlier XP-EHH values than randomly matched genes in four South Asian groups (i.e., BEB, nT-DR, nT-IE, and PAK) when using CHB as the reference. As previously observed in the GIH population, SLC30A10 and SLC39A11 were the ZTGs that most contributed to the outlier Max XP-EHH signals in these South Asian groups (Supplementary Table S8).

Interestingly, higher iHS values were obtained for ZTGs than for genome-wide randomly matched genes, but only in some East and South Asian populations of the 1000GP dataset (Supplementary Table S9). In particular, ZTGs had higher Avg iHS values in GIH and STU, as well as higher Max iHS in GIH, STU, PJL, and KHV. Notably, the mean Max iHS across the 24 ZTGs in these populations ranged from 2.12 to 2.22, |iHS|> 2 being the threshold usually considered as evidence of recent positive selection at a given locus43. When analyzing the genes contributing most to the iHS signatures, SLC30A9 was a clear outlier for Avg iHS, whereas SLC30A10 and particularly SLC39A11 were the genes that most consistently contributed to the increased Max iHS values of ZTGs. In the GIH population, we also detected a higher proportion of top 1% iHS values for the complete set of ZTGs when compared to randomly matched genes and up to three further individual ZTGs contributing to the Max iHS signature of ZTGs (SLC30A9, SLC39A10, and SLC39A12; Supplementary Table S9). Accordingly, most population groups of the South Asian dataset showed consistently higher iHS values across the complete set of 24 ZTGs compared to randomly matched genes (Table 1). Notably, in the nT-IE population group and the PAK and STU populations, ZTGs displayed not only a higher proportion of SNPs with outlier iHS values but also higher Max iHS and Avg iHS values than randomly matched genes. As above, SLC30A9 was the ZTG contributing most to the Avg iHS signal, while SLC30A9, SLC30A10, SLC39A10, and SLC39A11 were found to consistently contribute to the Max iHS signature detected in the South Asian dataset. Moreover, the Max iHS value in these outlier genes was always ≥ 3.19 (Supplementary Table S10).

Table 1 ZTGs tend to have higher iHS values than randomly matched genes in several South Asian population groups.

When exploring the site frequency spectrum of the whole set of ZTGs across the 24 populations of the 1000GP dataset, we found no enrichment of ZTGs towards Tajima’s D negative values compared to genome-wide randomly matched genes (Supplementary Table S11). However, some of the ZTGs that contribute to the higher XP-EHH and iHS signals detected in ZTGs were also the top outlier genes in the genome-wide distribution of Tajima’s D values obtained with all 5146 reference genes. For instance, SLC39A5 was detected as a clear Tajima’s D outlier in most non-African populations, especially in those of South Asia, whereas SLC30A9 was only found as an outlier for negative values of Tajima’s D in some East Asian populations. Finally, SLC39A7 was detected as a consistent Tajima’s D outlier for negative values in all African populations (Supplementary Table S11).

Replicating evidence for polygenic adaptation in ZTGs

We also used the SUMSTAT test44 to assess whether the sum of the max gene scores of all 24 ZTGs for each statistic replicated their corresponding enrichments in signals of positive selection when compared to randomly matched genes, while also controlling for SNP density as in Daub et al.40. Although this approach was much more statistically stringent, the complete set of ZTGs retained a clear enrichment for higher population differentiation in several African versus non-African population comparisons (Europe, South Asia, and America; Supplementary Fig. S8), and for stronger Max iHS signals in the STU population of the 1000GP dataset (Supplementary Table S9) as well as the C-DR, C-IE, PAK and STU populations of the South Asian dataset (Table 1, Supplementary Table S10). On the contrary, although we detected higher average Max XP-EHH values across ZTGs compared to randomly matched genes in CHB, GIH and MXL, no enrichment for Max XP-EHH signals was replicated with the SUMSTAT test (Supplementary Tables S7, S8).

Identification of candidate genes and variants for selection

We then looked for individual ZTGs that were recurrently identified as outliers across the FST, XP-EHH, iHS, and Tajima’s D analyses in comparison with reference genes matched for gene length, recombination, and gene content (Supplementary Tables S5S7). Out of the 24 human ZTGs, six displayed consistent patterns of variation indicative of strong positive selection across several populations in distinct geographical regions. As expected, these outlier ZTGs comprise previously identified targets for selection in East Asia (SLC30A9, SLC39A8)8,33, sub-Saharan Africa (SLC39A4)36, or found to be widespread across continents (SLC39A11)8. However, we also identified two additional ZTGs with distinctive levels of population differentiation, deviations in the site frequency spectrum, and unusually extended haplotypes, which are therefore proposed as new putative candidates for positive selection in Africa (SLC39A7) and South Asia (SLC39A5) (for details, see Supplementary Note 1).

Furthermore, we also explored the contribution of individual SNPs to the specific patterns of population differentiation and signals of positive selection detected in the ZTGs. For that, we examined the SNP values for each statistic and focused on those with a score above the 99th percentile in at least 95% of the 10,000 resampling sets of 24 randomly matched genes in each individual population or population comparison analyzed (Supplementary Fig. S4). After their corresponding annotation, we selected as candidate SNPs for positive selection those that presented at least one indicator of functionality and/or evolutionary conservation (complete lists for each dataset are provided in Supplementary Tables S12, S13).

Among the genetic variants contributing the most to the extreme African versus non-African differentiation in the SLC30A9 gene, we found many linked eQTLs, one non-synonymous SNP (rs1047626), and five SNPs with CADD Phred Scores greater than 12 (rs2660319, rs15857, rs55835604, rs4861014, rs7660233). The greatest allele frequency differences for these candidate SNPs are found between East Asia and Africa, whereas the intermediate frequencies observed in the South Asian population groups probably explain and allow the capture of the same adaptive event with the iHS statistic. Although we detected no obvious candidate SNP for SLC39A5 and SLC39A7, for SLC39A11 we identified two intronic SNPs with CADD Phred Scores greater than 12 (rs6501559, rs8068946) and several eQTLs presenting extreme allele frequency differences between African and non-Africans. Similarly, most of the SLC39A8 outlier XP-EHH signals identified in CHB (when using MXL as the reference) and several of the SNPs unusually differentiated between some East Asian and South Asian populations were also identified as eQTLs for the gene. At the SLC30A10 region, we identified several eQTLs highly differentiated between GWD and CLM and contributing to the iHS signal detected in Europe and South Asia, but also variants at an intronic ncRNA producing the XP-EHH signatures detected in Europe and in several South Asian groups (when using CHB as the reference). Furthermore, several eQTLs for SLC30A2, SLC30A8, SLC39A3, SLC39A6, SLC39A9, SLC39A10, SLC39A12, and SLC39A14 were identified as additional candidate SNPs probably contributing to the African versus non-African population differentiation in ZTGs. In contrast, the top Max FST gene values detected for SLC39A4 when comparing any African and non-African population pair are caused by the extreme population differentiation of the L372V non-synonymous substitution (rs1871534), which presents a CADD Phred Score of 24.10.

Zinc content of soil as an environmental selective pressure

As human zinc deficiency in India is well recognized22, we also used the South Asian dataset to investigate correlations between the zinc content of soil and the SNP genotype frequencies of ZTGs while considering the genetic structure of the analyzed South Asian population groups. Samβada’s multivariate analysis identified 66 genotypes at 59 SNPs from six ZTGs (SLC30A3, SLC30A4, SLC30A8, SLS39A7, SLC39A9, and SLC39A11) significantly correlated with soil zinc content (Supplementary Table S14). Up to 35 out of these 59 SNPs were eQTLs for SLC39A9, including a UTR5 variant (rs2168241) and an intronic SNP with a CADD Phred Score of 14.97 (rs17106979). One non-synonymous SNP with a CADD Phred Score of 12.79 was also detected for SLC39A7 (rs1547387). Moreover, two of the SNPs with genotypes significantly correlating with soil zinc content (rs3802177 and rs11558471) have been previously associated with type 2 diabetes45, fasting plasma and blood glucose46, glycated hemoglobin levels47, proinsulin levels48, and body mass index49 and are located at the 3′ UTR of SLC30A8, which encodes an islet zinc transporter necessary for proper insulin secretion. In particular, when directly analyzing the correlation between allele frequencies and the zinc content of soil, the derived alleles of these two SNPs, which are in high linkage disequilibrium with each other (r2 > 0.85) and display almost identical frequencies worldwide, both show a significant positive correlation with zinc deficiency in soils (Spearman ρ = 0.60, p = 0.0061 for rs3802177; and ρ = 0.60, p = 0.0065 for rs11558471, respectively; Fig. 3).

Figure 3
figure 3

Allele frequencies for two UTR variants at the SLC30A8 gene. (a) Frequencies for the derived A allele at rs3802177 (left) and the derived G allele at rs11558471 (right) in present-day populations from India plotted against zinc deficiency in soil in their assigned locations (for details, see “Methods” section). Number of individuals per population: GIH (101), LOD (11), RTH (11), PNY (11), ORN (15), BIR (12), DOR (12), ABM (11), ITU (127), SRB (13), CHK (11), TOD (16), IYE (13), MOG (17), SSI (35), BLR (34), MAA (34), AGH (14) and MHR (19). The full names for all populations and their corresponding allele frequencies are available in Supplementary Tables S3 and S15. (b) Worldwide allele frequency distribution for rs3802177 (left) and rs11558471 (right) in the populations of the 1000GP dataset. Blue and yellow indicate the minor and most common allele, respectively. Plots were obtained with the Geography of Genetic Variants Browser (version 0.4 (beta); popgen.uchicago.edu/ggv/).

Read more here: Source link