Genetic determinants and absence of breast cancer in Xavante Indians in Sangradouro Reserve, Brazil

Ethics statement

Authorization from the Fundação Nacional do Índio (FUNAI) was obtained after approval by the Research Ethics Committee of the Faculty of Medicine of the Federal University of Mato Grosso (UFMT) and the National Commission of Research Ethics (authorization #1004/2001). Before the study, written consent was obtained from every subject, or from a legal guardian in the case of minors, and was recorded and archived. Consent also covered the use of any documents necessary for the study of the Xavante population.

Sample selection

Of approximately 500 women in the Sangradouro Reserve (located 270 km east of Cuiabá, the capital of Mato Grosso state, in central Brazil), 182 volunteered for the project (median age 27 years; life expectancy 61.7 years10). We then administered a questionnaire covering comorbidities, number of offspring, breastfeeding, height, weight, and breast pathology. Blood samples from 14 of these women were used for exome sequencing to study this homogeneous ethnic population. This cohort was selected because the Xavante are the least admixed ethnic group among those living close to cities, which gave the researchers good access to support the study; demographic information is listed in Supplementary Table 1. The genetic variants of this cohort were then assembled and analyzed together with the 1000 Genomes Project Phase 3 genomic variation data, which include 2548 control samples (522 white, 671 black, 515 East Asian, 348 Hispanic and 492 South Asian).

Sample collection

All DNA samples were extracted from 250 μl of whole blood using a commercially available kit according to the manufacturer’s instructions (QIAamp DNA extraction kit; Qiagen, Hilden, Germany). After extraction, the concentration of each DNA sample was measured using the PicoGreen dsDNA quantification reagent (Molecular Probes, Eugene, OR, USA) in an Anthos Zenyth 3100 (Anthos-Labtec Instruments GmbH, Austria). Before the DNA concentrations were determined, the linearity of the method was verified for the high-range standard curve (2 μg/ml of Lambda DNA standard) according to the manufacturer’s recommendations.
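The linearity check and read-out of sample concentrations can be sketched as a least-squares fit of fluorescence against the known standard concentrations. This is a minimal illustration only: the function names, the standard dilutions, and the fluorescence readings below are hypothetical and are not taken from the study or from the manufacturer's protocol.

```python
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def concentration_from_signal(signal, slope, intercept):
    """Invert the standard curve to convert a fluorescence reading to ug/ml."""
    return (signal - intercept) / slope


# Hypothetical standards (ug/ml of Lambda DNA) and fluorescence readings.
standards = [0.0, 0.25, 0.5, 1.0, 2.0]
fluorescence = [10.0, 260.0, 510.0, 1010.0, 2010.0]

slope, intercept, r_squared = fit_line(standards, fluorescence)
print(f"slope={slope:.1f} intercept={intercept:.1f} R^2={r_squared:.4f}")
print(concentration_from_signal(760.0, slope, intercept))
```

An R^2 close to 1 indicates that the curve is linear over the range of the standards; the acceptance threshold is left to the manufacturer's recommendations.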

Whole-exome sequencing

Genomic DNA was prepared from 14 Xavante women from Mato Grosso state. A total of 1.0 μg of genomic DNA per sample was used as input material for DNA library preparation. Sequencing libraries were generated using the Agilent SureSelect Human All Exon kit (Agilent Technologies, CA, USA) and were sequenced on an Illumina NextSeq 500 system (Illumina, San Diego, CA) by Novogene Co. Ltd (Chula Vista, California, USA). Reads were mapped to the human genome reference (GRCh38) with BWA11. Variants were called with HaplotypeCaller following the GATK best practices workflow12, and ANNOVAR13 was used for variant annotation and effect prediction. Only variants classified as “PASS” were considered for analyses. Variants that were not reported in dbSNP v147 and had no population frequencies reported in the 1000 Genomes Phase 3 data were considered novel. The functional consequences of nonsynonymous SNVs and splice variants were predicted using PolyPhen-2 (Polymorphism Phenotyping v2), SIFT, LRT, Mutation Taster, Mutation Assessor, and CADD14.
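The two filtering rules in this paragraph (keep only “PASS” calls; label a variant novel when it has neither a dbSNP entry nor a 1000 Genomes frequency) can be sketched as follows. The record layout and the field names `filter`, `dbsnp_id`, and `kg_freq` are hypothetical stand-ins for the annotated output and do not reflect the actual ANNOVAR column names.

```python
def is_pass(variant):
    """Keep only calls whose FILTER status is PASS."""
    return variant["filter"] == "PASS"


def is_novel(variant):
    """Novel: no dbSNP identifier and no 1000 Genomes population frequency."""
    return variant["dbsnp_id"] is None and variant["kg_freq"] is None


def classify(variants):
    """Split PASS variants into (known, novel) lists."""
    known, novel = [], []
    for v in variants:
        if not is_pass(v):
            continue  # failed calls are excluded from all analyses
        (novel if is_novel(v) else known).append(v)
    return known, novel


# Toy records for illustration only.
records = [
    {"id": "v1", "filter": "PASS", "dbsnp_id": "rs0001", "kg_freq": 0.12},
    {"id": "v2", "filter": "PASS", "dbsnp_id": None, "kg_freq": None},
    {"id": "v3", "filter": "LowQual", "dbsnp_id": None, "kg_freq": None},
]
known, novel = classify(records)
print([v["id"] for v in known], [v["id"] for v in novel])
```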

Analysis of population admixture

Genomic variation data from the 1000 Genomes Project Phase 3 collection15, which includes 2548 samples from 26 populations, were used for population admixture analysis together with our 14 samples. A total of 291,984 SNPs shared among these populations and also present in at least 7 of our samples were used to generate principal component analysis (PCA) plots after applying a linkage-disequilibrium-based variant pruner as implemented in PLINK16. Supplementary Fig. 2 shows the individual ancestry proportions estimated with ADMIXTURE17 for K = 4 to K = 11 (the parameter K is the hypothesized number of subpopulations that make up the total population) in all 1000 Genomes samples and the 14 Xavante samples. The fit of different values of K was assessed using cross-validation (CV), and K = 9 showed the lowest CV error (Supplementary Fig. 3).
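Two of the selection rules in this paragraph are simple enough to sketch: keeping SNPs that are called in at least 7 of the 14 Xavante samples, and choosing the K with the lowest cross-validation error. The data structures and function names are assumptions made for illustration, and the values in the example are placeholders, not results from the study; the pruning, PCA, and ancestry estimation themselves were done with PLINK and ADMIXTURE.

```python
def snps_with_enough_calls(genotypes, min_samples=7):
    """Return SNP ids with a non-missing genotype in at least min_samples samples.

    genotypes maps a SNP id to a list of per-sample calls, with None for missing.
    """
    kept = []
    for snp_id, calls in genotypes.items():
        n_called = sum(1 for call in calls if call is not None)
        if n_called >= min_samples:
            kept.append(snp_id)
    return kept


def best_k(cv_errors):
    """Return the K whose cross-validation error is lowest."""
    return min(cv_errors, key=cv_errors.get)


# Placeholder inputs: 14 samples per SNP, and made-up CV errors per K.
toy_genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 0, 0] + [None] * 7,  # called in 7 samples: kept
    "snp_b": [1, 1, 0, 2, 0, 1] + [None] * 8,     # called in 6 samples: dropped
}
toy_cv_errors = {4: 0.52, 5: 0.50, 6: 0.49}

print(snps_with_enough_calls(toy_genotypes))
print(best_k(toy_cv_errors))
```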

Polygenic risk estimation

Khera et al. developed a genome-wide polygenic score for breast cancer risk estimation that comprises 5218 common (allele frequency > 1%) variants18. This score is based on association statistics for millions of variants derived from previously published genome-wide association studies of up to 105,974 individuals with breast cancer and 122,977 control subjects19. We applied this computational algorithm to generate a polygenic score for each sample that integrates the cumulative impact of all available variants. For comparison, we also analyzed germline mutation data from normal blood samples of 10 lobular and 10 ductal cases in the TCGA (The Cancer Genome Atlas) breast cancer dataset. To minimize potential confounding from whole-genome sequencing performed in separate batches for the Xavante and TCGA cohorts, we assembled a single joint variant call set across all samples in this analysis, starting from raw reads. After application of stringent sequencing quality control parameters, 2171 of the 5218 variants (41.6%) were available for scoring, and the polygenic score was calculated for each of our samples as well as for the TCGA breast cancer normal control samples.
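A polygenic score of this kind is conventionally a weighted sum of effect-allele counts over the variants that could be scored. The sketch below assumes that form; the function name, the dictionaries, and the toy weights are hypothetical, and the real weights come from the published score rather than from this example.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of effect-allele dosages over the variants available in a sample.

    weights: variant id -> per-allele effect weight from the published score.
    dosages: variant id -> effect-allele count (0, 1 or 2) for variants passing QC.
    Returns (score, number of variants used).
    """
    used = [variant for variant in weights if variant in dosages]
    score = sum(weights[variant] * dosages[variant] for variant in used)
    return score, len(used)


# Toy example: three weighted variants, two of which passed QC in this sample.
toy_weights = {"var1": 0.5, "var2": -0.2, "var3": 0.1}
toy_dosages = {"var1": 2, "var3": 1}
score, n_used = polygenic_score(toy_weights, toy_dosages)
print(score, n_used)

# Coverage reported in the text: 2171 of 5218 variants available for scoring.
print(f"{2171 / 5218:.1%}")
```

Because only a subset of the variants is scored, scores are comparable only between samples evaluated on the same variant set, which is why the text builds a single joint call set.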

Mutation load analysis

Germline genomic variation data, including SNVs and short indels, from our cohort, 1296 normal female samples from the 1000 Genomes Project, and 200 randomly selected TCGA normal blood samples (100 from ductal and 100 from lobular cancer cases) were analyzed and annotated together. Exonic or splicing variants with a population frequency below 1% that were predicted to be damaging by at least 2 of the methods listed above were retained. We then compared the number of genes carrying these potentially damaging variants among all sample populations. Because we were also interested in the relationship to breast cancer risk, we further examined the mutation status of known breast cancer risk genes in all samples using the list recommended by the National Breast Cancer Foundation20, which includes the ATM, BARD1, BRCA1, BRCA2, BRIP1, CASP8, CDH1, CHEK2, CTLA4, CYP19A1, FGFR2, H19, LSP1, MAP3K1, MRE11A, NBN, PALB2, PTEN, RAD51, STK11, TERT and TP53 genes.
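The retention rule (exonic or splicing, population frequency below 1%, damaging according to at least 2 predictors) and the per-sample gene count can be sketched as below. The record fields `region`, `pop_freq`, `predictions`, and `gene` are hypothetical, and treating a missing frequency as rare is an assumption of this sketch rather than something stated in the text.

```python
def is_potentially_damaging(variant, max_freq=0.01, min_tools=2):
    """Apply the three retention criteria to one annotated variant."""
    in_region = variant["region"] in {"exonic", "splicing"}
    freq = variant["pop_freq"]
    rare = freq is None or freq < max_freq  # assumption: unreported frequency counts as rare
    n_damaging = sum(1 for is_damaging in variant["predictions"].values() if is_damaging)
    return in_region and rare and n_damaging >= min_tools


def genes_with_damaging_variants(variants):
    """Set of genes carrying at least one retained variant in a sample."""
    return {v["gene"] for v in variants if is_potentially_damaging(v)}


# Toy sample with three annotated variants.
sample = [
    {"gene": "BRCA2", "region": "exonic", "pop_freq": 0.002,
     "predictions": {"SIFT": True, "PolyPhen-2": True, "LRT": False}},
    {"gene": "ATM", "region": "exonic", "pop_freq": 0.15,
     "predictions": {"SIFT": True, "PolyPhen-2": True, "LRT": True}},
    {"gene": "PTEN", "region": "intronic", "pop_freq": 0.001,
     "predictions": {"SIFT": True, "PolyPhen-2": True, "LRT": True}},
]
load = genes_with_damaging_variants(sample)
print(len(load), sorted(load))
```

The mutation load compared between populations is then the size of this gene set for each sample.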

Statistical analysis

Two-sided Wilcoxon–Mann–Whitney tests (5% type I error) were used to compare polygenic risk scores and mutation load between groups. Analyses were performed using R.
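For readers unfamiliar with the test, the sketch below computes the Mann-Whitney U statistic with midranks for ties and a two-sided p-value from the normal approximation (no continuity correction). It is an illustration in Python, not the R code used in the study, and it is a simplification: statistical packages typically use an exact distribution for small samples without ties, so p-values can differ slightly.

```python
import math


def midranks(values):
    """Return 1-based ranks (ties get the average rank) and the tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_two_sided(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return u1, 1.0  # all observations identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Toy groups for illustration only.
u, p = mann_whitney_two_sided([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(u, p, p < 0.05)
```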
