Sequencing of Shorea leprosula genome
Leaf samples of S. leprosula were obtained from a reproductively mature (diameter at breast height, 50 cm) diploid tree B1_19 (DNA ID 214) grown in the Dipterocarp Arboretum, Forest Research Institute Malaysia (FRIM).
Genomic DNA was extracted from leaf samples using the 2% cetyltrimethylammonium bromide (CTAB) method90 and purified using a High Pure PCR Template Purification kit (Roche).
Library preparation and sequencing
Paired-end (170, 500, and 800 bp) and mate-pair (2 kb) genomic libraries were prepared using a TruSeq DNA Library Preparation kit (Illumina) and a Mate Pair Library Preparation kit (Illumina), respectively. Mate-pair libraries with larger insert sizes were constructed using a Nextera Mate Pair Library Preparation kit (Illumina). Ten micrograms of genomic DNA were tagmented in a 400 μl reaction and fractionated using SageELF, with the recovery of 11 fractions with 3–16+ kb. Each fraction was circularized and fragmented with a Covaris S2. Biotin-containing fragments were purified using Dynabeads M-280 streptavidin beads. Sequencing adapters (KAPA TruSeq Adapter kit) were attached using a KAPA Hyper Prep kit. The libraries were amplified for 10–13 cycles and purified with 0.8× AMpure XP. DNA libraries were then sequenced (~388× coverage) using Illumina HiSeq2000 (TruSeq libraries) and HiSeq2500 (Nextera libraries) at the Functional Genomics Center Zurich (FGCZ), University of Zurich, Switzerland (Supplementary Table 1).
Adapters and low-quality bases for all paired-end and mate-pair reads were removed using Trimmomatic91. The filtered paired-end reads of the 170 bp library were used to identify the genome size using k-mer distribution generated by Jellyfish92 that was implemented in the scripts by Joseph Ryan42. The raw R1 reads from paired-end 170 and 800 bp libraries (clipped at 95 bp, representing about 70 genome equivalents) were used to estimate the heterozygosity using KAT43 with a k-mer size of 23 nt. De novo genome assembly of all reads was performed using ALLPATHSLG assembler v5248840.
Assembly verification and assessment of the assembled genome
To validate the genome assembly, we mapped (i) the short reads used for the genome assembly, (ii) scanned the assembly for the presence of single-copy orthologs, and (iii) mapped transcriptome sequences obtained from seven organs.
Assembly verification by mapping of short reads
For each library used for genome assembly, all trimmed reads were aligned to the assembled S. leprosula genome using Burrows–Wheeler Aligner (BWA) v0.7.1293. Then, mapping ratio was calculated for each BAM file using Samtools94 with “flagstat” command.
Identification of highly conserved single-copy orthologs
BUSCO v3.1.042 was run with the Embryophyta dataset and Arabidopsis as the species for AUGUSTUS prediction (see subsection below “Protein-coding gene prediction”).
Assembly verification by mapping transcriptome sequences
For mapping transcriptome sequences, samples of seven organs (leaf bud, flower bud, flower, inner bark, small seed, large seed, and calyx) were obtained from the S. leprosula individual used for the genome sequencing (Supplementary Table 2). Total RNA was extracted from each sample using RNeasy Plant Mini Kit (Qiagen) and it was treated with Turbo DNase I (Takara). Library preparation was carried out using a TruSeq RNA Library Preparation kit (Illumina). Paired-end sequencing was conducted for all the libraries using Illumina HiSeq2000 at the FGCZ, University of Zurich, Switzerland. Adapters and low-quality bases for all paired-end reads were removed using Trimmomatic. The trimmed sequences of each library were mapped onto the assembled genome using STAR aligner v2.4.2a95, and mapping ratio was obtained from the output file of STAR.
Repeat sequence analysis
Both homology-based and de novo prediction analyses were used to identify the repeat content in the S. leprosula assembly. For the homology-based analysis, we used Repbase (version 20120418) to perform a TE search with RepeatMasker (4.0.5) and the WuBlast search engine. For the de novo prediction analysis, we used RepeatModeler to construct a TE library. Elements within the library were then classified by homology to Repbase sequences (see subsection below “Preparation of repeat sequences for evidence-based gene prediction”).
Protein-coding gene prediction
S. leprosula protein-coding genes were predicted by AUGUSTUS v3.245. For ab initio gene prediction, we used a pre-trained A. thaliana metaparameter implemented in AUGUSTUS. For the evidence-based gene prediction, we used the information of exon, intron and repeat sequences of S. leprosula as hints for the AUGUSTUS gene prediction. The details of the preparation of the hints were described in the following subsections.
Preparation of repeat sequences for evidence-based gene prediction
We used RepeatModeler to construct a de novo library of repeated sequences in the S. leprosula assembly. Then, using RepeatMasker, we generated a file containing the information of the positions of repeat sequences in the S. leprosula genome based on the RepeatModeler library. Elements within the library were then classified by homology to Repbase sequences. Finally, the hint file for repeat sequences in GFF format was prepared using the two scripts, “10_makeGffRm.pl” and “12_makeTeHints.pl”, stored in gitlab.com/rbrisk/ahalassembly.
Preparation of the exon and intron information for evidence-based gene prediction
To obtain the exon and intron hints, we used the mapping data of RNA-seq obtained from seven organs of the sequenced S. leprosula individual as described above. First, we merged all the mapping data stored in different BAM files into a single BAM file using SAMtools. Then, we prepared the intron hint file in GFF format using the, “bam2hints” script of AUGUSTUS. The exon hint file was also generated from the merged BAM file using the two AUGUSTUS scripts, “bam2wig” and “wig2hints.pl”. To conduct evidence-based gene prediction with AUGUSTUS, the three hint files (repeat sequences, intron and exon) described above were merged into a single file in GFF format.
Genome annotation completeness were assessed with BUSCO v3.1.044 using the Embryophyta odb9 dataset composed of 1440 universal Embryophyta single-copy genes. We referred to these 1440 genes as core genes in the main text.
Comparison with the proteome of Theobroma cacao
T. cacao’s gene models18 were downloaded from Phytozome 11 (phytozome.jgi.doe.gov/pz/portal.html). Then, comparison was conducted with BLASTP96 using the T. cacao proteomes as the BLAST database (E-value cutoff: 1.0E-10). Only the best hit was stored for each gene. We considered these best hits of the T. cacao genes as orthologs of the S. leprosula genes. When the T. cacao orthologs were identified by the BLASTP search, the orthologs of A. thaliana were defined based on the T. cacao–A. thaliana orthologous information provided by Phytozome 11 (Supplementary Table 4). When the T. cacao orthologs were not identified, the orthologs of A. thaliana were searched by BLASTP (E-value cutoff: 1.0E-10) using the A. thaliana proteomes obtained from TAIR 10 (www.arabidopsis.org) as the BLAST database.
Based on the result of the above BLASTP searches, we assessed synteny between the S. leprosula scaffolds and the T. cacao chromosomes using MCScanX97. Genome information of T. cacao in GFF format was also obtained from Phytozome 11 as described above, which was used as an input file for MCScanX.
Assessment of the genome assembly
Population data and other dipterocarp species
To assess whether the genome assembly could be used as a reference for the S. leprosula individuals from various populations, we checked mapping ratio, SNP positions, and admixture using the distribution-wide S. leprosula samples. Similarly, to assess whether the S. leprosula assembly could be used as a reference for aligning data from closely related species and determining their mapping ratios. For interspecific analysis, the following three Dipterocarpoideae species: S. platycarpa, D. aromatica, and N. heimii were used (Supplementary Table 7).
Sample collection and DNA extraction
Leaf samples of 19 S. leprosula individuals from different populations and three other dipterocarp species (S. platycarpa, D. aromatica, and N. heimii) were used as described in Supplementary Tables 6 and 7. Genomic DNA was extracted using the same method as described above.
Library preparation and sequencing
Paired-end genomic libraries (200 bp) were prepared using a TruSeq DNA Library Preparation kit (Illumina). DNA libraries were then sequenced (~16× coverage each) using Illumina HiSeq2000.
Mapping and SNP calling
Adapters and low-quality bases from resequencing reads were removed using Trimmomatic. All trimmed reads were then mapped and aligned to the S. leprosula assembly using BWA. Variants were called using GATK v3.598. Duplicated reads were marked using Picard 2.6.0. Within GATK, HaplotypeCaller was used to identify variants for each sample by generating an intermediate genomic variant call format (gVCF). Subsequently, gVCF files were merged using GenotypeGVCFs to produce a raw VCF file containing SNPs and INDELs. Low-quality variants were removed from the raw VCF file by applying the hard filters implemented in GATK. Variants with genotype quality (GQ) < 20 were discarded, to capture confident genotypes with 99% accuracy. INDELs were discarded and only biallelic SNPs were retained for subsequent analysis.
Conservation of the predicted genes in population samples and other dipterocarp species
To check whether the predicted genes are conserved, we used the variant data obtained by resequencing the population samples and three dipterocarp species described above. After variant calling and quality filtering, Beagle v4.199 was used for genotype phasing and imputing missing genotypes. Using in-house scripts, we aligned all genes from the phased data with reference to our predicted genes (.gff3 format). After the alignment, if a gene in a sample had less than 30% of ambiguous regions (missing data or less than 5× coverage), we considered that the gene existed in the sample. Then, if the gene was present in all the sequenced samples, it was considered as conserved.
Estimation of nucleotide diversity, Watterson’s theta and Tajima’s D for the predicted genes
To quantify genome-wide polymorphisms of S. leprosula, two measures were calculated: π, nucleotide diversity, i.e., the average number of pairwise nucleotide differences per site between sequences in a sample100; and θw, intraspecific diversity, which is based on the number of polymorphic sites in a sample of sequences but is independent of their frequency101. The analyses were implemented using the Compute program from the libsequence package102. We also calculated Tajima’s D (D), an index of frequency spectrum101.
For genetic admixture analysis, we used the raw VCF file obtained from GATK as described above. VCFtools103 was used for additional variant filtration. First, we retained variants that were successfully genotyped in 50% of individuals and had a minimum quality score of 30, a minor allele count of 3, and a minimum depth for a genotype call of 3. Subsequently, we restricted the set to variants that were called in a high percentage of individuals (95%), a set mean depth of genotypes of 20, and a minor allele frequency of 0.05. Only biallelic SNPs were retained for subsequent analysis. PLINK v1.9104 was used to convert the filtered VCF format into the PLINK format (.bed/.bim/.fam) as input for ADMIXTURE v1.3105.
Assessment of whole-genome duplication (WGD)
Collinearity dotplot between T. cacao chromosomes and S. leprosula scaffolds (Fig. 2a) were generated by VGCS v2.0106. To visualize two sets (set 1 and 2) of the collinear blocks along the T. cacao chromosomes, we changed the order of the S. leprosula scaffolds and their orientation based on the results of MCScanX under the assumption that there is complete collinearity between the two species and that each S. leprosula scaffold was used only once for the analysis (Supplementary Table 9).
Ks analysis between duplicated genes and between orthologs
To conduct Ks analysis, we first identified duplicated genes and orthologs. Based on the collinear blocks and collinear genes obtained by MCScanX, groups of genes showing a 1:1 or 1:2 relationship between T. cacao and S. leprosula were identified as orthologs. In this study, the two S. leprosula genes within each 1:2 orthologous group were identified as duplicated genes (paralogs) created by the WGD (Supplementary Tables 4 and 10), which we referred to as “WGD-retained duplicates”. In contrast, the S. leprosula genes showing a 1:1 orthologous relationship was defined as “Non-retained genes”. To understand the timing of duplications, we estimated Ks between the duplicates using the S. leprosula genome data and transcriptome data from 10 other dipterocarp species. Furthermore, to understand the timing of the divergence of the species, we estimated the Ks between orthologs using the T. cacao and data from the other dipterocarp species. The details are described in the following subsections.
Sample collection, RNA extraction, and sequencing for the 10 dipterocarp species
We collected calyxes of fruits of the following 10 dipterocarp species in FRIM: Dipterocarpus costulatus, D. aromatica, Dryobalanops oblongifolia, Hopea wightiana, N. heimii, Shorea kunstleri, Shorea sumatrana, Upuna borneensis, Vatica odorata, and Vatica umbonata (Supplementary Table 11). The calyx samples were immersed in RNAlater (Ambion) immediately after harvesting and stored at −20 °C. RNA was extracted using the CTAB method90. DNA was removed with Turbo DNase I (Takara). Purification was conducted using the RNeasy Plant Mini Kit (Qiagen). Paired-end sequencing was conducted for all the libraries using Illumina HiSeq2000.
Transcriptome assembly for the 10 dipterocarp species
Before the assembly of the transcriptome, sequences with low-quality bases were removed using Trimmomatic with a parameter set to “HEADCROP:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36”. Using the trimmed sequences, de novo transcriptome assembly by Trinity assembler (version r20140413p1)107 was conducted for each species with a default parameter. The numbers of reads before and after trimming, and those of the obtained contigs by assembly are found in Supplementary Table 11.
Identification of orthologs for the 10 dipterocarp species
Protein sequences of the genes for the 10 dipterocarp species were obtained with TransDecoder. The reciprocal BLASTP best hits (E-value cutoff: 1.0E-10) between the predicted S. leprosula and each dipterocarp species’ proteins were identified as orthologs.
Estimation of Ks
The Ks between each homologous (orthologous or paralogous) gene pair was estimated as follows. For each gene pair, first, the amino acid sequences were aligned using BLASTP. Then, the alignments were edited by retaining the aligned positions only if the three aligned positions both upstream and downstream did not contain any alignment gaps. Alignments were also retained only if they were longer than 150 aa and covered at least half of the length of both amino acid sequences. When estimating the Ks between orthologous gene pairs in orthologous groups with a 1:2 relationship, the orthologous gene pair producing the longer alignment was used. Nucleotide alignments of the coding sequences were created using the amino acid alignment as a guide, and the Ks was estimated using the coding sequences by CODEML with the Yang and Nielsen model from the PAML package108 with the following parameters: model = 0, NSsites = 0, fix_alpha = 1, alpha = 0, fix_kappa = 0, RateAncestor = 0, CodonFreq = 2. For each of the 10 other Dipterocarpaceae species, Ks was estimated between the hits identified by all against all BLASTP according to the criteria outlined above. When the pairwise Ks are estimated between all paralogs, if a particular gene is duplicated multiple times, the Ks of the same duplication events will be estimated multiple times. As such, when obtaining Ks distributions for the 10 other Dipterocarpaceae species, single Ks estimates representing each duplication event were obtained by clustering the paralogs into gene families based on the Ks estimates as previously described48.
Time estimation of the WGD event
Preparation of a gene set for phylogenetic dating of WGD
Based on the orthologs and paralogs identified above, 204 orthologous gene families were created for each S. leprosula WGD duplicate pair. Starting with S. leprosula WGD duplicate pairs with Ks = 0.2–0.6, a T. cacao ortholog was added if the Ks between the T. cacao and S. leprosula orthologs was 0.5–1.2. For both S. leprosula genes, one ortholog from either D. aromatica or D. oblongifolia, and one ortholog from U. borneensis, V. odorata, or V. umbonata were added if the Ks between the orthologs was 0.05–0.30. If multiple orthologs were present, the gene with the lowest amino acid divergence (Ka, estimated together with the Ks as described above with its S. leprosula ortholog was chosen. Finally, Gossypium raimondii genes identified as collinear orthologs with T. cacao by the PLAZA database109 were added only if the T. cacao gene corresponded to one or two G. raimondii genes. If there were two G. raimondii genes, the gene with a lower amino acid divergence with the T. cacao ortholog based on the amino acid alignment was chosen, as the alignments are more likely to be reliable. Thus, all orthologous gene families contained two S. leprosula genes, one T. cacao gene, one G. raimondii gene, two duplicates of either D. aromatica or D. oblongifolia, and two duplicates of U. borneensis, V. odorata, or V. umbonata (see Fig. 3a). For each orthologous gene family, the amino acid sequences of each gene were aligned using MAFFT version 7110 with the alignment option linsi. The alignments were cleaned by removing poorly aligned positions and divergent regions using Gblocks version 0.9b111, and gene families with a remaining alignment length of at least 100 aa were retained for further phylogenetic dating.
Phylogenetic dating of WGD
Phylogenetic dating was performed on each orthologous gene family using the BEAST package v1.8112 following the method previously described47. Briefly, an uncorrelated relaxed clock model that assumes an underlying log-normal distribution (UCLD) was used, whereas the Le-Gascuel (LG) substitution model113 with gamma-distributed rate heterogeneity across sites using four rate categories114 was set as the underlying evolutionary model. A Yule pure birth process115 was specified for the underlying tree model, and a uniform prior between 0 and 100 for the Yule birth rate was used. An exponential prior with mean 0.5 on the rate heterogeneity parameter, mean 1/3 on the standard deviation of the UCLD clock model, and a diffuse gamma prior with shape 0.001 and scale 1000 on the mean of the UCLD clock model were used. The BEAST files (.xml) that were used to run without data under the two different calibration settings (see below) are provided as Supplementary Data 3. The MCMC analysis for each orthologous gene family was run for 10 million generations while sampling every 1000 generations, resulting in a total of 10,000 samples per family. The topology was fixed according to the widely accepted phylogenetic relationship shown in Fig. 3a. The calibrations and constraints are described in detail below.
The resulting files of each family were processed with LogAnalyser, which is part of the BEAST package, with a burn-in of 1000 samples, and only those with a minimum effective sample size (ESS) of at least 200 for all statistics were retained. For the files retained, the median ages were used to represent the age of each node. Although one family in Setting 2 (FamilyID 85 in Supplementary Table 12) was removed as it had very low (<200) ESS for multiple statistics, all the remaining families had an ESS of more than 200 for all statistics. Subsequently, for nodes 1–5, age distributions of the median age estimates of each family were obtained. Then, the kernel density estimate (KDE) of the ages of all the families was calculated using the R density function, and the mode was used as the consensus age of each node. Finally, in order to obtain 95% CIs of the consensus age of each node, 1000 bootstrap datasets of the age estimates of each family were created, and the mode of the KDE was calculated for each bootstrap dataset, as described in a previous study47. Then, the modes of the 26th and 974th bootstrap density estimate (ranked in order of increasing value of their mode) were taken as the lower and higher 95% CI boundary, respectively.
Calibrations and constraints for phylogenetic dating of WGD
The nodes corresponding to the divergence of T. cacao and G. raimondii (node 2) and the divergence of Shorea and Dryobalanops (node 5) were both constrained based on fossil records. A minimum age of 55.8 Ma was assigned to the T. cacao–G. raimondii node (node 2) based on the fossil from the middle-to-late Paleocene that has been attributed to the Eumalvoideae22,116. A minimum age of 34 Ma was assigned to the Shorea–Dryobalanops node based on fossils from the late Eocene attributed to Shorea74, which, to our knowledge, were the earliest fossils that could be confidently attributed to Shorea. Log-normal prior distributions with the means equal to the minimum fossil age plus 10% were assigned to the T. cacao-G. raimondii and Shorea-Dryobalanops nodes. These correspond to 61.38 Ma for T. cacao–G. raimondii (mean = 1.719, offset = 55.8) and 37.4 Ma for Shorea–Dryobalanops (mean = 1.22, offset = 34), with a standard deviation of 1 for these two nodes63. Although there were no appropriate fossil calibrations that could be assigned to the root node, the divergence between Dipterocarpaceae and Malvaceae has been estimated as ∼70–80 Ma by previous phylogenetic dating studies22,64. To incorporate this information, a normal prior distribution with a mean of 75 Ma and a standard deviation of 8 was assigned to this node as a secondary calibration.
Considering the various uncertainties associated with the fossil and secondary calibrations, an alternative set of calibrations (Setting 2) was used to perform phylogenetic dating. In particular, we considered that the settings described above (Setting 1) may be slightly biased toward producing younger age estimates; therefore, we applied a setting that allows each node to explore age estimates that are older. For instance, the true divergence date can potentially be a lot older than the fossil records used as lower bounds. In addition, a fossil from, e.g., the late Eocene can be anywhere between ∼34 and ∼41 Ma. Thus, assigning a prior distribution with a mean that is only a few million years older than the youngest possible date of the fossil can be argued as being rather restrictive. Similarly, we considered the possibility that previous estimates of the divergence dates between Dipterocarpaceae and Malvaceae are underestimated due to the limited sampling of Dipterocarpaceae and/or the low substitution rates of woody lineages such as Dipterocarpaceae117. In fact, the posterior age distribution of the root node from Setting 1 was a lot older than the prior age distribution. As such, a log-normal prior distribution with the mean age corresponding to 70.7 Ma and a standard deviation of 0.25 (offset = 55.8, mean = 2.7), based on the phylogenetic dating results of Malvaceae118, was assigned to the T. cacao–G. raimondii node, a log-normal prior distribution with the mean age corresponding to 42 Ma and a standard deviation of 0.5 (offset = 34, mean = 2.08) was assigned to the Shorea–Dryobalanops node, and a log-normal distribution with the mean age corresponding to 78.2 Ma and a standard deviation of 0.3 (offset = 55.8, mean = 3.11) was assigned to the root node. The marginal prior densities for each node based on running the MCMC sampler without data are shown in Supplementary Fig. 9 for both parameter settings.
One recent study performed the most comprehensive molecular dating of Dipterocarpaceae to date23. These authors chose not to use any fossil constraints citing difficulties to assign dipterocarp fossils to particular clades, and argued that the fossil ascribed to Shorea74 that we used as a fossil constraint is likely to be of another a species within Anthoshorea. This does not affect our result as Shorea and Anthoshorea share a more recent common ancestor the node 5 that we applied the fossil constraint to. These authors instead used a log-normal distribution with a mean of 87.5 Ma as a calibration point to the most recent common ancestor of Dipterocarpoideae and Sarcolaenaceae, which is more recent than the root node in our study. This is based on a widely cited assumption among dipterocarp researchers that Sarcolaenaceae, which is endemic to Madagascar, diverged from its sister species due to the separation of India and Madagascar ~87.6 Ma119. We chose not to incorporate this age as prior information considering the results of many other plant groups suggesting an important role of dispersal over vicariance in explaining trans-oceanic distributions (see “Discussion”). We note nevertheless that this age is compatible with our results if we assume that this divergence occurred shortly after the divergence of Dipterocarpaceae and Malvaceae.
We also note that some studies have assumed that Vateriopsis, which is endemic to the Seychelles, diverged from its sister lineage containing Vatica, Upuna, and Vateria ~63 Ma by the separation of the Seychelles and India63, leading to much earlier estimates for the divergence between the lineages of Dipterocarpoideae (e.g., ∼80 Ma for node 4 and ∼55 Ma for node 5 or ∼95 Ma for node 4 and ∼70 Ma for node 5)63,120, compared with our estimates of Fig. 3a (~42–50 Ma for node 4 and ~36–40 Ma for node 5). By contrast, the aforementioned comprehensive molecular dating study of Dipterocarpacae reported mean age estimates of 54.9 Ma for node 4 and 43.3 Ma for node 5, but both with posterior density intervals of ~30 Ma23. These estimates are similar to our estimates, and these authors suggested that the divergence of Vateriopsis occurred most likely by long-distance dispersal rather than vicariance.
Characterization of duplicated genes
Previous studies suggested that ancient homologs tend to show slower evolutionary rates at nonsynonymous sites49,50. We assessed whether the Ka, Ks, and Ka/Ks of the WGD-retained duplicates were significantly smaller than those of the non-retained genes between S. leprosula and T. cacao (Malvaceae), S. leprosula and H. wightiana (Dipterocarpaceae), and S. leprosula and U. borneensis (Dipterocarpaceae) using the same approach with CODEML in the PAML package explained above. The distributions of Ka, Ks, and Ka/Ks estimates between S. leprosula and T. cacao were compared between the WGD-retained duplicates and the non-retained genes. We also compared the Ka, Ks, and Ka/Ks of upregulated genes under no-irrigation treatment in section “No-irrigation treatment” below that are the WGD-retained duplicates (“WGD-retained drought-up genes”) with those of the non-retained genes. Statistical analyses were conducted using one-sided Mann–Whitney U tests considering multiple comparisons (P-value cutoff: 0.05 after Bonferroni correction).
Gene ontology (GO) enrichment test for the WGD-retained duplicates
For GO enrichment test, we used the GO information of the A. thaliana orthologs in Supplementary Table 4. The A. thaliana GO terms were downloaded from TAIR 10 (www.arabidopsis.org) on 21 November 2017. The GO enrichment analysis was performed using the BioConductor package topGO121 in R. For enrichment analysis, we adopted the “elim” algorithm together with Fisher’s statistic to test the functions of the retained duplicated genes. The “elim” algorithm scores P-values by considering the topology of GO graphs122. We listed the top 40 significant GO terms identified by the “elim” algorithm method and the P-values obtained (P-value cutoff: 0.05). To consider that different scoring methods may affect the result of significance, we also evaluated the significance of the enriched GO terms by Fisher’s exact test (“classic” algorithm in topGO) and multiple test corrections by false discovery rate (FDR cutoff: 0.05) using the Benjamini-Hochberg procedure.
GO enrichment test for tandemly duplicated genes
To compare with the result of GO enrichment test in the WGD-retained duplicates, we also conducted GO enrichment test for tandemly duplicated genes in the S. leprosula genome. For this purpose, we first identified the tandemly duplicated genes by using the following criteria: (i) neighboring genes on the same scaffold corresponded to the same gene in T. cacao as a result of BLASTP (see above subsection “Comparison with the proteome of Theobroma cacao”); (ii) one of the neighboring genes showed a 1:1 or 1:2 syntenic orthologous relationship with a T. cacao gene (i.e., WGD-retained duplicates or non-retained genes). In this analysis, we considered the genes that were not the tandemly duplicated genes and showed syntenic relationship with T. cacao genes as non-tandem duplicates and used them as a control of comparisons. The procedures of GO enrichment tests for the tandemly duplicated genes were the same as those described above.
Experimental condition of the no water treatment
To confirm the functionality of the duplicated drought-responsive genes in the WGD-retained duplicates, an experiment was conducted on six S. leprosula seedlings grown in the nursery of the Forest Research Institute Malaysia (FRIM). The seedlings were about 2 years old with an average height of 36 cm and an average collar diameter of 3.55 mm. All the seedlings were transferred to a plant growth chamber (Percival PGC-15) with the following conditions—day: 29 °C, 75% humidity; night: 26 °C, 75% humidity; day/night cycle: 12/12 h. Two types of treatment were applied: 50 mL of water daily (control) and no irrigation (artificial drought). Each treatment had three replicates and lasted for 9 days.
Sample collection, RNA extraction, and sequencing for RNA-seq data
Leaves were sampled at day 0 (before no-irrigation treatment, at 09:00) and at day 7 (during the no-irrigation treatment, at 0900). The leaves were immersed in RNAlater (Ambion) immediately after harvesting and stored at −20 °C. RNA was extracted from the leaves using the same method described above. Library preparation was carried out using an Illumina TruSeq Stranded mRNA library preparation kit in accordance with the manufacturer’s recommendations. Paired-end 125 bp sequencing using an Illumina HiSeq2500.
Analysis of RNA-seq data
All data analysis was performed using the SUSHI pipeline123. Paired-end raw sequence reads were combined and mapped onto the S. leprosula genome and the annotation file using STAR aligner87. The mapped reads (Supplementary Table 18) were then counted using the FeatureCounts function of Rsubread124. A quality control step was subsequently performed on the counted reads using CountQCApp from SUSHI. We also checked for the presence of contamination or ribosomal RNA content on our reads using FastqScreenApp from SUSHI. Finally, the genes that were differentially expressed between the two time points were detected using the BioConductor package edgeR125 in R which based on a negative binomial distribution to model the raw read counts in a gene-wise manner and followed by Trimmed Mean of M-values (TMM) method for the sequence depth normalization126.
Using the output from edgeR, we split the data into two groups of upregulated and downregulated genes in response to the dry treatment. We filtered those genes based on the significance-level false discovery rate (FDR < 0.05), obtaining 1200 upregulated genes and 914 downregulated genes. Both the upregulated and the downregulated genes underwent an enrichment analysis using the BioConductor package topGO in R with the same procedures described above.
Comparison between drought-response and duplicated genes
To test whether the drought-response genes obtained in above are enriched in the S. leprosula WGD-retained duplicates, we conducted Fisher’s exact tests using the “fisher.test” function in R. Bonferroni corrections were conducted by considering multiple comparisons.
As a comparison, we also tested whether the drought-response genes are enriched in the tandemly duplicated genes in S. leprosula, by performing Fisher’s exact tests and Bonferroni corrections as described above.
Distributions of Shorea leprosula and Shorea roxburghii, and the precipitation in their habitats
We downloaded the distribution data of S. leprosula and S. roxburghii (a closely related Shorea species to S. leprosula that grows in a more seasonal climate) from the Global Biodiversity Information Facility (GBIF) (www.gbif.org) using the “gbif” function in the R package, “dismo”. We further downloaded the precipitation data of the driest month (BIO14) from WorldClim (worldclim.org) at the resolution of 2.5 min by using the “getData” function in the R package, “raster” for every site of the two species in the GBIF data in the region ranging from −6° to 22° of latitude and from 90° to 120° of longitude. We combined these data to assess the distribution of the driest month for these two species growing in contrasting climates. We further analyzed 30-day cumulative rainfall for 13 years and 2 months from Danum Valley Field Centre in Sabah, Borneo (data downloaded from searrp.org) and Pasoh Forest Reserve, Peninsular Malaysia127 to examine the temporal rainfall and drought patterns of two sites within the distribution of S. leprosula.
Statistics and reproducibility
The data of this genome study was derived from a single diploid individual S. leprosula tree B1_19 (DNA ID 214) located at the Dipterocarp Arboretum at Forest Research Institute Malaysia (FRIM). RNA-seq reads obtained from seven organs used for genome annotation were derived from the same tree. Resequencing data were derived from 19 S. leprosula individuals obtained across its distribution range in Southeast Asia (Peninsular Malaysia, Borneo, Kalimantan, and Sumatra) and three other closely related dipterocarp species (S. platycarpa, D. aromatica, and N. heimii). RNA-seq of 10 other dipterocarp species were obtained from Dipterocarp Arboretum at FRIM for comparative genomics and molecular dating analysis. No-irrigation treatment were conducted using 2 years old S. leprosula seedlings in a plant growth chamber (Percival PGC-15). Two types of treatments were applied: 50 mL of water daily (control) and no-irrigation (artificial drought). Each treatment had three replicates. All statistical tests were conducted using publicly available programs and packages as described in sections under “Methods”. Reproducibility can be accomplished by following the sample used and methods outlined above. Statistical analysis using R were described above in each section.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Read more here: Source link