INTRODUCTION
The two initial tasks for any microbiome sequencing data set are (i) taxonomic analysis, that is, determining which organisms are present in a microbiome sample, and (ii) functional analysis, determining which genes and pathways are present. Task (i) can be addressed, to a degree, using amplicon sequencing (targeting 16S rRNA, 18S rRNA, or internal transcribed spacer [ITS] regions, for example). However, species- or strain-level taxonomic analysis, and task (ii), are both best addressed using whole-genome shotgun sequencing, that is, metagenomics proper.
RESULTS
To allow a fair comparison between the two runs, throughout this study we aligned only against the prokaryotic entries of the NCBI-nr database.
Performance on a mock community.
FIG 1
These results indicate that the use of the AnnoTree protein database, in place of the NCBI-nr protein database, is worth pursuing when the goal is to determine the prokaryotic content of a sample.
Performance on 10 different data sets.
TABLE 1
Data set | Accession no. | Total no. of reads | AnnoTree alignments (no.) | AnnoTree (%) | NCBI-nr alignments (no.) | NCBI-nr (%) | Ratio
---|---|---|---|---|---|---|---
River1 | ERR466320 | 646,178 | 410,118 | 63.5 | 406,913 | 63.0 | 1.0
River2 | SRR8859111 | 129,753,222 | 90,535,941 | 69.8 | 88,403,713 | 68.1 | 1.0
Seagrass | SRR6350025 | 98,260,754 | 36,053,215 | 36.7 | 33,717,202 | 34.3 | 1.1
Skin | ERR2538467 | 22,827,626 | 13,403,495 | 58.7 | 14,122,490 | 61.9 | 0.9
Stool | ERR2641811 | 33,214,614 | 29,132,562 | 87.7 | 30,101,313 | 90.6 | 1.0
Soil | SRR7521491 | 97,595,185 | 10,992,188 | 11.3 | 7,264,223 | 7.4 | 1.5
Thermal pools | SRR6344961 | 52,908,626 | 15,751,382 | 29.8 | 16,625,446 | 31.4 | 0.9
Bioreactor1 | SRR9831403 | 99,998,110 | 73,151,916 | 73.1 | 72,806,515 | 72.8 | 1.0
Bioreactor2 | SRR8313048 | 44,258,996 | 36,608,649 | 82.7 | 37,477,641 | 84.7 | 1.0
Bioreactor3 | SRR8305972 | 694,827 | 613,958 | 88.4 | 616,536 | 88.7 | 1.0
Total | | 580,158,138 | 306,653,424 | 52.86 | 301,541,992 | 51.98 | 1.02
a For both the AnnoTree and NCBI-nr protein databases, we report the number and percentage of reads that obtained an alignment using DIAMOND. We also report the ratio between the two numbers.
Out of the total of ≈580 million reads, DIAMOND aligns ≈307 million reads (≈52.9%) against the AnnoTree database and ≈302 million reads (≈52%) against the NCBI-nr database, that is, nearly 1 percentage point more against the former database, even though the latter contains more than twice as many entries.
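The assignment rates and the ratio reported in the last row of Table 1 follow directly from the read counts; a minimal sketch of the arithmetic:

```python
# Totals from Table 1 (all 10 data sets combined).
total_reads = 580_158_138
aligned_annotree = 306_653_424
aligned_ncbi_nr = 301_541_992

# Percentage of reads that obtained a DIAMOND alignment, per database,
# and the ratio between the two counts.
rate_annotree = 100 * aligned_annotree / total_reads
rate_ncbi_nr = 100 * aligned_ncbi_nr / total_reads
ratio = aligned_annotree / aligned_ncbi_nr

print(round(rate_annotree, 2), round(rate_ncbi_nr, 2), round(ratio, 2))
# 52.86 51.98 1.02
```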
Taxonomic binning.
TABLE 2
Classification | AnnoTree assigned | % of reads | % of aligned | NCBI-nr assigned | % of reads | % of aligned | Ratio
---|---|---|---|---|---|---|---
NCBI taxonomy | 305,150,157 | 52.6 | 99.5 | 297,539,333 | 51.3 | 98.7 | 1.0
GTDB taxonomy | 303,770,449 | 52.4 | 99.1 | 282,269,816 | 48.6 | 93.6 | 1.1
EC | 78,874,545 | 13.6 | 25.7 | 76,552,285 | 13.2 | 25.4 | 1.0
eggNOG | 95,932,149 | 16.5 | 31.3 | 87,131,284 | 15.0 | 28.9 | 1.1
InterPro | 142,250,858 | 24.5 | 46.4 | 143,885,580 | 24.8 | 47.7 | 1.0
KEGG | 209,371,499 | 36.1 | 68.3 | 123,130,673 | 21.2 | 40.8 | 1.7
SEED | 102,452,692 | 17.7 | 33.4 | 100,615,086 | 17.3 | 33.4 | 1.0
FIG 2
FIG 3
Functional binning.
FIG 4
Running time.
As the AnnoTree protein database is less than one-quarter the size of the full NCBI-nr protein database, using the former during the alignment step of the DIAMOND+MEGAN pipeline speeds up the analysis. This speedup is offset, very slightly, by the slightly higher number of aligned reads. Note that in the analysis performed here, we aligned only against the prokaryotic content of NCBI-nr, which contains approximately twice as many sequences as the AnnoTree protein database.
TABLE 3
Run | DIAMOND | Meganizer | DIAMOND+MEGAN
---|---|---|---
NCBI-nr | 125,288 min | 2,241 min | 127,529 min
AnnoTree | 61,443 min | 2,404 min | 63,847 min
Ratio | 2.0 | 0.9 | 2.0
a Summarizing all 10 data sets, we show the CPU time used for running DIAMOND, Meganizer, and both combined during either an NCBI-nr run (restricted to prokaryotic sequences) or an AnnoTree run of the DIAMOND+MEGAN pipeline.
Performing an NCBI-nr run using the full NCBI-nr database, not just its prokaryotic part, on each of the 10 data sets takes ≈3 times as long as an AnnoTree run, with an ≈4% increase in assignments to the NCBI taxonomy, on average.
DISCUSSION
As an alternative, projects focusing on the prokaryotic content of microbiome samples can make use of the GTDB taxonomy and the AnnoTree protein database, which are both explicitly designed for this purpose. As the AnnoTree protein database is only one-quarter the size of the NCBI-nr database, DIAMOND alignment of reads against it takes at most half as much time while providing a superior assignment rate.
In this study, we illustrated the performance of AnnoTree runs using 10 example data sets from different environments. The results on these examples confirm that using AnnoTree is a useful alternative to using NCBI-nr.
MATERIALS AND METHODS
Data sets.
TABLE 4
Data set | SRA run ID | Platform | Layout | Total no. of reads |
---|---|---|---|---|
River1 | ERR466320 | LS454 | Single | 646,178 |
River2 | SRR8859111 | Illumina | Paired | 129,753,222 |
Seagrass | SRR6350025 | Illumina | Paired | 98,260,754 |
Skin | ERR2538467 | Illumina | Single | 22,827,626 |
Stool | ERR2641811 | Illumina | Paired | 33,214,614 |
Soil | SRR7521491 | Illumina | Paired | 97,595,185 |
Thermal pools | SRR6344961 | Illumina | Paired | 52,908,626 |
Bioreactor1 | SRR9831403 | Illumina | Paired | 99,998,110 |
Bioreactor2 | SRR8313048 | Illumina | Paired | 44,258,996 |
Bioreactor3 | SRR8305972 | ONT | Single | 694,827 |
Protein reference databases.
A DIAMOND index was then generated using the following command (requiring 150 CPU minutes): diamond makedb --in nr.gz -d nr --taxonmap prot.accession2taxid.gz --taxonnodes nodes.dmp.
For each sequence in the “protein_sequences” table, we constructed a unique two-part accession string by concatenating its “gene_id” and “gtdb_id” values. For example, the protein sequence with gene identifier (ID) AE009439_1_1 and GTDB genome ID GB_GCA_000007185_1 was given the two-part accession AE009439_1_1__GB_GCA_000007185_1.
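The accession construction described above amounts to a simple string concatenation of the two database fields; a minimal sketch:

```python
def two_part_accession(gene_id: str, gtdb_id: str) -> str:
    """Join a gene ID and a GTDB genome ID into the unique two-part
    accession string used as the sequence identifier."""
    return f"{gene_id}__{gtdb_id}"

# The worked example from the text:
acc = two_part_accession("AE009439_1_1", "GB_GCA_000007185_1")
print(acc)  # AE009439_1_1__GB_GCA_000007185_1
```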
A DIAMOND index was then generated using the following command (requiring 33 CPU minutes): diamond makedb --in annotree.fasta.gz -d annotree.
MEGAN mapping databases.
We use the term “meganization” to refer to the process of analyzing the alignments of a set of sequences so as to perform taxonomic binning (for example, using the naive LCA algorithm for short reads or the interval-union LCA for long reads) and functional binning (usually using a best-hit approach). Meganization of DIAMOND alignments is performed either interactively using MEGAN or in a command-line fashion using the daa-meganizer program, which is bundled with MEGAN.
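For illustration, the naive LCA placement used for short reads can be sketched as follows, on a toy child-to-parent taxonomy; this is an illustrative sketch, not MEGAN's implementation:

```python
def naive_lca(taxa, parent, root=1):
    """Return the lowest taxon that is an ancestor of (or equal to) every
    taxon in `taxa`, given a child -> parent mapping rooted at `root`."""
    def path_to_root(t):
        path = [t]
        while t != root:
            t = parent[t]
            path.append(t)
        return path

    paths = [path_to_root(t) for t in taxa]
    common = set(paths[0])
    for p in paths[1:]:
        common &= set(p)
    # The first common node on any root-path is the lowest common ancestor.
    for node in paths[0]:
        if node in common:
            return node

# Toy taxonomy: 1 = root, 2 = Bacteria, 5 = a family, 10/11 = genera in it.
parent = {2: 1, 5: 2, 10: 5, 11: 5}
print(naive_lca([10, 11], parent))  # 5
print(naive_lca([10], parent))      # 10
```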
We used the above-described two-part accessions as the primary key for the mapping table. We determined the other entries of the mapping table as follows. The values for the GTDB and NCBI taxonomies were obtained from the "node_tax" tables of the two MySQL databases described above, using the gtdb_id part of the two-part accession.
The value for the KEGG classification was obtained from the “kegg_top_hits” tables of the two MySQL databases, using the gene_id part of the two-part accession. In the case that there is more than one possible KEGG assignment for a given protein, we randomly selected one. This was necessary because MEGAN allows at most one assignment per reference sequence. Both GTDB and KEGG IDs were additionally formatted to match the format required by MEGAN.
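A sketch of this one-assignment-per-protein reduction, using a hypothetical dictionary of multi-valued KEGG hits and a seeded random choice for reproducibility:

```python
import random

def pick_single_kegg(kegg_hits, seed=42):
    """Reduce multi-valued KEGG hits to one KO per gene_id, because MEGAN
    allows at most one assignment per reference sequence. The input sets
    are sorted before choosing so that a fixed seed gives a fixed result."""
    rng = random.Random(seed)
    return {gene_id: rng.choice(sorted(kos))
            for gene_id, kos in kegg_hits.items()}

# Hypothetical example: one protein with three candidate KO assignments.
picked = pick_single_kegg({"AE009439_1_1": {"K00001", "K00002", "K00003"}})
```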
TABLE 5
Classification | NCBI-nr (no.) | AnnoTree (no.) | Ratio |
---|---|---|---|
NCBI taxonomy | 182,329,414 | 106,052,079 | 1.72 |
GTDB taxonomy | 126,956,422 | 106,052,079 | 1.2 |
EC | 4,501,593 | 2,962,187 | 1.51 |
eggNOG | 4,274,800 | 3,506,041 | 1.21 |
InterPro | 19,748,423 | 11,069,757 | 1.78 |
KEGG | 8,218,708 | 56,577,432 | 0.15 |
SEED | 31,117,272 | 16,183,436 | 1.92 |
a For the two taxonomic classifications (NCBI and GTDB) and the five functional classifications (EC, eggNOG, InterPro, KEGG, and SEED) supported by MEGAN, we report the number of prokaryotic accessions in NCBI-nr or AnnoTree that have a mapping to a class in the classification and the corresponding ratio.
For most functional classifications, the number of AnnoTree proteins with assignments is smaller than the number of NCBI-nr proteins with assignments, because the AnnoTree assignments are copied over from the NCBI-nr assignments. In the case of KEGG, the assignments are obtained from the AnnoTree database, which appears to maintain a much richer mapping than that previously provided by MEGAN.
Data set processing.
We ran DIAMOND (v2.0.4.142) on each short-read data set as diamond blastx -d 〈index〉 -q 〈input〉 -o 〈output〉 -f 100 -b24 -c1. Here, 〈index〉 is the index file, either annotree or nr, computed as described above. The input file, in (compressed) FastA or FastQ format, is specified by 〈input〉, and the output file is specified by 〈output〉. We used -f 100 to specify output in DAA format. The two remaining options, -b24 -c1, were used in an attempt to tune performance.
In addition, for purposes of comparison against the AnnoTree database, when running DIAMOND on NCBI-nr, we used the option --taxonlist 2,2157 to restrict alignment to bacteria (taxon ID 2) and archaea (taxon ID 2157).
When processing long reads, we also specified the --long-reads option.
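For a given data set, the options described above can be assembled into a single DIAMOND invocation; the helper below is an illustrative sketch (not part of the pipeline), with file names chosen for the example:

```python
def diamond_command(index, input_file, output_file,
                    restrict_to_prokaryotes=False, long_reads=False):
    """Assemble the DIAMOND blastx invocation described in the text as an
    argument list: DAA output (-f 100) and the performance-tuning options
    -b24 -c1, plus the optional taxon restriction and long-read mode."""
    cmd = ["diamond", "blastx", "-d", index, "-q", input_file,
           "-o", output_file, "-f", "100", "-b24", "-c1"]
    if restrict_to_prokaryotes:   # NCBI-nr runs only: bacteria + archaea
        cmd += ["--taxonlist", "2,2157"]
    if long_reads:                # e.g., the Nanopore data set
        cmd += ["--long-reads"]
    return cmd

cmd = diamond_command("nr", "reads.fastq.gz", "out.daa",
                      restrict_to_prokaryotes=True)
print(" ".join(cmd))
```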
The resulting DAA files were meganized using the daa-meganizer program of MEGAN (version 6.21.5, ultimate edition, built 5 May 2021) as tools/daa-meganizer -i 〈input〉 -mdb 〈mapping〉. Here, the input file 〈input〉 is a DAA file produced by DIAMOND, and the mapping file 〈mapping〉 was either megan-map-Jul2020-2-ue.db or megan-mapping-annotree-June-2021.db, depending on whether the DIAMOND run was against the NCBI-nr or AnnoTree protein database, respectively. When processing long reads, we also specified the -lg option.
Data comparison.
We used the MEGAN tool daa2info to extract the mapping of reads to taxonomic and functional classes obtained in both the NCBI-nr and AnnoTree runs of the DIAMOND+MEGAN pipeline.
The following command was used to extract the mapping of reads to classes for all classifications: tools/daa2info -i 〈input〉 -o 〈output〉 -l -m -r2c Taxonomy GTDB KEGG EC EGGNOG INTERPRO2GO SEED. Here, the input file 〈input〉 is a meganized DAA file and the output file 〈output〉 is a text file. The output was used to determine the assignment rates for different classifications.
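To turn such an output file into assignment rates, one can count the distinct reads assigned per classification. The parser below assumes, hypothetically, one tab-separated record per line of the form 〈classification〉, 〈read name〉, 〈class ID〉; the actual daa2info layout may differ:

```python
from collections import defaultdict

def assignment_rates(lines, total_reads):
    """Compute, per classification, the percentage of reads that received
    at least one class assignment, from tab-separated read-to-class records."""
    assigned = defaultdict(set)
    for line in lines:
        classification, read, _class_id = line.rstrip("\n").split("\t")
        assigned[classification].add(read)
    return {c: 100 * len(reads) / total_reads
            for c, reads in assigned.items()}

# Hypothetical records for a 4-read sample.
rates = assignment_rates(
    ["Taxonomy\tread1\t2", "Taxonomy\tread2\t2157", "KEGG\tread1\tK00001"],
    total_reads=4)
print(rates)  # Taxonomy: 50.0, KEGG: 25.0
```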
Computational resources.
The DIAMOND+MEGAN pipeline was run on a Linux virtual machine (provided to us by de.NBI Cloud, highmem xlarge) running Ubuntu 18.04.4 LTS, with an Intel Xeon Gold 6140 CPU at 2.30 GHz (28 sockets, 1 core per socket, 1 thread per core; 28 CPUs, on-line list 0 to 27), 504 GB of RAM, and 6 TB of hard disk space. Reported run times are "user time" as calculated using the Linux "time" command. Furthermore, for MEGAN, the RAM size was set to 500 GB. All other calculations were undertaken on a MacBook Pro laptop with a 2.6-GHz 6-core (12-thread) Intel Core i7 processor and 16 GB of 2,400-MHz DDR4 RAM.
Statistical analysis.
Data availability.
ACKNOWLEDGMENTS
This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, and 031A538A). We also acknowledge support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC, and the German Research Foundation (DFG) through grant no. INST 37/935-1 FUGG. We acknowledge infrastructural support by the cluster of Excellence EXC2124 Controlling Microbes to Fight Infection (CMFI), project ID 390838134. C.B. was supported by the German Research Foundation (DFG) through grant no. HU 566/12-1. Further, we acknowledge support by the Open Access Publishing Fund of University of Tübingen.
D.H.H. and C.B. conceptualized the project. H.F. and C.B. performed the computations. A.G. and H.F. analyzed the results. A.G. and D.H.H. wrote the manuscript. All authors edited the manuscript.