In silico evaluation and selection of the best 16S rRNA gene primers for use in next-generation sequencing to detect oral bacteria and archaea | Microbiome

To the best of our knowledge, this is the first study that evaluates in silico the coverage of 16S rRNA gene primers for the detection of oral-bacterial and archaeal species. The primer sequences were obtained not only from sequencing-based studies of the microbiota inhabiting the human mouth, but also from an article containing primers used in ecosystems as dissimilar as the marine, geothermal, human gut, or cattle gut [16]. Thus, numerous primers from diverse ecosystems were analysed to find those which performed better in the oral cavity. Moreover, to perform the analysis, we improved an earlier database of 16S rRNA gene sequences of oral bacteria [33] and created another from scratch that contained sequences from archaeal species found in the oral cavity.

We identified a series of individual primers that performed well in the detection of oral bacteria and/or archaea and combined them to create primer pairs. These were defined as “bacteria-specific,” “archaea-specific,” or “bacterial and archaeal” based on the results of their levels of coverage set out in the two databases. We also produced a series of primer pairs that may be the most suitable combinations for use when sequencing the oral ecosystem. These were classified according to the domain targeted, their mean amplicon length category, and the 16S rRNA gene region amplified.

Comparative analysis of our coverage results of 16S rRNA gene primers with the literature

The investigation by Klindworth et al. [16] is perhaps the most comprehensive to date on the coverage and phylum spectrum of 16S rRNA primers. These authors assessed 175 primers and 512 primer pairs in silico against the Silva non-redundant reference database (version 108) [26], producing a selection of those that performed best for bacteria and archaea. Like us, this group organised the most suitable primer combinations for the different sequencing technologies into three categories according to their amplicon length (100–400, 400–1000, and >1000 bps). They then re-evaluated their analysis using the Global Ocean Sampling (GOS) dataset [56, 57], which is limited to the marine habitat, and they examined experimentally the primer pair that performed best [16].

We identified two investigations involving the oral ecosystem that used the Silva database [26]—versions 111 [19] or 132 [22]—to analyse the efficiency of 16S rRNA gene primers for detecting the archaea diversity in oral samples [22], or for reconstructing the microbiome of ancient dental calculus specimens [19]. Also, a third study evaluated the potential of seven primer pairs for detecting 219 species in a foregut dataset they created, which included oral, oesophageal, and gastric 16S rRNA gene sequences [23]. The pair with the best results for classifying the foregut genes was also analysed against the RDP database [27].

The numbers of 16S rRNA gene primer pairs evaluated in these three oral-related studies is substantially lower than in the present investigation: 12 individual primers combined to form 12 primer pairs [22]; 25 individual primers combined into 14 pairs [19]; and 14 individual primers grouped into 14 pairs [23]. In our study, we analysed 369 distinct individual primers and 4638 different primer-pair combinations. On the other hand, two investigations used the Silva database [19, 22, 26] which has broad phylogenetic diversity and contains information applicable to many environments but also includes 16S rRNA gene sequences that are misannotated taxonomically [33]. Specifically, comprehensive databases such as Silva [26], RDP [27], or Greengenes [58] have been estimated to have annotation error rates ranging from 17% to 10% [28], and their accuracy may also be reduced because they contain numerous sequences derived from some environments and only a few from others [29]. Furthermore, the evaluation of the primers’ coverage using an ecosystem-specific database, as in our study, would allow researchers to identify the species covered and not covered by a particular primer pair. In this sense, only Nossa et al. [23] evaluated the primer pairs against a self-created database containing sequences from their three niches of interest: oesophageal, oral, and gastric (together described as the foregut). Nevertheless, although this database contained 9484 sequences, only 2373 were oral and, overall, they represented just 219 bacterial species. These numbers are much lower than those in the bacteria database used in our study, which is based on eHOMD [34], and to which we added sequences from our self-created archaeal dataset (bacteria: 223,143 sequences, 769 species; archaea: 2842 sequences, 194 species).

In the present study, none of the individual primers yielded an SC= 100% when analysed against the oral bacteria or archaea databases. Due to this, it would not be possible to obtain a primer combination with such value as the coverage estimates of primer pairs are always lower than the values of the individual primers that form it. However, we do not know whether any of the primers included in this research could obtain such a value if mismatches were admitted.

Bacteria-specific primer pairs

Table 5 summarises our results and those in other publications, with the primer pairs ordered by the mean amplicon length and the domain targeted. Concerning the bacteria-specific candidates, our coverage estimates for KP_F047-KP_R021 (100–300 bps), KP_F049-KP_R033 (301–600 bps), KP_F056-KP_R074 (301–600 bps), KP_F033-KP_R060 (>600 bps), and KP_F047-KP_R053 (>600 bps) were similar to those of other studies, with differences no greater than 5.00% for both the bacteria and archaea domains [16]. It should be noted that the latter, classified here as having a mean amplicon length > 600 bps but put in the medium-length category by Klindworth [16], had a lower bacterial coverage when analysed against the GOS database [56, 57]. This was also the case for KP_F056-KP_R074 [16]. Moreover, the coverage values of the pairs KP_F077-KP_R071 (100–300 bps) and KP_F047-KP_R035 (301–600 bps) in other studies were similar to those in our research, but the archaeal coverage was notably higher—~31.00% [19] and ~83.00% [21, 25] more, respectively. KP_F047-KP_R035, which had an archaeal SC= 0.00% in our analysis, has been described elsewhere as having universal coverage for both archaea and bacteria [25]. We, therefore, believe that KP_F047-KP_R035 has value for detecting archaea in environmental [21, 25] or human gut [25] specimens, but not in samples from the oral cavity.

Table 5 Coverage findings described in the literature for the gene primer pairs analysed in the present study

The remaining candidates to be bacteria-specific primer pairs in all the amplicon length categories herein were better at detecting bacteria than in other studies, with differences of 43.00% to 5.00%. Only Klindworth’s [16] estimate for the bacterial coverage of KP_F033-KP_R050 was better than ours, with an approximate difference of ~9.00% between the studies (Table 5).

Archaea-specific primer pairs

Two of the primer pairs selected herein to detect oral-archaea species—KP_F018-KP_R002 (100–300 bps) and KP_F020-KP_R013 (301–600 bps)—have previously been described in other studies as the best options for targeting this domain [16]. Nonetheless, the archaeal SC values we obtained exceed Klindworth’s [16] by ~ 20.00% (Table 5). Klindworth’s group [16] also found that KP_F018-KP_R078 had the highest overall archaeal coverage for the amplicons with a long mean length. However, they did not recommend its use due to its low phylum spectrum. This combination produced bacterial SC values of 0.00% and an archaeal SC of 93.30% when analysed against our database (Table 5). Although this is a good result, we prefer OP_F114-KP_R013, which achieves better archaeal SC, or KP_F018-KP_R063, which produces both better archaeal SC and a greater mean amplicon length.

Bacterial and archaeal primer pairs

KP_F020-KP_R032 and KP_F020-KP_R035 (100-300 bps), with bacterial and archaeal coverage estimates >85.00% (mainly considering the Silva database [26]), have been proposed previously as suitable for the detection of both domains [16]. As seen in Table 5, the SC values obtained herein, ≥95.00%, are better than those in other studies [16, 22]. OP_F066-OP_F073 is also among the favoured primers in our study for the detection of bacteria and archaea when using short amplicon lengths (SC= 98.31% and 92.78%, respectively), achieving better coverage than in the research by Zhang et al. [17] (SC= 88.90% and 75.30%, respectively). Meanwhile, although other studies’ in silico analyses of OP_F014-OP_R014 have described it as a good primer pair for detecting the two domains [22, 25], it is not among our recommended primers, since others in the same length and gene-region categories achieved better archaeal coverage. KP_F059-KP_R078 has been proposed by Klindworth et al. [16] as suitable for use with both the bacteria and archaea domains when employing medium mean amplicon lengths (608 bps). However, its length was 622 bps in our study, meaning that it was included in the >600 bps group. In any case, although our coverage values were higher than those obtained previously (SC= 93.63% and 96.39% vs 74.10% and 72.30%, respectively) (Table 5), other primer pairs performed better in both categories as OP_F114-KP_R031 (301–600 bps; bacterial SC= 95.71% and archaeal SC= 98.45%) and OP_F066-OP_R121 (>600 bps; bacterial SC= 94.54% and archaeal SC= 96.91%).

Non-covered species by the 16S rRNA gene primer pairs

The in silico analysis has enabled us to verify that, among the pairs achieving better coverage, the species not covered by the primers targeting a particular region tend to be covered by others relating to a different zone. In this sense, most of the species that were not covered by the bacteria-specific primer pairs from regions 3-4 (100–300 bps), 3-5 (301–600 bps), or 3-7 (>600 bps) were by those from 5-7 and 6-7 (100–300 bps), 4-7 and 7-9 (301–600 bps), or 4-9 (>600 bps), and vice versa. This was also seen in the archaea-specific primers, where taxa not detected by the pairs from regions 3 (100–300 bps), 3-5 (301–600 bps), or 3-6 (>600 bps) were covered by those from 5-6 (100–300 bps), 3-6 and 5-9 (301–600 bps), or 3-9 and 5-9 (>600 bps), and vice versa. Lastly, the pairs for the two domains combined also demonstrated that species not covered by primers from zones 4-5 (100–300 bps) or 3-5 (301–600 bps) were by those from 5-6 (100–300 bps) or 4-6 (301–600 bps), and vice versa. In the combinations with mean amplicon lengths >600 bps, half the taxa that were not covered using primers for amplifying region 3-9 were detected when targeting 5-9. However, in this case, the opposite was not true.

There were exceptions to this general rule, which demonstrated that even for two primer pairs targeting the same gene region, one would be able to cover most of the species that were not detected by the other. As an example, the bacteria-specific pair KP_F048-OP_R043 detected 18 of the 33 species not covered by 18 different primer pairs formed by combining KP_F044, 046, 047, OP_F048, 096, or 108 and KP_R071, OP_R040, or 146 (gene region 3-4; 100-300 bps); and, also, OP_F101-OP_R030 covered 33 of the 62 species not detected by KP_F061-KP_R074 (6-7; 100-300 bps). Furthermore, 54 of the 75 bacteria not covered by KP_F048-OP_R050 were detected using KP_F048-OP_R073 and OP_F050-OP_R073 (3-6; 301-600 bps); and 22 of the 28 species not detected by KP_F051-KP_R053 and OP_F021-KP_R053 were covered by all the other primers targeting region 4-7 (301-600 bps). Also, in the long primer pairs, KP_F048-OP_R030 (3–7) covered 20 of the 37 non-detected taxa by the combinations of KP_F044, 046, 047, OP_F048, 096, or 108 with KP_R074. Meanwhile, the archaea-specific primer KP_F016-KP_R002 covered almost all (6/8) of the taxa not detected using KP_F018-KP_R002, -KP_R003, and -OP_R102 (3, 100-300 bps). In addition, KP_F016-KP_R032 was the only primer from region 3-5 (301-600 bps) able to identify six archaea: Candidatus Korarchaeum cryptofilum, Ferroplasma acidarmanus, Fervidicoccus fontis, Metallosphaera cuprina, Methanocorpusculum labreanum, and Thermophilus pendens; that were not detected by the rest of the primers from the same region. Nevertheless, these exceptions were not observed in the bacterial and archaeal primer pairs.

It is clear that most of the taxa not detected by these well-performing primer pairs must have been identified as present in the oral cavity at some point, or they would not have been included in the databases used for our in silico analysis. However, some of them are microbes associated with prevalent oral pathologies, such as periodontal disease or dental caries. We distinguished four recognised Campylobacter species among the bacteria not detected by some of the bacteria-specific primer pairs: concisus, gracilis, rectus, and showae. The first of these, as part of the Socransky green complex, has traditionally been associated with periodontal health; the remaining three are components of the orange complex, which is related to periodontitis [59]. A further three bacteria commonly found in the healthy periodontium, Leptotrichia bucalis [60], Leptotrichia hofstadii [60], and Rothia dentocariosa [61, 62], were also missed by some of the bacteria-specific and/or bacterial and archaeal primer pairs. Conversely, a few failed to cover bacterial taxa isolated from periodontally-diseased sites (in teeth or implants) or those regarded as novel periodontal pathogens, e.g., Actinomyces dentalis [63], Actinomyces israelii [63], Desulfomicrobium orale [64], Mogibacterium timidum [65], Solobacterium moorei [66, 67], Treponema lecithinolyticum [61, 65, 67], and Treponema maltophilum [68]. A further Actinomyces species, previously classified as naeslundii WVA 963 and now known as johnsonii [69], which has been encountered in both healthy and periodontitis sites [63], was not detected by some of the pairs that produced better coverage estimates. Moreover, different taxa from the phyla Saccharibacteria (TM7), which growing evidence links to periodontal disease [70], were also not covered. Meanwhile, the caries-associated bacterial species that were not detected by some of the primer pairs included Bifidobacterium dentium [71, 72], Lactobacillus reuteri [73], Leptotrichia buccalis [74], Parascardovia denticolens [73, 75], R. dentocariosa [76], and Scardovia wiggsiae [77].

The undetected archaeal species by some of the archaea-specific and/or bacterial and archaeal primer pairs included Methanobrevibacter gottschalkii, Methanopyrus kandleri, Nitrosoarchaeum limnia, and Nitrososphaera evergladensis; these species have been found, in order, in inflamed pulp tissue [78], periodontitis samples [79], endodontic infections [80], and ancient dental calculus [81, 82]. The rest of the non-detected archaea were extracted from the same publication [83] and, as far as we know, not reported by other authors so their role in the oral cavity has yet to be investigated.

Consequently, it would be preferable to choose a primer pair based on the health or disease condition being investigated. If it is known which oral species are not covered by each primer pair in the oral-specific database as we demonstrated for the first time in this study, and which taxa are most commonly associated with the target oral condition, it is possible to select the most optimal primer pair to use in the sequencing-based studies of the oral microbiota.

Primer pairs frequently used in the oral microbiome literature

Finally, our review of the literature found that 206 distinct primer pairs have been utilised to study the oral microbiota via massive sequencing techniques. The combinations employed most commonly were KP_F078-OP_R010 and KP_F047-KP_R035, which were repeated 33 and 21 times, respectively. These were followed by KP_F014-KP_R011, KP_F034-KP_R065, KP_F031-KP_R021, and OP_F009-OP_R029, which appeared in eight, eight, seven, and seven articles, respectively. Four, three, four, 10, and 21 distinct pairs were repeated six, five, four, three, and two times in the sequencing-based studies of the oral microbiome. Lastly, 158 were found only once.

Only 67 of these 206 pairs were evaluated in the present study. This means that at least one of the individual primers from the remaining 139 combinations had a bacterial and archaeal SC < 75.00%. The widely employed primer KP_F078-OP_R010, which targets region 4 and is typically found as 515F-806R, was developed by Caporaso et al. [84] for use in the Illumina sequencing platform. The in silico analysis herein revealed bacterial and archaeal SC estimates of 95.32% and 62.89%, respectively (mean amplicon length= 292 bps), but failed to detect M. kandleri, N. limnia, and N. evergladensis, among other archaeal species. Numerous primer combinations in the same length category (100–300 bps) and targeting the same gene region (4-5) provided better SC for both domains, e.g., KP_F020 and KP_R031, KP_R032, or OP_R070 (bacterial SC range= 95.97–95.58%; archaeal SC range= 99.48–98.97%; mean amplicon length range= 287–284 bps). If only bacteria are to be detected, the primer pair from the same region, OP_F098-OP_R119, is preferrable; although it had a slightly lower bacterial SC (94.54%), its archaeal SC was 0.00%, meaning that no 16S rRNA gene sequence from an oral archaeon would limit the sequencing depth.

KP_F047-KP_R035, directed to amplify region 3-4, has been referred to as 341F-785R, 341F-805R, or 341F-806R and is the pair proposed in the Illumina protocol for the preparation of the sequencing library [85]. In the in silico analysis, it achieved species coverages of 94.15% and 0.00% in the oral-bacteria and oral-archaea databases, respectively, (mean amplicon length= 460 bps). Surprisingly, this pair did not cover the previously mentioned bacterial species A. dentalis, A. israelii, A. johnsonii, D. orale, L. reuteri, M. timidum, T. lecithinolyticum, and T. maltophilum. Although it has been used extensively in oral microbiome studies, in our investigation other primers in the same length category (301–600 bps) and the same gene region (3–5) had better bacterial coverage values and a similar mean amplicon length as KP_F048-KP_R031 (SC= 97.53%). This latter pairing, as well as all those in the same length category and targeting the same region, also failed to detect A. dentalis, A. israelii, A. johnsonii and D. orale. However, unlike KP_F047-KP_R035, L. reuteri, M. timidum, T. lecithinolyticum, and T. maltophilum were covered.

Another widely used primer is 785F-1175R, which has been employed to amplify gene region 5-7. The in silico evaluation of the pair named herein as OP_F009-OP_R029 yielded bacterial SC values of 88.30% and an archaeal SC of 0.00% (mean amplicon length= 410 bps). This, along with all the other bacteria-specific combinations within the same gene region (5–8) and amplicon length category (301–600 bps), was not among the best primers in our investigation (bacterial SC range= 89.60–79.97%; archaeal SC= 0.00%; mean amplicon length range= 411–408 bps). In fact, OP_F009-OP_R029 not only failed to detect A. dentalis, A. israelii, A. johnsonii, C. concisus, C. gracilis, C. rectus, and C. showae, but also the microbes that are widely known to be associated with periodontitis, Porphyromonas endodontalis [86,87,88], Porphyromonas gingivalis [59, 86, 88, 89], and Tannerella forsythia [59, 88, 89]. Consequently, it is preferable to amplify region 4-7 using the pairs KP_F051-OP_R030 or OP_F021-OP_R030, which had better bacterial SC (98.83% and 98.70%, respectively) and mean amplicon lengths (566 bps), and also detected the bacterial species referred to above.

Factors to consider when selecting a 16S rRNA gene primer pair

Although we defined which primer pairs had the highest coverage results for detecting oral bacteria and archaea, this does not necessarily mean that they would be always the best option for any sequencing-based research on the oral microbiome. Other factors such as the amplicon length or gene region targeted should also be taken into account when selecting the optimum primer pair as we did for constructing Tables 2, 3, and 4. Although PCR efficiency decreases when the amplicon length increases [90], in general terms, the longer the fragment sequenced, the lower the taxonomic level that can be achieved [17]. Indeed, sequencing full-lengths, as is possible with PacBio, is regarded as the solution to the limitations of taxonomic classification [17]. Nevertheless, Soergel et al. [29] evaluated primer pairs in common use and found that longer gene amplicons did not necessarily confer better classifications, with the target region (depending on the sample’s origin) impacting taxonomic assignment the most. Similarly, other authors have recently found that the different 16S rRNA gene regions contain varying amounts of information, which significantly affects the composition of the bacterial community [17]. Consequently, we agree that the choice of the target region is also an important factor [17, 29].

In this sense, our study provides the scientific community with information on these three aspects to consider in the selection of a primer pair for a total of 4638 primer pairs, adding the description of the covered and non-covered taxa for each primer pair. Our goal is to enable researchers to select the best primer pair to meet their research expectations.

In view of our results, the bacteria-specific primer pairs showed very similar average coverage values in the three mean amplicon length categories (100-300 bps: 94.19%; 301-600 bps: 94.71%; >600 bps: 94.87%). However, the archaea-specific primer combinations with short mean amplicon lengths had a slightly higher overall coverage than those from the other two categories (100-300: 95.54% vs. 301-600: 93.30% and >600: 93.81%); and the bacterial and archaeal primer pairs with short and long mean amplicon lengths performed better than those from the medium category (100-300: 97.08% and >600: 96.91% vs. 301-600: 94.83%). Given the above, as the differences in overall coverage between the three amplicon length categories were not too large, the choice of amplicon length, and consequently the sequencing platform, is left to the discretion of the researchers. On the other hand, the gene regions showing the greatest coverage values in the three mean amplicon length categories (in order 100-300, 301-600, and >600) were as follows: 3-4, 4-7, and 3-7 (bacteria-specific); 3 or 5-6, 3-5 or 3-6, and 3-6 (archaea-specific); and 4-5, 3-5, and 5-9 (bacterial and archaeal). Thus, for the different lengths of the bacteria-specific primers and the bacterial and archaeal primers, there was a specific region associated with the highest coverage, which was not observed in the archaea-specific primers where greater variability was detected. Moreover, except for the archaeal-specific and the bacterial and archaeal primer pairs in the medium mean amplicon length category, in which region 3-5 showed the highest coverage estimates; there was no consensus between the three types of primer pairs on the most informative region for a given category.

Limitations of the present study

The main limitation of the present study arises from the lack of information on the first and last positions in the sequence annotations stored by the NCBI [37], which suggests that primers targeting these gene regions may have lower VC values. We, therefore, calculated the SC estimates, since a particular species would be regarded as covered by a particular primer if at least one of its variants is amplified. In addition, we were unable to identify the complete genome of some archaeal species in our database (C. K. cryptofilum, M. gottschalkii strain HO, Methanobrevibacter oralis, Methanobrevibacter thaueri strain CW, and N. limnia). Given that the gene sequences from these taxa were not obtained in the same way as for the other species, we cannot be sure that there are no sequence variants in addition to those found in our investigation. It should, however, be noted that the oral archaea database developed by our group is the first proposal and may, therefore, be subject to change. Moreover, additional scientific evidence on the archaea species associated with the oral cavity and its diseases is required to increase the amount of information contained in the 16S rRNA sequence databases.

In consequence, the results of our in silico analysis have potential use in studies of the oral microbiome and need to be confirmed in other experimental studies using omics techniques.

Read more here: Source link