The genomic landscape of reference genomes of cultivated human gut bacteria

The expanded repertoire of isolate genomes in CGR2

In continuation of our previous work on cultivation and sequencing gut-resident microbes, a total of ~20,000 bacterial isolates were cultivated. 4066 of these isolates were selected for whole-genome sequencing generating 3324 high-quality genomes with more than 90% completeness and less than 10% contamination (Supplementary Fig. 1, and Supplementary Data 1). We subsequently clustered the 3324 genomes into 527 species-level clusters on the basis of 95% average nucleotide identity (ANI), of which 189 clusters (1804 genomes) were not included in the CGR. The clusters were distributed between 8 phyla, of which Bacillota represented more than half of the clusters (1805 genomes, 343 clusters). Notably, Synergistota, Thermodesulfobacteriota, and Verrucomicrobiota were newly included in CGR2 compared to CGR (Fig. 1a). Of the 527 species-level clusters, 179 were not classified at the species-level, and 21 lacked a genus-level match (Supplementary Data 2), indicating that these clusters harbor previously unidentified species. Whereas some important species were clearly underrepresented in CGR, CGR2 encompasses a large number of high-quality genomes of Bifidobacterium longum, Bifidobacterium pseudocatenulatum, Bifidobacterium adolescentis, Escherichia coli, and Enterococcus faecalis, which can be used for species pan-genome and diversity analyses (Supplementary Fig. 2).

Fig. 1: Taxonomic profile of CGR2.
figure 1

a Phylogenetic analysis of 3324 genomes. Color range indicates the 1804 newly sequenced genomes (blue) and the 1520 CGR genomes (pink). Singleton genomes are marked with red dots at the end of the clade. The first layer depicts the GTDB phylum annotation, the second layer describes the matching to the GTDB database at the species and genus level, and the circumferential bar plot (dark blue) illustrates the genome size. b Abundance and prevalence of 527 representative clusters in healthy cohorts of China, HMP, and the Netherlands. Gray box, Log10 (relative abundance); Dot, median of log10 (relative abundance); Bar, prevalence; Color, phylum. c Matching of CGR2 to the hGMB and UHGG genome collections. The Venn diagrams are colored according to the origin of the samples and the numbers are indicated. d The ratio of the genome length (median: 88.84%) and gene number (median: 89.33%) of the UHGG-Uncultured relative to CGR2 in the mapped genomes of each family. A dot represents a UHGG genome, and different colored dots indicate different family.

The distribution of 527 species-level clusters in 3 representative metagenomic cohorts of different origins, including China (a part of 4D-SZ)10, the Netherlands11, and HMP (Human Microbiome Project) is shown in Fig. 1b. The prevalence of Flavonifractor plautii, Bacteroides uniformis, and Bacteroides caccae exceeded 95% in three cohorts (Supplementary Data 3). Beta diversity showed that there were significant differences between the 527 clusters in the three cohorts (R2 = 0.2984, P < 0.001), especially between the cohorts from China and the Netherlands (Supplementary Fig. 3a). The correlations between the 527 clusters and the coordinates of microbial communities suggested that Prevotella sp. (Cluster 62), Bacillus luti, Paenibacillus sp. (Cluster 281), and Paenibacillus macerans had the most significant impact on the distribution of these clusters (P < 0.001, Supplementary Fig. 3a, and Supplementary Data 3). Species of Prevotella are important members of the human gut microbiota, and recent studies suggested a reclassification of Prevotella into seven genera12,13, which display different metabolic characteristics. In addition, compared with the cohort from the Netherlands, the medians and means of the 527 clusters in the cohort from China and the HMP cohort were more similar (Supplementary Fig. 3b, c). Examining the distribution of the 179 previously unidentified species in the different populations, we found that the average abundance of these previously unidentified species in the Chinese population was 0.08%, which was significantly higher than that in the other two cohorts (P < 0.0001, Supplementary Fig. 4a). However, the occurrence is much lower than that in the cohort from the Netherlands (P < 0.001, Supplementary Fig. 4b). 42 species were significantly enriched in the Chinese cohort (Supplementary Fig. 4d). Of note, the inclusion of previously unidentified species in CGR2 significantly improves metagenomic reads mapping rate in Chinese and non-Chinese populations, especially for the cohort from the Netherlands (P < 0.0001, Supplementary Fig. 4c, and Supplementary Data 3).

Mapping the CGR2 genomes against the 3312 genomes representing uncultured species and 438 genomes of cultured species from other sources, we found that 146 CGR2 genomes matched 126 UHGG genomes of uncultured species, and that 136 CGR2 genomes matched 48 genomes of cultured species from other sources in UHGG, illustrating how our collection increases the taxonomic diversity of cultivable microorganisms in the human gut (Fig. 1c). Comparing the matches from 146 CGR2 genomes and 126 UHGG-uncultured genomes, we found not surprisingly that the gene number and scaffold N50 of the genomes obtained by sequencing of cultivated isolates were significantly higher than those from MAGs (P < 0.0001, Supplementary Fig. 5). We compared genome length and gene differences of the genome sets to explore the assembly gaps between genomes based on isolates and MAGs. In general, less than 90% of the genome length and gene number of culture-based genomes were covered by the corresponding MAGs (Fig. 1d). Comparing genomes based on isolates with the corresponding MAGs, we identified 1543 unique genes present in the isolate-based genomes, but absent in the MAGs with Erm being the gene most frequently missing in the MAGs (Supplementary Fig. 6, and Supplementary Data 4). Only 286 CGR2 isolated genomes (91 clusters) mapped to 89 human gut culture-based genomes originating from non-Chinese samples in the UHGG, possibly reflecting the differences in the composition of the gut microbiota in individuals of different ethnicity and/or living in different geographical locations14. The Broad Institute-OpenBiome Microbiome Library (BIO-ML)15 is a human gut strain collection established from FMT donors. A comparison revealed that 82 of the BIO-ML species-level clusters were also included in CGR2, and only 22 BIO-ML species-level clusters were not included in CGR2, implying an 84.44% taxonomic novelty in CGR2 (Supplementary Fig. 7a). We also compared the genomes of CGR2 with hGMB16, a recent cultured genome collection based on Chinese samples, showing that 2306 of the CGR2 genomes covered 57.92% of the hGMB genomes, including 49 of 108 newly characterized and classified species. Overall, 144 clusters from CGR do not exist in any existing collection, and the newly sequenced genomes of CGR2 contributes with 45 unique clusters (Supplementary Fig. 7b). Further, we found that 89 of 179 previously unidentified species were not represented in UHGG, BIO-ML or hGMB, including one cluster being annotated only at the class-level (Supplementary Fig. 8a, and Supplementary Data 2). It is worth noting that these underrepresented previously unidentified species may be widely distributed in the cohorts from China and the Netherlands, and in the HMP cohort (Supplementary Fig. 8b–d). In addition, there are still 31 previously unidentified species in CGR2 that are only represented by MAGs, while we provide the cultured strains to facilitate subsequent taxonomic characterization (Supplementary Fig. 8a).

The distribution of carbohydrate-active enzymes (CAZymes)

To examine the function of the culture isolates of CGR2, we performed a comprehensive in-depth analysis of carbohydrate-active enzymes. Notably, isolates of the Bacteroidota phylum harbored the largest and most diverse CAZyme repertoires (Fig. 2a, and Supplementary Fig. 9a), consistent with previous studies17,18. 86.08% of the predicted CAZyme genes in CGR2 belong to the glycoside hydrolases family (GH) and the glycosyltransferases family (GT) (Supplementary Data 5). We next explored the potential for utilization of dietary fibers (including pectin, cellulose, and inulin) in the 3324 genomes. Bacteroidota contained more GH or polysaccharide lyase (PL) genes reported to potentially be involved in the degradation of dietary fiber (Supplementary Fig. 9b). Bacteria belonging to the Bacteroidota phylum are able to utilize a broad range of carbohydrate substrates19, and they may play a role in initiating the primary breakdown of dietary polysaccharides in the human gut.

Fig. 2: Distribution of CAZymes in the CGR2 genomes.
figure 2

a Differences in the numbers of genes encoding GH and PL, and the numbers of GH and PL families represented in these genomes. b Simplified illustration showing the degradation of dietary fibers and the synthesis of SCFAs by the dominant genera (see “Methods”). Ratio of the numbers of strains in top five genera with complete pathways and in genera of CGR2 is shown as pie plot charts (see Supplementary Data 6). Phylum annotation of bar chart as in (a). c Simplified schematics of pathways involved in HMO degradation and phylogenetic tree of 3324 genomes combined with the heatmap on the outermost layer indicating the presence or absence of HMO-degrading GH and CBM genes. The colored rings around the tree represent the taxonomic classifications using the same annotation as in panel (a). LNT, lacto-N-tetrose. LNB, lacto-N-biose. LNBP, lacto-N-biose phosphorylases. LNnT, lacto-N-neotetraose. Star marks bifidobacteria. d PCoA based on Bray–Curtis dissimilarity of the numbers of HMO-degrading CAZyme genes. Ellipses cover 95% of the genomes for each phylum. Phylum annotation as in (a).

We next collected all enzymes involved in decomposition pathways of pectin, cellulose and inulin, and synthetic pathways of the three short chain fatty acids (SCFAs), acetate, propionate, and butyrate, from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database to screen for potential glycan-degrading and SCFA-producing strains (Supplementary Fig. 10). The result showed that most strains in CGR2 had potentials for breaking down inulin (96%) and producing SCFAs, acetate (100%), propionate (97%), and butyrate (79%) (Supplementary Data 6). 193 Strains belonging to 42 genera were predicted to possess at least one complete enzymatic pathway for each of these metabolic pathways (Fig. 2b). Ten genera, Bacteroides, Bifidobacterium, Enterococcus, Phocaeicola, Blautia, Collinsella, Streptococcus, Enterocloster, Escherichia, and Mediterraneibacter contained the majority of strains with complete pathways for decomposing glycans or synthesizing SCFAs.

Human milk oligosaccharides (HMOs) play important roles in the early nutrition of beast-fed infants and serve as substrates supporting the growth of important members of the Bifidobacterium genus dominating the gut in early life20. Large type I and II HMOs, two main types of HMOs, can be broken down by GH95 AfcA, GH29 AfcB, and GH33 SiaBb releasing lacto-N-tetraose (LNT) and lacto-N-neotetraose (LNnT)21,22 (Fig. 2c). GH136 (lacto-N-biosidase) and GH112 (lacto-N-biose phosphorylase) are the core hydrolases in type I HMO degradation, and GH2/42 (β-galactosidase) and GH20 (β-N-acetylhexosaminidase) are the core hydrolases in type II HMO degradation21,22,23. The two carbohydrate-binding module families, CBM32 and CBM51, are possibly relevant to HMO degradation21. We next explored the distribution of these CAZyme families in our genome collection. In CGR2, GH2 is the most widely distributed GH, present in 81.6% of the genomes. Notably, 337 and 1546 genomes contained complete CAZyme families for degradation of type I and II HMOs, respectively (Supplementary Data 7). Members of the phylum Bacteroidota possess GH2, GH20, GH29, GH95, and CBM32, but lack GH112 and GH136, the key CAZyme families involved in type I HMO degradation (Fig. 2c). In addition to the presence and absence of CAZyme families involved in HMO degradation, unconstrained principal coordinate analysis (PCoA) was used to explore the similarity of the gene numbers of these CAZyme families. This analysis showed that members of Bacteroidota formed a distinct central cluster (Fig. 2d).

For bifidobacteria, GH2 and GH42 are the two most prevalent GH families, included in all genomes of this genus (Supplementary Fig. 11a). B. bifidum and B. longum contain type I and II HMO degrading CAZyme families. In addition, B. bifidum harbors more hydrolases and CBMs belonging to CAZyme families involved in HMO degradation, suggesting that this species is highly adapted for use of HMOs21. In addition, our analyses demonstrated that Roseburia, a butyrate producer possessing multiple HMO utilization CAZyme families, also has a potential for metabolizing HMOs consistent with a previous study23 (Supplementary Fig. 11b). These analyses support the notion that Bacteroidota together with bifidobacteria, belonging to the Actinomycetota, in CGR2, play a key role in promoting the release of HMO central moieties, consistent with previous reports21,24.

Identification of genes involved in the synthesis of secondary metabolites in the gut microbiome

We performed a comprehensive analysis of secondary metabolite biosynthetic gene clusters (SMBGs) using anti-SMASH (v4.2.0)25. Notably, 4132 gene clusters involved in the generation of secondary metabolites were identified in 2049 genomes (Supplementary Data 8 and “Methods”). Of these gene clusters, the most abundant were inferred to participate in the biosynthesis of sactipeptides (907), followed by non-ribosomal peptide synthetases (NRPSs, 804) and bacteriocin (740). A total of 24 different biosynthetic types were predicted to be present in the 8 phyla present in CGR2, with Bacillota harboring the highest abundance of SMBGs and a broad distribution of specialized metabolites (Fig. 3a).

Fig. 3: Abundant SMBGs in CGR2.
figure 3

a The distribution of SMGBs in different phyla. The size of the dot indicates the number of SMBGs. b Sequence similarity network of RiPPs. Nodes represent sequences of BGC domains and are colored by the phylum of the BGC-derived genome (black nodes: reference BGC). Edges drawn between the nodes correspond to pairwise distances. c The sequence evolution relationship of the RumA clan; different shapes indicate different modules.

4132 SMBGs clustered into 7 classes (978 gene cluster families, GCFs). Only 46 GCFs are included in the MIBiG database reference with known functions, indicating that most of the SMBGs we predicted lack experimental verification and might potentially represent novel functions (Supplementary Fig. 12). The largest class is ribosomally produced and post-translationally modified peptides (RiPPs), which include bacteriocins, lantipeptides, and sactipeptides. To better understand the diversity of SMBGs of relevance for RiPPs, sequence similarity networks were constructed for 2303 SMBGs (528 GCFs). The networks showed that most predicted RiPPs were from the Bacillota phylum, indicating that Bacillota could be a potentially abundant source of RiPPs (Fig. 3b).

Ruminococcin A (RumA), naturally produced by the strictly anaerobic bacterium Ruminococcus gnavus E1, has high activity against pathogenic Clostridium spp, and has been used for clinical treatment26. However, due to low production yields (<1 μg/L) and difficult cultivation of R. gnavus E1, high-quality production of RumA is challenging27. Our results showed that 7 SMBGs mined from Faecalimonas umbilicata, Mediterraneibacter faecis, Fusicatenibacter saccharivorans, Blautia sp., and Waltera intestinalis harbor genes related to the biosynthesis of RumA, suggesting that these bacteria may be used as a potential alternative source for the production of RumA (Fig. 3c). NRPSs constituted the second largest class, containing 816 SMBGs (198 GCFs), mainly from the Bacillota and Pseudomonadota phyla (Supplementary Fig. 13a). By analyzing 8 clans containing SMBGs with reliable experimental evidence, we discovered that one of the clans is related to dipeptide aldehydes, highly potent cell-permeable protease inhibitors, initially detected in Ruminococcus sp28, while the SMBGs of this clan were all derived from Blautia (Supplementary Fig. 13b). Similarly, for the other 5 classes of SMBGs, we also conducted a network similarity analysis (Supplementary Fig. 14). These results revealed an unexplored novelty and diversity of SMBGs from human gut microbes.

Prediction of prophages in the isolated genomes and phage-bacteria interactions in the gut microbiota

To construct detailed networks between phages and the host bacteria, we identified bacteriophages in the culture-derived genomes of CGR2 using VirSorter29. A total of 14,249 potential viral sequences were predicted from 3324 genomes, with 6274 being considered as “most confident” and “likely” phages and prophages30. After quality evaluation by CheckV, 22 phage genomes were assigned as complete, 150 as high-quality, and 2648 as medium-quality, representing the Viruses of the Cultivated Genome Reference (termed CGRv) (Supplementary Fig. 15a, and Supplementary Data 9). Comparing with the Gut Phage Database (GPD)31 and Metagenomic Gut Virus (MGV)32, 60.53% of the phages in CGRv were not reported previously (Supplementary Fig. 15b). In addition, we discovered three jumbo phages (>200 kb of length), a group of phages rarely described previously33.

A phylogenetic tree was constructed to explore the evolutionary characteristics of phages in the gut microbiota. Notably, phages from Actinomycetota clustered within a single clade (Fig. 4a). However, phages from Bacillota were widely distributed throughout the phylogenetic tree, reflecting a high level of variation of phages in this phylum. To investigate the phage diversity of the CGRv, we conducted a clustering and taxonomic approach using VConTACT v2.034 (Supplementary Data 10). In total, 2117 clustered phages were divided into 317 virus clusters (VCs), hosted by 315 bacterial species (59.8% in CGR2). For the taxonomy of phages in CGRv, only 269 phage genomes (12.7%) could be assigned to known families, including Siphoviridae (135), Myoviridae (108), and Podoviridae (26), whereas the vast majority of the phages were previously unidentified at the family level.

Fig. 4: Host range of phages and interactions between phages and bacteria.
figure 4

a Phylogenetic tree of 1919 phage sequences. The inner circle is colored according to the host phylum of the phages, and the outer circle is colored according to the phage family (227 phages were assigned). b Phylogenetic tree of CGR2 genomes showing phage host range. The height of the gray bars denotes the number of viruses the host harbors. Connections in four colors where each color represents one VC that can infect hosts in different phyla. c Phylogenetic analyses of PC_000663. Phages are named “VIR_host species_host number”. Leaf colors indicate to which phylum phages belong.

Next, we determined the host range of phages and phage-host network visualized using Cytoscape35 (Supplementary Fig. 16a). VC_399, a most infectious VC of CGRv, has the broadest range of bacterial hosts and can infect 31 bacterial species from Bacillota, Actinomycetota, and Synergistota (Supplementary Fig. 16b). In addition, Clostridium fessum and Phocaeicola vulgatus were the most targeted species that may be infected by 15 different VCs (Supplementary Fig. 16c). Strikingly, four VCs were contained in the genomes of bacteria spanning different phyla, including VC_399, VC_67, VC_195, and VC_220 (Fig. 4b). This may represent horizontal gene transfer events between host bacteria in different phyla. For the quantification of the host range of CGRv, we found that more than half of the VCs (167/317) may infect multiple species of bacteria (Supplementary Fig. 16d).

We conducted an identification of proteins encoded by CGRv. A total of 212,369 proteins were predicted and clustered as 25,345 non-redundant protein clusters (Supplementary Data 11). 169,859 proteins (79.98%) were uncharacterized, while 7394 proteins (3.48%) were annotated as related to phage structure. Particularly, PC_000663, a protein cluster involved in O-antigen conversion and bactoprenol glucosyltransferase, which are considered to be related to polymyxin resistance, was predicted to be encoded by 34 phages with 34 different bacterial hosts in the Bacillota and Pseudomonadota phyla (Fig. 4c). The proteomic tree of PC_000663 showed that Lactococcus garvieae TM115-50 and Lactococcus petaurid TM115-81 phylogenetically formed the closest clade. We found that the two species were infected by different VCs that can encode PC_000663, suggesting that the homology within PC_000663 may reflect horizontal gene transfer events.

Genome-wide analysis of Collinsella aerofaciens reveals high genomic and functional variations

Many bacteria exhibit wide variations between different niches. In this study, we discovered more than 10 clusters of Collinsella aerofaciens in CGR2, indicating a high level of genomic diversity of these bacteria. The genomic ANI clusters showed that 130 isolated genomes from CGR2 and 67 genomes retrieved from NCBI (including 10 UHGG isolated strains) were divided into 19 clades, included 8 singleton clades (Fig. 5a). Interestingly, the distribution of the 19 clades was inconsistent with their phylogenetic clades in the SNP phylogenetic tree. We then clustered these 197 genomes into 5 groups using CDS sequences highly consistent with the SNP phylogenetic tree, of which group 2 and group 4 were singletons (Fig. 5b), suggesting that the mutation in non-protein-coding intergenic regions is one cause of high genomic diversity. SNPs analysis showed that less than 10% SNPs were present in the intergenic regions, but more than twice the number of insertions or deletions (InDels) were detected in the intergenic regions compared to CDS regions (Fig. 5c, d, and Supplementary Fig. 17a, b), indicating that these species had rapid InDel structural variations in the intergenic regions. We checked the 10 CDSs with top mutation frequency and found that the variant type and frequency appeared to correlate with cluster and country (Supplementary Fig. 17c). Furthermore, the gene copy number of several CAZy enzymes differed significantly between different groups, with the gene numbers of GH1, GH4, and CE9 in the group 3 strains being higher than those of group 1 and group 5 strains (P < 0.01, Supplementary Fig. 18), which may underlie the differences in the capacity for utilization of cellulose36, cleavage of the glycosidic bond37, and biosynthesis of amino-sugar-nucleotides38. Taken together, we greatly increased the reported genomic diversity of C. aerofaciens providing important insights for uncovering the genomic and functional differences of different groups of C. aerofaciens.

Fig. 5: Genome-wide analysis of 197 Collinsella aerofaciens genomes.
figure 5

a Whole-genome phylogenetic tree constructed based on ANI. According to the distance of 0.05, the genome can be divided into 19 clades. b Consistency between the genomic SNP phylogenetic tree and the 5 groups of CDS sequences. Tip shapes and colors represent the genome source and ANI clade, respectively. Dashed lines indicate the singletons in group 2 and group 4. For each genome, the first layer represents the group, the second layer denotes the country where the strain was isolated. c Distribution of InDels and SNPs in CDS regions and intergenic regions for each genome. Colors represent the group. d The InDel and SNP variation statistics in CDS regions and intergenic regions. Colors represent the group. P values are from Wilcoxon rank-sum test (two-sided) (****P < 0.0001).

Read more here: Source link