Genome-wide estimation of recombination, mutation and positive selection enlightens diversification drivers of Mycobacterium bovis

Global phylogenetic analysis

A Maximum Likelihood (ML) phylogenetic tree based on the 69 M. bovis isolates and reference genome was obtained (Fig. 2A). This strategy allows the generation of a more robust tree, when comparing with single gene based trees or multi-locus based trees, that do not capture the variability across the entire genome and consequently present low inter-specific discriminatory power68,69. The resulting topology of the ML tree generally agrees with clonal complex classification, with genomes of Eu2 clustering in one tree branch and genomes of Af1 also clustering together (Fig. 2A). Results are also in agreement with the known M. bovis evolutionary relationships that present a large division between Eu1 members and a group composed by all the other clonal complexes and genomes without assigned clonal complex30. Small inconsistencies between clonal complex and the relationships observed at the phylogenetic tree can be explained by the fact that clonal complexes are described based on specific genomic regions, while the phylogenetic tree is based on core genome multi-alignment representing the whole genomes.

Figure 2
figure2

Maximum likelihood phylogenetic tree (GTR) built based on core-genome alignment of M. bovis genomes before (A) and after (B) the removal of recombination sites. Branch colors represent M. bovis clonal complexes: purple for European 1, red for European 2, blue for European 3, orange for African 1 and green for African 2. The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site.

Evidences of recombination in Mycobacterium bovis

Mycobacterium tuberculosis complex is described to have clonally evolved, and most evidences accumulated over the years support the idea that ongoing HGT and recombination events do not occur at detectable levels in the MTBC15,17,18.

Previous works have suggested that there might be limited recombination among MTBC strains20,21, while others were not successful to identify measurable recombination events70,71. To revisit this issue with focus on M. bovis, and unlike previous works that only accounted for M. tuberculosis70,71; or that accounted MTBC as a whole, with few M. bovis representatives20; or that only considered a restrict M. bovis dataset21, in this work a total of 70 strains, with representatives from all clonal complexes, was used to screen for recombination. The dataset was scaled in four cumulative levels: (1) Eu2 members, (2) all European clonal complexes members (i.e. European), (3) both European and African clonal complexes (Eu + Af) and (4) the entire dataset (encompassing the genomes that are not included in any of the clonal complexes already described).

To investigate this postulate further, a split decomposition network was performed to assess for the absence of recombination events between genomes, since this method enables the visualization of ancestral relationships between individuals and displays conflicting phylogenetic signals. The presence of cycles in the network (i.e. regions that do not converge into a single tree), was confirmed in all four datasets under analysis, however none was supported statistically by the Phi test (Eu2, p = 0.0956; European, p = 0.1637; Eu + Af p = 0.2774; entire dataset p = 0.2451), providing poor evidence for the presence of recombination events (Fig. 3A-D).

Figure 3
figure3

Visualization of conflicting phylogenetic signals at unrooted phylogenetic trees by the split decomposition method in European 2 genomes (n = 37) (A), in European genomes (n = 44) (B), in a combination of European and African genomes (n = 51) (C) and in the entire dataset (n = 70) (D).

Following this analysis, and considering the observation of cycles in all networks, the reconstruction algorithm implemented in Gubbins pipeline was applied in order to reconstruct the clonal genealogy and to perform a complementary estimation of the impact of recombination in M. bovis genomes. A cumulative number of recombination events was inferred with the majority occurring in terminal branches (i.e. occurring in a single genome) (Table 2). The metrics showed consistency across the datasets and revealed that recombination events occurred two hundred to three hundred times less frequently than mutations, once the rho/theta parameter that represents the relative rates of recombination and point mutation on a branch presented an average value between 0.0037 and 0.0056 (Table 3). Recently, a published work with 38 M. bovis strains evidenced a higher rho/theta value (rho/theta = 0.1) than the one obtained for this dataset21, however the work by Patané and co-workers used reference-based assemblies to infer recombination parameters, a procedure detail that was already associated with enrichment of putative recombination events at terminal branches due to the assembly procedure70.

Table 2 Number of recombination events inferred by the Gubbins pipeline and RDP4.
Table 3 Recombination metrics obtained through the Gubbins pipeline analysis.

Following, the r/m parameter, which represents the ratio of diversity introduced by recombination and mutation, revealed an average value between 0.025 and 0.037, pointing that recombination has a lower overall effect in M. bovis genetic diversity when comparing with mutation (Table 3). To make a broad comparison, the r/m parameter was estimated using a similar methodology for an MTBC dataset composed by 23 genomes, revealing a mean value of 0.48620, while for the 38 M. bovis dataset of Patané and co-workers21 it evidenced a mean value of 0.98. In the first study there were only two M. bovis (M. bovis BCG and reference strain) within the 23 genomes included in the work, so the obtained value might be biased by the overrepresentation of M. tuberculosis genomes. In the second report, the M. bovis population under analysis was mainly recovered from American countries and livestock hosts. In contrast, in our dataset, a higher number of geographic locations and host species is represented, and genomes grouped into different clonal complexes with distinct population genetic signatures were also used, enabling a deeper and wider population knowledge. The differential r/m average values obtained with our dataset are consistent with the notion that the extent of recombination vary widely among lineages assigned to the same taxonomical species, so these results suggest that M. bovis clonal complexes might exhibit a differential impact of recombination, as also suggested by Didelot & Maiden72. Nevertheless, enlarging significantly this dataset with the inclusion of a higher number of M. bovis genomes would allow further clarification of this point. Both r/m and rho/theta parameters present variability among the tree branches, a result that is in agreement with reports concerning other bacterial species72,73.

Finally, to confirm the recombination events identified by Gubbins pipeline, the different core multi-alignments were also independently tested in RDP4 software with six different algorithms. Globally, less than half of the events identified by Gubbins were confirmed by RDP4 (Tables 4, 5). Considering the entire dataset, three recombination events were confirmed, two involving internal nodes and another one involving a single genome in a terminal branch and for which a clonal complex could not be assigned (Tables 4, 5). The identification of events in terminal branches might be a sign that recombination is still ongoing in contemporary M. bovis strains or the result of misalignment70. In this putative recombination region, circa 20% of positions have an undefined nucleotide (N), which can therefore influence the recombination signal (Supplementary Fig. 2). Moreover, this region affects the rrs gene, encoding the 16S ribosomal RNA that is expected to be highly conserved, so this putative recombination signal could be the result of a sequencing error or wrong alignment. Whole genome alignment between Mb0003 and M. bovis AF2122/97 was thus then performed and the presence of undefined nucleotides and of SNPs was confirmed, so the likely issues related to wrong alignment did not arrive as a consequence of the bioinformatics procedure implemented in this work.

Table 4 Detailed information concerning the recombination events identified by Gubbins and RDP4 in the entire dataset.
Table 5 Statistical values associated with different algorithms implemented in RDP4 for the confirmed recombination events.

No gaps or undefined nucleotides were identified in the recombination regions of internal nodes (Figs. 4, 5). With respect to these events, one encompasses exclusively Eu2 genomes, affecting the pks12 gene that encodes a probable polyketide synthase; while the other one is registered across Eu1 genomes and affects narX gene encoding a probable nitrate reductase (Table 4). Overall, the recombination analysis suggested the presence of a limited number of recombination segments with statistical support, and the inferred metrics indicate a lower effect of recombination on M. bovis genealogy. The recombination signal was expected to be low, however it is important to distinguish true evolutionary signals from background noise, which is a challenging task. In order to decrease the noise signal proposed to be introduced by reference-based assemblies and misalignment issues70,71, with the exception of complete genomes, all the remaining ones were de novo assembled and the quality of assemblies was checked and secured via QUAST pipeline analysis (Supplementary Table 1). Moreover, a series of complementary analyses was performed to provide robustness and accurateness to the overall investigation. Thus, the quality of sequencing of narX and pks12 genes was evaluated by read mapping against M. bovis AF2122/97. The SNP positions suggested in the recombination region were confirmed by applying the criteria referred in the methods section (at least 20 reads and 0.9 frequency of alteration). The polymorphisms at narX gene were fully confirmed in two genomes (Mb1792361 and Mb7240415; 2.3%), as well as in the case of pks12 gene for genomes Mb0891, Mb1711, Mb1789, Mb1870, Mb1758, Mb2043, Mb1960. However, for genome Mb2043, six out of eight positions did not meet the read depth criteria because the SNPs were supported by a maximum of 17 reads that was below the established cut-off of 20. Recombination at this genome spot could thus be confirmed for six genomes (8.6%) (Figs. 4, 5).

Figure 4
figure4

Detailed visualization of alignment in the recombination region of M. bovis dataset affecting the narX gene encoding a probable nitrate reductase. No gaps or undefined nucleotides were identified in the recombination region of internal nodes. This specific event was registered across Eu1 genomes. The quality of sequencing of narX gene was evaluated by read mapping against M. bovis AF2122/97. The SNP positions suggested in the recombination region were confirmed by applying the criteria referred to in the methods section (at least 20 reads and 0.9 frequency of alteration). The polymorphisms at narX gene were fully confirmed in genomes Mb1792361 and Mb7240415 (2.3%).

Figure 5
figure5

Detailed visualization of alignment in the recombination region of M. bovis dataset affecting the pks12 gene. No gaps or undefined nucleotides were identified in the recombination region of internal nodes. With respect to this event affecting the pks12 gene that encodes a probable polyketide synthase, it encompasses exclusively Eu2 genomes. The quality of sequencing of pks12 was evaluated by read mapping against M. bovis AF2122/97. The SNP positions suggested in the recombination region were confirmed by applying the criteria referred to in the methods section (at least 20 reads and 0.9 frequency of alteration). The polymorphisms were fully confirmed for genomes Mb0891, Mb1711, Mb1789, Mb1870, Mb1758, Mb2043, Mb1960.

PE and PPE genes have repetitive regions prone to misreading by Illumina sequencing and mis-mapping and so are commonly removed from the bioinformatics workflow of Mycobacterium tuberculosis members only when a strategy of map to sequence is used. The inference of recombination events applied in this work was based on de novo assemblies for which PE/PPE were not filtered out. We believe that the strategy applied, with the implementation of three different, complementary approaches and algorithms by SplitsTree, Gubbins pipeline and RDP4 software, is robust to deal and filter recombination regions arising from false signals. Nevertheless, to exclude the interference of PE/PPE genes on the identification of SNP clusters by Gubbins and RDP4 software, and consequently on the identification of the recombination regions proposed to affect narX and pks12 genes, the neighbourhood of these genes was inspected (Supplementary Fig. 35). In M. bovis AF2122/97, the narX gene is delimited by narK2 and Mb1764c, while pks12 is surrounded by Mb2075c e Mb2073c (Supplementary Fig. 35). Synteny maps with MAUVE using complete genomes yielded plots providing information about gene order conservation and rearrangements, showing four colinear blocks, without signs of genome translocations or inversions. Furthermore, a complementary analysis with aminoacidic sequences evidenced synteny in all complete genomes and no PE/PPE were identified in the neighbourhood regions of narX or pks12. For narX, one genome (Mb0030) had a lower synteny score, since narX gene is identified in two segments (segment 1891 and 1890). For pks12, Mb0030 and Mb003 present lower synteny scores due to a similar situation, whereas pks12 is identified in two and three segments, respectively, representing different domains of the protein (Supplementary Fig. 35). Considering this information and that both Gubbins and RDP4 software perform an analysis inspecting the core multi-alignment in windows with a maximum of 500 bp, we confirmed that the PE/PPE genes did not interfere with the recombination signals affecting narX and pks12.

Although the recombination signals detected in this dataset may be considered residual, recombination in M. bovis cannot indeed be excluded and should thus continue to be the subject of further analyses for which sequencing of whole genomes from different epidemiological scenarios is crucial.

Comparing the obtained ML phylogenetic trees before and after the recombination correction (Fig. 2A,B) did not lead to significant changes in the inferred phylogenetic relationships, with M. bovis strains being gathered within the same groups.

An evolutionary scenario for M. bovis from a multi-host TB system in Portugal

A SNP alignment containing 1816 polymorphic positions was obtained after mapping reads of 42 newly sequenced M. bovis against the reference genome of M. bovis AF2122/97. The majority of SNPs (87.1%) was located in coding regions and the affected genes were characterized according to functional categories displayed in Bovilist (Fig. 6A,B). After accounting for the total number of genes per functional category, the genes encompassed in “Lipid metabolism” category presented the higher number of SNPs, followed by “Cell wall and cell process” and “Intermediary metabolism and respiration”, revealing their underlying importance in M. bovis evolution.

Figure 6
figure6

Stratified analysis for the M. bovis dataset from Portugal (n = 42). Total number of SNPs and affected genes registered per functional category (A). Total number of synonymous and non-synonymous alterations registered by functional category (B).

Globally, the average dN/dS ratio is superior to 1.5, which suggests a global evolutionary pressure to escape from the ancestral state and representing positive (diversifying or directional) and/or relaxed purifying selection scenarios. In the categories “Virulence, detoxification, adaptation”, “Insertion seqs and phages” and “Regulatory proteins”, over two-thirds of SNPs were non-synonymous (Fig. 6B).

In all categories, there were genes with more than one SNP, leading to an average rate of mutation (i.e. the mean value of SNPs per gene) greater than one (Fig. 6A). The higher mutation values were harboured by pks12 (Mb2074c) with 15 SNPs and fas (Mb2553c) with 8 SNPs. Both genes are involved in fatty acid metabolism. The pks genes encode polyketide synthases (PKS) which are multifunctional enzymes involved in the biosynthesis of mycobacterial cell wall lipids74,75. This gene encodes a multifunctional polypeptide that is involved in the synthesis of mycoketides74,76. The fas gene is involved in the synthesis of mycolic acids. Both genes play an import role in the biosynthesis of the cell wall that is at the interface with the host.

SNP-detailed analysis of HGT and 3R genes

To further study the evolutionary processes within M. bovis, two specific groups of genes were analysed. Previous published works using sequence composition and phylogenetic methods identified genes that were acquired through HGT by the MTBC ancestor before diversification37,38. Those genes are listed in Supplementary Table 2. The SNP distribution was analysed in a total of 77 genes presumably involved in HGT, and 26 polymorphic sites were identified, leading, in the majority of cases (78%), to a non-synonymous (NS) change (Supplementary Table 2). Previous work conducted with MTBC genomes evidenced that putative HGT regions present a higher ratio of NS SNPs when comparing with the rest of the genome20. If one considers that these recombination tracts were acquired by the MTBC ancestor and, thus, they over-represent ancient polymorphisms, then it would be expected a higher fraction of synonymous alterations, since NS substitutions are expected to be eliminated by negative selection, as the changes in amino acid might modify protein function. So, our results suggest that functional consequences may arise from substitutions in HGT-like genes, which remits to their importance on valuable adaptive genetic diversity.

In parallel with this analysis, the genes encoding 3R (DNA repair, replication and recombination) system components were thoroughly examined, following the previous published list by dos Vultos and collaborators (2008)39. The exchanges of identical DNA fragments cannot be directly observed, although it might be a frequent process when involving closely related bacteria, such as in the case of this dataset; plus, this process might be crucial as a DNA repair method72 and thus play a role in homologous recombination. A total of 26 polymorphic positions distributed by 54 genes were identified (Supplementary Table 3). In this group of genes, NS changes account for about 65% of the consequences, which is in agreement with a previous report for M. tuberculosis strains39.

Gene and nucleotide diversity (π) were evaluated for the genes presenting polymorphisms. Gene diversity is a measure of the uniqueness of a particular gene sequence in a population. Average values of 0.256 and 0.226 were obtained for HGT and 3R group genes, respectively. When the value of gene diversity index is zero, all the sequences under analysis are equal. Therefore, the values obtained in this work reveal that there is limited genetic diversity within the selected panel of genes. The nucleotide diversity (π) compares the similarity per site between two nucleotide sequences. When π is superior to 0.003 it can be considered that the group of sequences under analysis is highly diverse. In our analysis, both gene groups reveal an average value inferior to 0.003, with HGT registering 0.00034 and the 3R circa. 0.00021. No gene had a π value higher than 0.003, thus also confirming limited nucleotide diversity within the selected gene panels.

The Tajima’s D test of neutrality was also evaluated, and in both groups there were genes with positive and negative values, evidenced by an average value inferior to zero. The selection against deleterious mutations, past selective sweeps and population expansion after a recent bottleneck are pointed as possible causes to decrease the result from Tajima’s D test.

Balance of forces in M. bovis evolution

Natural selection is a mechanism of evolution and has been associated with MTBC evolution9. Selective sweeps (i.e. positive selection that leads to the fixation of a new beneficial mutation) and background selection (i.e. selection against a deleterious mutation that leads to the elimination of any mutation linked to the target of selection) are both linked to the action of natural selection.

In this work, several evidences support the importance of natural selection: (1) SNP distribution is not random, with genes included in the “lipid metabolism”, “cell wall and cell processes” and “intermediary metabolism and respiration” categories presenting a higher SNP rate; (2) regions proposed to be transferred from MTBC ancestor also accumulate an excess of SNPs; and (3) the HGT and 3R groups evidenced a global average value inferior to zero in the neutrality tests, indicating a past selective sweep or expansion after bottleneck. Furthermore, the high proportion of low-frequency genetic variants, particularly singletons, is one of the features associated with MTBC population genetics, and proposed to reflect the influence of background selection10,77, an effect that is also confirmed in this work, as 372 (20.5%) of the 1816 considered SNPs are strain-specific.

The global elevated value of dN/dS ratio is commonly associated with a positive selection force, likely due to diversifying selection and local selective sweep. However, a reduction in effective population size might have contributed, partially, to this unusual rate of NS per synonymous mutations, once mutations that might have been deleterious in a population with a large effective population size can drift to a high frequency in a small population and, in that way, reflecting reduction in the efficacy of purifying selection as a consequence of increased genetic drift9,10.

The affected genes could confer important adaptive advantages through NS substitutions, however functional studies would be necessary to understand the consequences arising from those SNPs and to infer what would be the benefits for mycobacteria. Recent work performed by Yang and collaborators78 with M. tuberculosis strains suggested that this evolutionary pressure could allow accessory genes (i.e. genes that are not present in all strains or strain-specific genes) to gradually dominate and eventually become core genes (i.e. present in all strains)79. This could provide important adaptive and resistance capacities, if considering that accessory genes might be involved in virulence, immune system evasion or antibiotic resistance.

Therefore, a deeper understanding of the role of these evolutionary forces is required to determine which genes have contributed significantly to M. bovis evolution in its trajectory of interaction with different hosts in specific disease systems.

Read more here: Source link