Detailed population structure of L1–4 and a hierarchical sub-lineage naming system
We assembled a high-quality data set of whole genomes, antibiotic resistance phenotypes, and geographic sites of isolation for 9584 clinical Mtb samples (“Methods” section and Supplementary Data 1). Of the total, 4939 (52%) were pan-susceptible, i.e., susceptible to at least isoniazid and rifampicin (and all other antibiotics when additional phenotypic data were available), and 4645 (48%) were resistant to one or more antibiotics (Supplementary Fig. 1a). Using the 62 SNS lineage barcode6, 738 isolates were classified as L1 (8%), 2193 as L2 (22%), 1104 as L3 (12%) and 5549 as L4 (58%, Supplementary Fig. 1b). Among the 4939 pan-susceptible isolates, we identified high-quality genome-wide SNSs (83,735 for L1, 56,736 for L2, 76,817 for L3, and 185,622 for L4) that we used in building maximum-likelihood phylogenies for each major lineage (L1–4, “Methods” section). We computed an index of genetic divergence (FST) between groups defined by each bifurcation in each phylogeny. Sub-lineages were defined as monophyletic groups that had high FST (>0.33) and were also clearly separated from other groups in principal component analysis (PCA, see “Methods” section). We also defined internal groups to sub-lineages (see “Methods” section): an internal group is a monophyletic group genetically divergent (by FST and PCA) from its neighboring groups, but has one or more ancestral branches that show a low degree of divergence or low support (bootstrap values). Internal groups do not represent true sub-lineages in a hierarchical fashion, but defining them allows us to further characterize the Mtb population structure. We provide code to automate all the steps described above. Our approach is scalable and can be used on other organisms (see “Methods” section).
To better classify Mtb isolates in the context of the global Mtb population structure, we developed a hierarchical sub-lineage naming scheme (Supplementary Data 2) where each subdivision in the classification corresponds to a split in the phylogenetic tree of each major Mtb lineage. Starting with the global Mtb lineage numbers (e.g., L1), we recursively introduced a subdivision (e.g., from 1.2 to 1.2.1 and 1.2.2) at each bifurcation of the phylogenetic tree whenever both subclades sufficiently diverged. Formally, we defined these splits using bootstrap criteria, and independent validations by FST and PCA (see “Methods” section). Internal groups were denoted with the letter “i” (e.g., 4.1.i1). This proposed system overcomes two major shortcomings of the existing schemas: same-level sub-lineages are never overlapping (unlike the system of Stucki et al.8 sub-lineage 4.10 includes sub-lineages 4.7–4.9), and the names reflect both phylogenetic relationships and genetic similarity (unlike semantic naming such as the “Asia ancestral” lineage in the system of Shitikov et al.7). Further, this naming system can be standardized to automate the process of lineage definition. These advantages come at the price of long sub-lineage names in the case of complex phylogenies (e.g., for L4, sub-lineage 4.10 gets the lineage designation 184.108.40.206.220.127.116.11). For compatibility with naming conventions already in use and to keep names as short as possible, we designed a second, shorthand, naming system that expands the Coll et al. lineage schema by adding new subdivisions and differentiating between sub-lineages and internal groups. For instance, sub-lineage 4.3.1 is designated as 4.3.i1, informing the user that this is an internal group of sub-lineage 4.3. To simplify the use of the hierarchical naming schema and the updated shorthand schema, we provide a table that compares them side by side along with naming systems currently in use (Supplementary Data 2).
Using the sub-lineage definition rules and the sub-lineage naming scheme described above, we characterized six previously undescribed sub-lineages of L1 (Fig. 1 and Supplementary Fig. 2); five of which expand the current description of 1.2. We also detected an internal group of 91 isolates (1.1.3.i1) characterized by a long defining branch in the phylogeny (corresponding to 82 SNSs), a high FST (0.48), and geographically restricted to Malawi (85/91, 93% isolates, Fig. 1 and Supplementary Fig. 3). We estimated the date of the emergence of the MRCA of such a group (see “Methods” section) and we found it to be between 1497 and 1754. We found four previously undescribed sub-lineages of L3 (Fig. 2 and Supplementary Fig. 4), revising L3 into four main groups, whereas previously only two partitions of one sub-lineage were characterized (3.1). We found that the latter two partitions are in fact internal groups of the largest sub-lineage (3.1.1) in our revised classification.
L2 is divided into two groups: proto-Beijing and Beijing with the latter in turn partitioned into two groups: ancient- and modern-Beijing7. Each one of these groups is characterized by further subdivisions (three for the ancient-Beijing group and seven for the modern-Beijing group; see Supplementary Fig. 4). We found a new sub-lineage (18.104.22.168, Fig. 3, and Supplementary Fig. 5) within the previously characterized ancient-Beijing group. However, genetic diversity within the modern-Beijing group (22.214.171.124.1) was lower than in the other L2 sub-lineages and the tree topology and FST calculations did not support further hierarchical subdivisions. Although we did find three internal groups of modern-Beijing: two undescribed and one that corresponds to the Central Asia group7. For L4, our results support a complex population structure with 21 sub-lineages and 15 internal groups. In particular, we found 11 previously undescribed sub-lineages and 5 internal groups that expand our understanding of previously characterized sub-lineages (e.g., 4.2.2; 4.2 in the Coll et al. classification) or that were not characterized since these isolates were simply classified as L4 (e.g., 4.11, Fig. 4, and Supplementary Fig. 6) using the other barcodes.
A new barcode to define L1–4 Mtb sub-lineages and a software package to type Mtb strains from WGS data
We defined a SNS barcode for distinguishing the obtained sub-lineages (Supplementary Data 3). We characterized new synonymous SNSs found in 100% of isolates from a given sub-lineage, but not in other isolates from the same major lineage, compiling 95 SNSs into an expanded barcode (Supplementary Data 3). We validated the barcode by using it to call sub-lineages in the hold-out set of 4645 resistant isolates and comparing the resulting sub-lineage designations with maximum-likelihood phylogenies inferred from the full SNS data (Supplementary Figs. 7–10). A sub-lineage was validated if it was found in the hold-out data and formed a monophyletic group in the phylogeny. Considering the “recent” sub-lineages, i.e., the most detailed level of classification in our system, we were able to validate eight out of nine L1 sub-lineages including five out of six of the new sub-lineages described here, with the exception of 126.96.36.199. We validated all four new L3 sub-lineages, all five L2 sub-lineages including the one previously undescribed, and 16 of the 21 L4 sub-lineages including two described here. The sub-lineages we could not confirm were not represented by any isolate in the validation phylogenies. We did not observe any paraphyletic sub-lineages in the revised classification system.
We developed fast-lineage-caller, a software tool that classifies Mtb genomes using the SNS barcode proposed above. For a given genome, it returns the corresponding sub-lineage as output using our hierarchical naming system in addition to four other existing numerical/semantic naming systems, when applicable (see “Methods” section). The tool also informs the user on how many SNSs support a given lineage call and allows for filtering of low-quality variants. The tool is generalizable and can manage additional barcodes defined by the user to type the core genome of potentially any bacterial species.
Geographic distribution of the Mtb sub-lineages
Next, we examined whether certain sub-lineages were geographically restricted, which would support the Mtb-human co-evolution hypothesis, or whether they constituted prevalent circulating sub-lineages in several different countries (i.e., geographically unrestricted)8. We used our SNS barcode to determine the sub-lineages of 17,432 isolates (see “Methods” section) sampled from 74 countries (Supplementary Fig. 11 and Supplementary Data 4, 5). We computed the Simpson diversity index (Sdi) as a measure of geographic diversity that controls for variable sub-lineage frequency (see “Methods” section) for each well-represented sub-lineage or internal group (n > 20). We hypothesized that geographically unrestricted lineages would have a higher Sdi. We found Sdi to correlate highly (⍴ = 0.68; p-value = 5.7 × 10−7) with the number of continents from which a given sub-lineage was isolated (Supplementary Fig. 12). The Sdi ranged between a minimum of 0.05 and a maximum of 0.72, with a median value of 0.46 (Fig. 5). The known geographically restricted sub-lineages8 had an Sdi between 0.28 and 0.5 (Fig. 5 and Supplementary Table 1), while the known geographically unrestricted sub-lineages8,9 had an Sdi between 0.55 and 0.61 (Fig. 5 and Supplementary Table 2). We found 11 sub-lineages/internal groups with Sdi <0.28 (Supplementary Table 3), and 11 sub-lineages/internal groups with Sdi >0.61 (Supplementary Table 4), i.e., more extreme than previously reported geographically restricted or unrestricted sub-lineages, respectively.
While the currently known geographically restricted sub-lineages are all in L4, we found evidence of geographic restriction for two sub-lineages/internal groups of L1. The first, the L1 internal group 1.1.3.i1, showed a very low Sdi (0.06) and was only found at high frequency among the circulating L1 isolates in Malawi (Fig. 6). This finding is also in agreement with the L1 phylogeny (Fig. 1) that shows a relatively long (82 SNS) branch defining this group. The second geographically restricted L1 sub-lineage is 188.8.131.52 (Sdi = 0.12) that was only found at high frequency among circulating L1 isolates in South-East Asia (Vietnam and Thailand, Fig. 7). To exclude the possibility that these two groups appeared geographically restricted as a result of oversampling transmission outbreaks, we calculated the distribution of the pairwise SNS distance for each of these two sub-lineages. We measured a median SNS distance of 204 and 401, respectively, refuting this kind of sampling error for these groups (typical pairwise SNS distance in outbreaks <15–20 SNS24) (Supplementary Fig. 13).
The geographically unrestricted sub-lineages spanned several recognized generalist sub-lineages or internal groups (e.g., 4.1.i184.108.40.206, 4.3, 4.10, and 220.127.116.11.1, which correspond to 4.1.2/Haarlem, 4.3/LAM, 4.10/PGG3, modern-Beijing in the Stucki et al.8 or the Shitikov et al.7 classification). In addition, we found a candidate geographically unrestricted sub-lineage of L1 (1.1.2). L1.1.2 spanned 253 isolates from 7 countries and 4 continents and its Sdi was 0.61 (Fig. 8). Overall we found a low, but significant correlation between Sdi and the number of isolates sampled from each sub-lineage (⍴ = 0.34; p-value = 0.03, Supplementary Fig. 14) indicating that the most prevalent sub-lineages were more likely to be geographically unrestricted.
To validate the geographic distribution of sub-lineages, we curated a data set of 3791 Mtb isolates sampled to reliably represent the entire population of TB patients in five countries (Azerbaijan, Bangladesh, Pakistan, South Africa, and Ukraine, Supplementary Data 6, Zignol et al. data set). In this data set, we were able to identify 9 of the 30 new groups/sub-lineages described above. L1 and 1.1.2 isolates were found in three out of five countries (Azerbaijan, Pakistan, and Ukraine) spanning two continents2 (Supplementary Figs. 15 and 16) supporting the hypothesis that L1.1.2 is a geographically unrestricted L1 sub-lineage. The candidate geographically restricted group 1.1.3.i1 from Malawi, was only additionally found in South Africa (Supplementary Fig. 16). L2 isolates were found in all five countries with 98% belonging to the modern-Beijing group (Supplementary Figs. 15 and 17). The modern-Beijing internal group 18.104.22.168.1.i3, corresponding to the Central Asia group7, was the most prevalent L2 group in Ukraine and Azerbaijan and the second most prevalent group in Pakistan. This confirms 22.214.171.124.1.i3’s observed high Sdi (0.66) in the convenience training data and is in line with modern-Beijing’s transmissibility (next paragraph and ref. 21). L3 isolates were found in four (Azerbaijan, Bangladesh, Pakistan, South Africa) of the five countries (Supplementary Fig. 15). Sub-lineage 3.1.1 was the most prevalent sub-lineage (Supplementary Fig. 18), but the two new L3 sub-lineages we describe (3.2.1 and 3.1.2) were also observed at low frequency (2.4-4.8%) in Bangladesh and Pakistan. L4 isolates and most commonly 4.1, 4.3, and 4.10 (that correspond to 4.1, 4.3/LAM, and 4.10/PGG in the Stucki et al.8 classification; Supplementary Fig. 19) were found in all five countries in line with their Sdi >0.67 and results on L4 transmissibility below.
Differences in transmissibility between the Mtb global lineages
The observation that some lineages/sub-lineages are more geographically widespread than others raises the question of whether this results from differences in marginal transmissibility across human populations. On a topological level, we observed L2 and L3 phylogenies to be qualitatively different from those of L1 and L4 (Figs. 1–4): displaying a star-like pattern with shorter internal branches and longer branches near the termini. We confirmed this quantitatively by generating a single phylogenetic tree for all 9584 L1–4 isolates and plotting cumulative branch lengths from root to tip for each main lineage (Supplementary Fig. 20). Star-like topologies have been postulated to associate with rapid or effective viral or bacterial transmission e.g., a “super-spreading” event in outbreak contexts25. To compare transmissibility between the four lineages, we compared the distributions of terminal branch lengths expecting a skew toward shorter terminal branch lengths supporting the idea of higher transmissibility. We found L4 to have the shortest median terminal branch length, followed in order by L2, L3, and L1 (medians: 6.2 × 10−5, 8.2 × 10−5, 10.2 × 10−5, 17.5 × 10−5, respectively; all pairwise two-sided Wilcoxon rank-sum tests significant p-value < 0.001; Fig. 9). Shorter internal node-to-tip distance is a second phylogenetic correlate of transmissibility; the distribution of this measure across the four lineages revealed a similar hierarchy to the terminal branch length distribution (Supplementary Fig. 21). We also computed the cumulative distribution of isolates separated by increasing total pairwise SNS distance (Supplementary Fig. 22). The proportion of L4 isolates separated by <5 SNS was highest and followed by L2, L3, and L1, respectively. Thus, despite the topological similarities between the L3 and L2 trees, L3 was measured to be less transmissible by interrogating the topology of the trees.
To confirm that the measured transmissibility differences are not due to sampling bias in the source data, we compared the distributions of terminal branch lengths for the four major lineages using the susceptible isolates only (n = 4939/9584; data set with curated phenotypes) and using the Zignol et al. data set, where the isolates have been randomly sampled in five countries. In both cases, we found that L4 and L2 have the shortest median terminal branch lengths and L1 the longest. In the Zignol et al. data set, we also found that L4 and L2 terminal branch lengths were shorter than L3’s (Supplementary Information, Supplementary Figs. 23 and 24).
Read more here: Source link