Here we present the ProPIP software, which implements our originally published progressive MSA inference method based on PIP [7], and also introduces new features, such as stochastic backtracking and parallelisation (as described below). According to the PIP model, insertions are Poissonian events on a phylogeny that add single characters to a sequence. Once inserted, a character evolves via a continuous time Markov process of substitutions and deletions along the phylogeny relating the sequences. The intensity of insertions and deletions is parameterized by two rates that determine the type of homology and consequently the gap pattern in the final alignment. By modifying these parameters different homology hypotheses can be compared in a model-based framework. Thus, instead of the traditional gap penalties (which are typically set arbitrarily), the parameters used by ProPIP are the insertion and deletion rates, which have biological interpretation and are contextualized in a probabilistic environment.
Algorithm scheme. At each internal node of the guide tree an instance of the algorithm aligns the homology paths of the two sub-alignments present in the children nodes. Each instance of the aligner requires 3 sparse three-dimensional DP matrices for Match, GapX and GapY (shown in the figure). These matrices contain the column by column likelihood of the 3 respective states. The state, among the 3 possible choices, with the highest likelihood is saved in a fourth traceback matrix (not shown). At the end, the final MSA is generated by traversing the traceback matrix backwards following the path that generates the highest likelihood at each step. GapX and GapY states require a marginalization of multiple possible homology paths. Some of these are highlighted in the figure. The existence of a character is illustrated by the coloured strokes on the trees
ProPIP can align both nucleotide and protein sequences. The overall complexity of our progressive algorithm is (mathcal {O}(Nl^3)), for N taxa and an average input sequence length l. Further running time reductions are possible. For example, recently we proposed a strategy to accelerate alignment inference by trimming the original DP matrix [8]. The method is implemented in the frequentist framework, where log-likelihood scores under PIP are used as an optimality criterion. In a progressive fashion, ProPIP traverses a guide tree phylogeny from the leaves towards the root according to one of the two different modes: (1) using the Dynamic Programming (DP), and (2) using the Stochastic Backtracking version (see also Fig. 1. These are briefly described below.
Dynamic programming
ProPIP proceeds progressively from leaves towards the root of a guide tree. By default, at each internal node the algorithm aligns the evolutionary histories in the left and right subtrees by full maximum likelihood (ML) using DP, to obtain the homology history at the current node. More specifically, at each node the likelihood computation marginalizes over all possible indel and substitution scenarios given the sub-alignments obtained for the child nodes in the previous steps of the progressive algorithm. This includes homology histories where all characters have been deleted, i.e. unobserved or “empty” columns. In a given node, any two MSA columns from the child nodes can be aligned in three ways: either they matched, or any of the two columns is aligned with a column full of gaps. Each of these three states can in turn imply a number of scenarios which, depending on the depth of the tree and the number of gaps present, can also be large. The algorithm computes the likelihood for each of the three scenarios, In particular, we consider all possible places where a character may have been inserted along the phylogeny and all possible points where it may have been deleted. All these homology paths are listed and marginalised into a single likelihood value, without having to make a choice of one scenario (e.g. based on parsimony or ML).
Our DP is locally optimal, i.e. in each internal node the two sub-alignments are aligned by full ML. Progressive application of DP, however, does not lead to a globally optimal solution. To overcome this greedy behaviour, we have enhanced our method with SB – stochastic backtracking [9], adapted to the PIP model.
Stochastic backtracking
SB provides an ensemble of sub-optimal candidate solutions, distributed according to their individual probabilities. During progressive alignment with the SB option, SB is applied at each internal node. Instead of aligning only the optimal histories at the two children, the aligner generates an ensemble (e.g. alignment.sb_solutions=4) of histories combining samples from the distributions at the children nodes. Therefore, the SB version of the algorithm reduces the chances to be trapped in local optima produced by the greedy nature of the default progressive DP.
SB is parameterised by a temperature T (e.g. alignment.sb_temperature=0.8), which tunes the deviation from the optimal alignment. For (T = 0) SB returns the optimal alignment, falling back to classical DP. By setting (T rightarrow infty), each alignment becomes equiprobable and the solution is therefore random. In the range (0< T < infty) the parameter controls the deviation from the optimal alignment allowing, gradually, the generation of sub-optimal alignments.
Substitution models
ProPIP can align either nucleotide (alphabet=DNA) or amino acid (alphabet=Protein) sequences, based on different substitution models available in the Bio++ library [10]. Among these are the nucleotide models are JC69, K80, HKY85, and GTR, and the amino acid models JTT, WAG, and LG. All models are extended with PIP. For a complete list of the substitution models available see the Bio++ documentation.
In addition, users can choose to account for Across-Site Rate Variation (ASRV), which is implemented as a discretised (Gamma) distribution (default), or alternatively as exponential or Gaussian distributions, with user-defined number of discrete categories.
Initial tree and indel rate inference
Providing a reasonable initial guide tree [11] and indel rates helps to make the MSA inference more accurate. These can be provided by the user when known. If the guide tree is not provided then ProPIP first computes a distance matrix from the pairwise alignments which is then used to infer a guide tree as a rooted BioNJ tree [10, 12].
The same applies to indel rates (insertion rate and deletion rate), which are inferred from the data when not provided by the user. We compute the initial indel rate values of the PIP model from pairwise alignments using the Needleman-Wunsch algorithm with gap opening and extension penalties for nucleotide sequences and a Grantham distance-based scoring method for amino acids [13]. The indel rates are calculated from the pairwise alignments as follows. The phylogeny and indel rate parameter values imply expectations on the number of gap/non-gap states (or gap patterns) in alignments. Each position in a pairwise alignment belongs to one of three possible patterns: either no gap is present, or a gap is present in one of the two sequences. Given that we need to estimate two parameters ((lambda) and (mu)) this leads to an overdetermined system of equations. We solve this system for each pairwise alignment using a non-linear least-squares algorithm [14, 15]. Then we take an average over all estimates to obtain the indel rates for the progressive alignment.
Finally, the various indel rates are averaged to obtain the initial insertion and deletion rate. The estimated indel rates eventually determine the resulting MSA gap pattern.
Parameter optimization
ProPIP allows the optimisation of model parameters, such as indel rates or the instantaneous substitution rates between characters. These features are inherited from Bio++ libraries. When requesting parameter optimisation, the system automatically instantiates the appropriate OptimizationTools class object. As input, this object receives a pointer to the likelihood function, which can be evaluated under PIP if the user wishes to invoke this evolutionary model. It is also possible to specify the maximum number of iterations or a tolerance value at which the optimisation ends. The user can monitor the optimisation progress and the final values in the two files “profiler” and “messenger”. Among the various Bio++ functions that ProPIP couples with the PIP model are the Brent and BFGS optimisation routines. In both cases the method optimises all parameters until convergence, respecting the requested thresholds or the maximum number of steps. If, on the other hand, the user desires to fix parameters at given values, this can be specified via the “None” optimisation option.
The syntax is the following: optimisation=ND-Brent(derivatives=Brent,nstep=1000). It is also possible to specify which parameters to ignore, for example if the user wants to optimise the insertion rate (lambda) and the deletion rate (mu) but not (kappa) of the K80 substitution model then the following should be specified: optimisation.ignore_parameter = K80.kappa. For more details see the wikipages on our github website and the Bio++ manual.
Parallelization
To reduce the computational time, ProPIP was parallelised. We use the open source version of Intel Thread Building Blocks library available at github.com/oneapi-src/oneTBB, which can be activated by the user (see documentation). The following parallelisation options are provided:
parallel_for: In this option, for-loops have been rewritten to exploit tbb::parallel_for loops provided by Intel TBB. This loop instruction allows to split the looping range into smaller chunks that are then executed in parallel by TBB’s tasks. This approach has been applied to the vector and matrix initialization loops and to the actual dynamic programming forward phase, where the likelihood matrices are computed and the maximum likelihood score is sought. To preserve the optimisation algorithms in the parallel execution context, the local optimum comparison and variable update have been protected using a locking mechanism (tbb::mutex). The necessity of this lock clearly influences and limits the achievable parallelism of this approach.
TBB Task: The TBB Task optimization provides more flexibility during the parallelization. Instead of parallelizing the internal loops this approach focuses to parallelize each node. The tree topology is processed in a post-order traversal, from the leaves towards the root of the tree. Starting from the root node each node creates 2 tasks (1 for each child node) and executes them in parallel before doing its own processing. This recursive process is repeated until the leafs of the binary tree are reached and the actual execution begins. Each node is executed as a separate parallel task, which leads to a more dynamic parallelism. Table 1 shows the speed-up values for some n-taxa trees and as the number of columns to be aligned increases. The speed-up factor improves when increasing the number of taxa while it remains constant when augmenting the number of columns.
Thread control: to have a better control of the parallel execution environment it is possible to limit the number of threads the TBB library will create and use to execute the parallel tasks. Finally, also thread pinning can be enabled, which allows to specify the CPUs the threads will be assigned to.
Read more here: Source link