Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage | BMC Bioinformatics

MSA is a parameter-free error correction method whose performance is determined by the sequence copies used (or sequencing depth). We encode (00-A, 01-T, 10-G, 11-C) a text file named “The Grandmother” into 140 DNA sequences of 120 bases (8 bases for index and 112 bases for data). The error rate p ranges from 1 to 40% and sequencing depth is at most 4000. While there are specific patterns where error occurrence is more likely, for simplicity, we assume the location and type of base errors introduced in synthesis and sequencing are random. Hence, the IDS errors are distributed uniformly at random. For simulation experiments, we assume that all reads have been clustered correctly and we focus on the reconstruction problem of these clustered sequences using MAFFT. For each error rate and sequencing depth, the average results are obtained from 100 repeated experiments.

The phase transition of the error correction capability

Results indicate that MAFFT’s error correction capability exhibits a phase transition. Figure 1 shows two phase transitions of the average recovery accuracy at two I:D:S ratios (1:1:1(a) and 1:1:2(b)), and the critical value is at about 20% error. When the error rate is less than the critical value, increasing sequencing depth may improve the average recovery rate (see the lower red region in Fig. 1). However, the needed sequencing depth dramatically increases as the error rate increases. On the contrary, the recovery accuracy drops steeply to even less than 50% in most cases when the error rate is larger than the critical value. And increasing sequencing depth could not improve the recovery accuracy (see the upper blue region in Fig. 1). This indicates that the synchronization errors have damaged the conserved structure among multiple sequences. Therefore, MSA could not capture any useful information from these sequences. These observations indicate that the limit of MAFFT is less than 20%.

Fig. 1
figure 1

The phase transition of the error correction capability of MAFFT. a I:D:S ratio being 1:1:1. b I:D:S ratio being 1:1:2. The values in the grids denote the average recovery accuracy at each error rate and sequencing depth

In addition, the average recovery rate is related to the IDS ratio. Under the same conditions in Fig. 1a, b, we can see that the average recovery accuracy at I:D:S ratio 1:1:1 is lower than that of I:D:S ratio 1:1:2. This is coincident with our intuition that assuming a successful synthesis, the more insertions/deletions (indels) in the sequenced reads, the harder it is to solve the synchronization problem.

The error correction capability

Although MAFFT has the potential to completely recover data at error rates < 20%, it is highly impractical in terms of reading speed and cost to sequence thousands of reads for just one sequence. Because of the issue of sequence loss in the sampling process, sequencing depth between 50 and 100 seems to be a reasonable range for large-scale application [6, 8]. In this section, we further conduct a detailed study on the capability of MAFFT below error rate 15%. Figure 2a shows the average recovery accuracy at error rate 1%-15% and the corresponding sequencing depth 5–500. Generally, MAFFT could correct more than 95% of the errors at sequencing depth 100 when the error rate is below 15%.

Fig. 2
figure 2

The error correction capability of MAFFT. a Recovery accuracy at each error rate and sequencing depth. b Minimum required sequencing depth for recovery accuracy 90% (green), 95% (blue), 99% (orange), and 100% (red). c The observed error correcting process of a sequence with 12% errors

Based on the error correction capability of MAFFT, we divide the error range into three regimes: the low (p ≤ 8%), medium (8% < p ≤ 15%) and high error regime (p > 15%). Here we focus on the low and medium regimes where it excels.

In the low error regime, MAFFT can completely recover the data. The red curve in Fig. 2b shows the required sequencing depth for 100% recovery accuracy. In this regime, it increases approximately linearly as the error rate increases. At p = 8%, MAFFT only needs 70 reads. Since we only used the simple quaternary encoding and constraints, this means that MAFFT can achieve complete recovery with reasonably little copies and minimal loss in logical density. Therefore, MSA may be a crucial part of the optimal solution to a reliable, fast, and low-cost DNA storage application in the low error regime.

In the medium error regime, MAFFT can correct more than 90% of the errors with a modest sequencing depth (≤ 100). The various curves in Fig. 2b show that, at a given error rate, the needed sequencing depth to reach higher accuracy increases dramatically. For example, at error rate 13%, to improve the accuracy from 90 to 95%, the needed sequence only increases from 30 to 35. However, this number increases from 35 to 60 in order to improve the accuracy from 95 to 99%. Figure 2c shows the error correcting process of a sequence with 12% errors. We observe that a significant portion of errors are still synchronization errors even with increased sequencing depth. This suggests that increasing sequencing depth beyond 60 seems to have negligible effect on correcting the few remaining indels. At this point, gaining higher accuracy at the expense of sequencing depth and reading speed may not be an efficient strategy. Therefore, it would be a wise alternative to combine MSA with other means, such as markers, ECC, or constraints on encoding. We could let MSA correct the vast majority of errors under a feasible sequencing depth, and then let other methods handle the rest. For example, we could let MSA handle 99% and 95% of the errors at p = 12% and 14%, where the needed sequencing depth is about 50. As most of these errors are solved by MSA, the added redundancy will be significantly reduced and thus still achieve a relatively high logical density approaching 2 bits/nt.

Impact of clustering accuracy

The error correction capability of MAFFT is robust to imperfect clustering. In our simulation experiment, each cluster has 100 reads and we experiment with 0%, 20%, and 40% of these reads being noisy sequences, which correspond to cluster accuracy of 1, 0.8, and 0.6 in Fig. 3, respectively. Additionally, Fig. 3 shows the average recovery accuracy at 10%, 12%, and 14% error rates after repeating the experiment 500 times. Although the average recovery accuracy slightly decreased when we added noisy reads, they are all still greater than 0.95 even when the clustering accuracy is as low as 0.6. This may be explained by the post-alignment voting process’s ability to retain the conserved information and filter out most of the noises. That is, MSA could maximally utilize the reads’ information, even if they contain some noises.

Fig. 3
figure 3

Recovery accuracy with respect to clustering accuracy at cluster size 100

Decoding performance on a real dataset

To test the ability of MAFFT on real data, we download a real dataset (github.com/microsoft/clustered-nanopore-reads-dataset) published by Microsoft group [30]. It contains 10,000 DNA sequences of length 110 synthesized by Twist Bioscience and amplified using polymerase chain reaction. The final clustered file contains 269,709 noisy nanopore sequencing reads. In order to correct errors, they added some marker repeat (MR) codes in the original binary data and then translated them into DNA sequences according to a quaternary code (00-A, 01-T, 10-G, 11-C). It is estimated that the percentage of insertions, deletions, and substitutions are roughly about 1.7%, 2%, and 2.2%, respectively.

MAFFT could correct most of the errors in these sequences with only modest sequencing depth. Figure 4a shows the average normalized Hamming distance with different sequencing depths, and Fig. 4b shows the normalized Hamming distance distribution at sequencing depth 20. The leftmost bar in Fig. 4b indicates that MAFFT could completely recover about 92% of these sequences. Although the normalized Hamming distance may range from 0 to 0.8, most errors are caused by only a few indels. On the other hand, there are still some sequences which could not be completely recovered even with increased sequencing depth. Figure 4c, d shows two typical alignment errors. The first one is caused by a deletion of base ‘G’ in most reads. The second one is caused by misaligning some base C to the next position. Obviously, the former could not be solved because a base deletion in most reads would be irrecoverable by MSA, while the latter may be improved by increasing sequencing depth. In addition, we find that most of these errors usually occur when there are homopolymers, such as the example in Fig. 4c, which are highly prevalent in this dataset. It is well-known that homopolymers tend to introduce more IDS errors. And it seems that MSA has more trouble dealing with errors introduced by homopolymers, perhaps due to the nature of the algorithm. Therefore, constraining the length of homopolymers may reduce such errors for MSA.

Fig. 4
figure 4

Recovery performance on a real nanopore sequencing dataset. a The average normalized Hamming distance at sequencing depth 5, 10, 15, and 20. b The distribution of normalized Hamming distance at sequencing depth 20. c An uncorrected example caused by base deletion in most reads. d An uncorrected example caused by misalignment

Read more here: Source link