Accelerating Minimap2 for Accurate Long Read Alignment on GPUs


doi: 10.26502/jbb.2642-91280067.


Epub 2023 Jan 20.

Affiliations

Free PMC article

Item in Clipboard

Harisankar Sadasivan et al.


J Biotechnol Biomed.


2023.

Free PMC article

Abstract

Long read sequencing technology is becoming increasingly popular for Precision Medicine applications like Whole Genome Sequencing (WGS) and microbial abundance estimation. Minimap2 is the state-of-the-art aligner and mapper used by the leading long read sequencing technologies, today. However, Minimap2 on CPUs is very slow for long noisy reads. ~60-70% of the run-time on a CPU comes from the highly sequential chaining step in Minimap2. On the other hand, most Point-of-Care computational workflows in long read sequencing use Graphics Processing Units (GPUs). We present minimap2-accelerated (mm2-ax), a heterogeneous design for sequence mapping and alignment where minimap2’s compute intensive chaining step is sped up on the GPU and demonstrate its time and cost benefits. We extract better intra-read parallelism from chaining without losing mapping accuracy by forward transforming Minimap2’s chaining algorithm. Moreover, we better utilize the high memory available on modern cloud instances apart from better workload balancing, data locality and minimal branch divergence on the GPU. We show mm2-ax on an NVIDIA A100 GPU improves the chaining step with 5.41 – 2.57X speedup and 4.07 – 1.93X speedup : costup over the fastest version of Minimap2, mm2-fast, benchmarked on a Google Cloud Platform instance of 30 SIMD cores.


Keywords:

Chaining; GPU; Minimap2; Nanopore; Sequence alignment.

Figures


Figure 1:



Figure 1:

Minimap2 operates in 3 main steps: seeding, chaining and base-level alignment. Our optimizations to chaining are shown in blue box. Boxes with green fill show chaining sub-tasks which we perform on the GPU instead of CPU.


Figure 2:



Figure 2:

Chaining explained. (a) In Minimap2, every current anchor (A2 (r2, q2, l2) in this case) attempts to sequentially chain its predecessors within a pre-calculated predecessor range. If the chaining score with a predecessor is greater than the score value stored at current anchor A2, the new chain score and index of the predecessor is updated at A2 (in the direction of the arrow). (b) The chain score with a predecessor is computed from anchor gap cost (evaluated as a function of reference_gap, query gap and average length of all anchors) and overlap cost.


Figure 3:



Figure 3:

Summary of approximate time spent in seed-chain-align. mm2 takes longer to map long noisy ONT reads and spends a greater percent of total mapping time in chaining. X-axis shows the sequencing technology with mean read length of each sets of 100K randomly sub-sampled reads.


Figure 4:



Figure 4:

Forward transforming predecessor range selection to successor range selection: The cell with solid black outline represents the current anchor for which predecessor/successor range calculation is performed. The arrow starts from the predecessor/successor and points to the current anchor A3 whose range is updated sequentially.


Figure 5:



Figure 5:

Parallelizing Minimap2’s chain score generation (shown in a) by forward transformation (shown in b). Additionally, we retain mapping accuracy by modifying the score comparison check (> to >=) with all anchors except the immediate neighbor to enable farther anchors to take precedence over neighboring anchors to be forward chained.


Figure 6:



Figure 6:

Workload is sparse and irregular. (a) ~67% of anchors fed to the chaining step do not start a chain. (b) Predecessor range is less than or equal to 16 for ~92% of all anchors and goes as high as 5000 only for a small fraction of total anchors.


Figure 7:



Figure 7:

(a) mm2-ax yields 5.41 – 2.57X speedup and 4.07 – 1.93X speedup : costup over SIMD-vectorized mm2-fast baseline. (b) The chaining performance across various read lengths may be further improved by 1.3-2.3X if we can engineer to hide the data transfer related costs.


Figure 8:



Figure 8:

mm2-ax is memory bound. (a)Roofline Plot: Chain score generation is memory bound. Operating point is shown in a red cross mark. (b) Theoretical warp occupancy on the GPU is bounded by the number of registers used by each thread.

Similar articles

References

    1. Mantere T, Kersten S, Hoischen A. Long-read sequencing emerging in medical genetics. Front. genetics 10 (2019): 426.



      PMC



      PubMed

    1. Gorzynski JE, Sneha D Goenka, Shafin BS, et al. Ultrarapid nanopore genome sequencing in a critical care setting. The New Engl journal medicine 386 (2022): 700–702.



      PubMed

    1. Amarasinghe SL, Su S, Dong X, et al. Opportunities and challenges in long-read sequencing data analysis. Genome biology 21 (2020): 1–16.



      PMC



      PubMed

    1. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J. molecular biology 215 (1990): 403–410.



      PubMed

    1. ONT products. Oxford Nanopore Technologies (2022).

Read more here: Source link