SamQL: a structured query language and filtering tool for the SAM/BAM file format | BMC Bioinformatics

Our primary aim building SamQL was flexibility and high expressivity for complex queries, similar to classic SQL. Table 1 compares the expressivity of SamQL for a relatively complex query against other widely used tools such as SAMtools, Sambamba, and naive Bash. SamQL maintains consistency on complex queries involving coordinates. However, Sambamba and samtools separate the range query from the main filtering criteria. This leads to less consistent syntax as highlighted in the example in Table 1.

Table 1 Comparison of syntax for a reasonably complex query to keep reads on chromosome 2, mapping at position greater than 1,000,000, on the reverse strand, with mapping quality higher than 30

SamQL is a complete query language with a lexer and parser specifically designed for genomic data in the SAM/BAM format. This enables SamQL to output more informative messages for potential syntax errors. Table 2 shows a specific example for a simple syntax error involving a misplaced closing parenthesis. SamQL returns information about the specific type of the error and the position at which the error occurred. In contrast, neither samtools nor sambamba provide such information.

Table 2 Comparison of the error messages of the different tools for a syntax error involving a misplaced parenthesis

Besides expressivity we also wished to test the computational performance of SamQL. We compared SamQL against other widely used tools; SAMtools, Sambamba, BamTools and naive Bash, essentially approaching daily used workflows by bioinformaticians. We tested on three different queries, each one with varying levels of syntax complexity and computational requirements. We repeated each query 10 times on varying input sizes to validate the variation and accuracy of our measurements. All comparisons were run on an Intel Xeon, 48 core, 384 Gb memory compute server. We used a publicly available BAM file with 3,363,576 records. To evaluate scalability in terms of input size, we randomly sampled the input BAM file into sizes of 10% increments and performed measurements on all subsets individually. To evaluate the query performance and decouple it from Input/Output (IO) we measured the execution time both when printing to an output file but also just counting the filtered reads. We also tested SamQL by limiting it to 2 execution threads or leaving it unbound to automatically scale to the available resources.

As a first test, we decided to compare performance on filtering against a SAM tag that uses string matching and is supported by all methods, CC:Z. Filtering on optional tags forces all methods to read the entire SAM record thus decoupling IO optimizations that depend on skipping optional SAM fields. We find that SamQL performs on par with the other methods even when bound to use just two threads, one for IO and one for compute (Fig. 2A). As a next test we wished to filter on the NH:i tag that involves numerical comparisons. This is an intuitive and straightforward query change in SamQL (Fig. 2B, top) and Sambamba. However, for SAMtools we had to use the newly added -e option instead of the -d to support numeric comparisons. As expected, the naive Bash implementation becomes substantially more complex also raising the possibility for coding mistakes. Again, we find that SamQL execution time is on par or faster than other methods and that execution time again increases linearly with time (Fig. 2B). Overall, we find that most methods perform similarly, except BamTools that is substantially slower than others. We also notice that SamQL multithreading does speed up execution time but only modestly. Importantly, we find that execution time increases linearly with the input size which is a key feature that allows SamQL to be used for large datasets.

Fig. 2

SamQL performance. AC From left to right, the plots correspond to runtime for printing and counting SAM entries on increasing subsets of the input data for different queries. The used queries are: A string query on tag “CC:Z” (A); a numeric query on tag NH:i (B); A range query (C). The corresponding SamQL query is shown at the top. D Parallelization performance for BAM output for a range query (left) and the NH:i tag (right) on a large BAM dataset of approximately 900 million reads. For NH:i only 10% of the file was processed to keep the execution times reasonable. SamQL is inherently concurrent and cannot be limited to less than 2 threads which is why performance is equivalent at 1 and 2 threads. Colors correspond to SamQL using all threads (dark blue), SamQL bound to 2 cores (green), SAMtools (light green), naive Bash/Awk (cyan), BamTools (yellow) and Sambamba (red). The raw data for the plots can be found in Additional file 1: Table S1.

Next, we evaluated the SamQL performance for a reasonably complex range query. Range queries on genomic or transcriptomic coordinates are among the most used query types in bioinformatics analyses. Therefore, BAM files are usually indexed to achieve fast retrieval of alignments that overlap a given region [16]. SamQL can execute range queries on indexed and not indexed BAM or SAM files, albeit much faster when indexing exists. Our data show that SamQL performs on par with the other software and orders of magnitude faster than a Bash approach (Fig. 2C).

The file that we used for evaluating performance although it serves for micro-benchmarks and we have seen linear scalability with size, it does not necessarily simulate real-world usage that might involve tens or hundreds million reads. Therefore, we performed similar benchmarks for a publicly available file of much larger size, approximately 900 million reads. Our results again show that range queries for all tools are executed much faster than naive Bash (Additional file 1: Fig. S1A) and comparable with each other. To gain more insight as to the scalability of the tools to very large datasets and to measure parallelization performance more accurately, we used this larger dataset to run all tools at different parallelization settings, progressively increasing the thread usage. We scaled the tools from 1 to 32 threads to monitor how efficiently the tools use the available resources. We tested a range query on the full file and the filter on the NM:i tag on a smaller, 10%, subset of the file due to very high runtimes for all tools. Initially we focused on SAM output. Interestingly we found that SamQL does not scale as well as the other tools reaching a performance plateau at approximately 4 threads (Additional file 1: Fig. S1B). However, we find that when BAM output is requested the scalability of the tool improves similarly to SAMtools and Sambamba. This indicates that BAM parallelization at the output provides substantial benefits. As expected, we find that the runtime of all tools reduces but with diminishing returns at they get close to 32 threads. Overall, all our tests indicate that SamQL offers high expressivity for complex queries while also achieving high performance and being able to utilize and take advantage of parallel computing.

Read more here: Source link