sqzlib - kseq compatible DNA fastA/Q encoding and compression library

Hi all!!

I would like to share a little project I have been working on for a while now. I would greatly appreciate it if you tried to find bugs or break it with your own data.

sqzlib is a little fastA/Q encoding library that uses zlib or zstd as its compression engine.

github.com/7PintsOfCherryGarcia/sqzlib

In summary, sqzlib encodes DNA fastA/Q data using bit packing for nucleotides, run-length encoding for Ns and non-ACGT nucleotides, and a combination of 8-level quality binning and run-length encoding for qualities. Aided by zlib or zstd compression, sqzlib achieves very good compression ratios at fast runtimes. You can check the benchmark I have in the repo.
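To make the sequence side of this concrete, here is a minimal Python sketch of 2-bit packing for ACGT stretches combined with run-length encoding for everything else. It is only an illustration of the general technique: the segment layout, the bit order, and all function names (`pack_2bit`, `encode_seq`, and so on) are assumptions made up for this example, not sqzlib's actual format or API.

```python
# Illustrative sketch only: NOT sqzlib's real on-disk layout or API.
from itertools import groupby

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_2bit(seq):
    """Pack an ACGT-only string into bytes, 4 bases per byte (low bits first)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack_2bit(data, n):
    """Recover the first n bases from a 2-bit packed buffer."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

def encode_seq(seq):
    """Split a sequence into ACGT segments (bit packed) and non-ACGT runs.

    Non-ACGT symbols are all treated as N and stored only as a run length.
    Lowercase (masked) bases are uppercased first, so masking is lost.
    """
    segments = []
    for is_acgt, group in groupby(seq.upper(), key=lambda b: b in CODE):
        run = "".join(group)
        if is_acgt:
            segments.append(("ACGT", len(run), pack_2bit(run)))
        else:
            segments.append(("N", len(run)))
    return segments

def decode_seq(segments):
    """Rebuild the sequence; every non-ACGT run comes back as Ns."""
    parts = []
    for seg in segments:
        if seg[0] == "ACGT":
            parts.append(unpack_2bit(seg[2], seg[1]))
        else:
            parts.append("N" * seg[1])
    return "".join(parts)
```

In this sketch, a round trip through `decode_seq(encode_seq(seq))` returns the input unchanged only when the input is uppercase and contains nothing but A, C, G, T and N. The packed segments would then be handed to a general-purpose compressor such as zlib or zstd.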

sqzlib uses its own format to store DNA fastA/Q sequences in “blocks”. Briefly, a number of sequences are packed into “data blocks” that can be accessed independently of one another, so applications can be built around the sqz format for multithreaded IO.
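The idea of independently accessible blocks can be sketched in a few lines of Python using the standard `zlib` module. This is a hedged illustration of the general pattern, not the sqz format: the newline-joined payload, the `per_block` parameter, and the function names are all invented for this example.

```python
# Illustrative sketch only: the real sqz block format is different.
import zlib
from concurrent.futures import ThreadPoolExecutor

def make_blocks(records, per_block=2):
    """Group records and compress each group on its own."""
    blocks = []
    for i in range(0, len(records), per_block):
        payload = "\n".join(records[i:i + per_block]).encode()
        blocks.append(zlib.compress(payload))
    return blocks

def read_block(block):
    """Decompress one block without touching any other block."""
    return zlib.decompress(block).decode().split("\n")

def read_all(blocks, threads=4):
    """Decompress blocks in a thread pool and flatten the results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields results in input order, so record order is kept here.
        return [rec for chunk in pool.map(read_block, blocks) for rec in chunk]
```

Because each block carries everything needed to decode it, any worker can take any block. In this sketch `pool.map` keeps the original order; a design that instead emits blocks as soon as each worker finishes would be one way for sequence order to change in multithreaded mode.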

Most importantly, sqzlib is fully compatible with klib/kseq.h, one of the highest performance fastA/Q parsers. This means that any application that uses kseq.h for fastA/Q parsing can be easily modified to use sqzlib instead. You can find patched versions of seqstats, minimap2, and bwa-mem2 on my GitHub, or you can patch them yourself with the included patches.

Disadvantages

sqzlib comes with some caveats:

  • Only works with DNA fastA/Q

  • Non-ACGT IUPAC nucleotides are converted to Ns

  • 8-level quality binning is not reversible (the original quality values are lost)

  • When encoding/decoding in multithreaded mode, the order of sequences might change

  • Masked bases are unmasked

  • Tested only on x86 GNU/Linux systems
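To show why quality binning in the list above cannot be undone, here is a small Python sketch that maps Phred+33 quality characters to 8 bins, run-length encodes the bin numbers, and decodes each bin back to a single representative value. The equal-width bins of 6 Phred units and the choice of representative value are assumptions made for this example only; the bin boundaries sqzlib actually uses may differ.

```python
# Illustrative sketch only: bin boundaries here are hypothetical.
from itertools import groupby

def bin_quality(qual, offset=33, width=6):
    """Map each Phred+33 character to a bin number 0..7 (fits in 3 bits)."""
    return [min((ord(c) - offset) // width, 7) for c in qual]

def rle(values):
    """Run-length encode a list into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def unbin_quality(runs, offset=33, width=6):
    """Decode runs, using one representative score per bin."""
    return "".join(chr(offset + b * width + width // 2) * n for b, n in runs)
```

For example, with these hypothetical bins the quality string `IIIIII##5` becomes the runs `[(6, 6), (0, 2), (3, 1)]` and decodes to `HHHHHH$$6`: the overall shape of the quality profile survives, but the exact original scores cannot be recovered.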

Some of these issues will be addressed in the coming weeks, especially the handling of masked bases.

A lot of work still remains:

  • Currently there is no low level API documentation, only kseq compatibility

  • There is no random sequence access yet

  • The project is in “functional” mode, but a lot of optimization is still needed.

  • Only zlib and zstandard are used as compression engines.

My main priorities now are to get the full API well documented and to add random sequence access.

Feedback would be greatly appreciated!!!

Here is a little benchmark of sqzlib compared to genozip on a 100k subsample of the NCBI NT blast database. Runtime and memory usage are based on /usr/bin/time; compression ratio is based on the original file size:

[Figure: compression ratio]

[Figure: runtime]

[Figure: memory usage]