Sorting a STAR-output bam file with samtools leads to a significant reduction in size of the sorted bam file

I am not too familiar with the subject, neither with the details of compression algorithms nor with the exact specification of the BAM format, but I think it is just a matter of more or less efficient compression.

Various implementations of zlib exist and the samtools authors have run some benchmarks showing significant differences, at least for lower compression levels.

But even with the exact same implementation and compression level, the size of a compressed file may vary if its content is reordered. The reason is that this kind of compression works by storing repeated content only once and referencing it from every later location where it occurs. By design, zlib is limited to a 32 KB window, which was a sensible choice in the early ’90s, when it was conceived. So if sorting brings similar content closer together and into that 32 KB window (e.g. through different handling of mates or secondary alignments), it allows for more efficient compression. This is nicely exemplified by clumpify from the BBTools suite, which can shrink compressed FastQ files by another 50% or so just by reordering reads by sequence similarity.
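If you want to see the effect for yourself, here is a minimal sketch using Python's built-in zlib module. It is only a toy model and makes a few assumptions: the "reads" are random made-up sequences, make_reads and compressed_size are hypothetical helper names, and the BAM container format itself is not modelled at all. It just compresses the same lines twice, once in scattered order and once sorted so that identical lines sit next to each other.

```python
import random
import zlib

def make_reads(n_templates=2000, copies=4, length=100, seed=0):
    """Return toy 'reads': each random template sequence occurs `copies` times."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_templates):
        template = "".join(rng.choice("ACGT") for _ in range(length))
        reads.extend([template] * copies)
    rng.shuffle(reads)  # scatter the duplicates, like unsorted output
    return reads

def compressed_size(reads, level=6):
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), level))

unsorted_reads = make_reads()
sorted_reads = sorted(unsorted_reads)  # identical reads become neighbours

unsorted_size = compressed_size(unsorted_reads)
sorted_size = compressed_size(sorted_reads)

print(f"unsorted: {unsorted_size} bytes")
print(f"sorted:   {sorted_size} bytes")
```

Both inputs contain exactly the same lines, so any difference in size comes purely from the order. In the shuffled version most duplicates end up further apart than the 32 KB window, so zlib cannot reference them.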

Since today’s computers can hold much more information in memory, even in mobile and embedded environments, the window for compression algorithms can be increased for considerably better compression. Facebook’s Zstandard, for example, has no inherent limit and can address terabytes of memory. It mostly operates on something between 1 and 8 MB, though. However, it has a long-range mode, --long, that allows for a maximum window size of 2 GB. You would probably be surprised how much that can shrink a Fasta file of some reference genome interspersed with long transposons. Also, zstd --train is a really cool feature, because it builds a dictionary of repeating patterns informed by the content that needs to be compressed. Once trained, a dictionary works really well if thousands of similar, small files need to be compressed individually.
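The influence of the window size can be illustrated without installing zstd at all, because zlib lets you shrink its own window below the 32 KB default via the wbits parameter. The sketch below is a stand-in for the idea, not a Zstandard example: deflate_size is a hypothetical helper, and the data is an artificial 16 KB random sequence repeated eight times, so every repeat lies 16 KB behind the current position.

```python
import random
import zlib

def deflate_size(data, wbits, level=6):
    """Compress `data` with a 2**wbits byte window and return the output size."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return len(compressor.compress(data) + compressor.flush())

rng = random.Random(0)
# One 16 KB random "sequence", repeated 8 times back to back.
unit = "".join(rng.choice("ACGT") for _ in range(16 * 1024)).encode("ascii")
data = unit * 8

# Window sizes of 512 bytes, 4 KB and 32 KB.
sizes = {wbits: deflate_size(data, wbits) for wbits in (9, 12, 15)}
for wbits, size in sizes.items():
    print(f"window {2 ** wbits:>6} bytes -> {size} bytes compressed")
```

Only the 32 KB window reaches back far enough to find the previous copy, so only that setting benefits from the repetition. The same logic applies one scale up: a repeat that is megabytes away is invisible to zlib, but within reach of a compressor with a larger window.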

But I’m getting off topic – ultimately you just wanted to know if everything was fine with your BAM files, and it is 😉
