Good / recommended way to archive fastq and bam files?

The only free and open-source tool I know of that can help here is zstd. Its GitHub repository's README describes it as:

Zstandard, or zstd as short version, is a fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios. It’s backed by a very fast entropy stage, provided by Huff0 and FSE library.

We read this blog post on using it to compress fastq files and were intrigued, so we ran some tests. We couldn't reproduce the level of compression the article claims, but that might be due to different input files or to the way a compression dictionary was/wasn't created.

Even in our tests, though, we saw that it provides better compression than gzip (here via bgzip). Here are the numbers for one test, run on the R1 file of a human WGS run:

Uncompressed fastq    bgzip-compressed    zstd-compressed
------------------    ----------------    ---------------
179G                  58G                 47G
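If you want to reproduce this kind of comparison, here is a minimal sketch (file names are placeholders; it assumes the zstd CLI and bgzip from htslib are installed):

    # Compress the same fastq both ways, keeping the input (-k)
    bgzip -k sample_R1.fastq        # -> sample_R1.fastq.gz
    zstd -19 -k sample_R1.fastq     # -> sample_R1.fastq.zst

    # Compare the resulting sizes
    ls -lh sample_R1.fastq sample_R1.fastq.gz sample_R1.fastq.zst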

This was done with the compression level set to 19 (zstd -19), which we found to be the best tradeoff between size and time to compress. We also saw that using a human genome fasta file to train a compression dictionary, and then supplying that dictionary when compressing human fastq files, actually resulted in worse performance (a larger compressed file) than not using a dictionary at all. But it is entirely possible that we just did it wrong; we didn't spend a lot of time on it.
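For reference, the general shape of zstd's dictionary workflow looks like this; it's a sketch with placeholder file names, not necessarily what the blog post did. Worth noting: zstd's dictionary mode is designed for collections of many small files, which may be part of why a dictionary trained on one large fasta didn't help here:

    # Train a dictionary from reference sequence(s); zstd expects many
    # sample files and may warn when given a single large fasta
    zstd --train reference/*.fasta -o human.dict

    # Compress a fastq using that dictionary
    zstd -19 -D human.dict sample_R1.fastq -o sample_R1.fastq.zst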

So you might want to investigate zstd as well; maybe you can get better results if you play around with the dictionary training. On the plus side, it is free and open source. On the other hand, it won't do much for bam or cram files: bam is already compressed internally (BGZF, a blocked variant of gzip) and cram compresses against a reference, so there is little redundancy left for zstd to remove.
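Whichever compressor you settle on, for archiving it's worth verifying the round trip before deleting the originals. A minimal check with zstd (file names again placeholders):

    # Test the archive's internal integrity
    zstd -t sample_R1.fastq.zst

    # Decompress to stdout and compare checksums against the original
    zstd -dc sample_R1.fastq.zst | md5sum
    md5sum sample_R1.fastq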
