Genozip is a software package (developed by yours truly) for compressing genomic files (BAM, FASTQ, VCF and others). It typically compresses 2x-5x better than .gz.
In my research projects, I constantly spend too much time searching for the right reference file with the right contig names (e.g. chr22 vs 22; MT vs chrM) for the BAM and VCF files on hand, because otherwise various bioinformatics tools I use tend to break. So I decided to solve this issue once and for all, using Genozip.
Today, I released a new simple feature to handle this: with the command line option --match-chrom-to-reference, your file’s contigs are updated to match those of the provided reference.
Example (notice contig 1 is converted to chr1 both in the header and in the data line):
> cat example.sam
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 1 9997 34 28M1I6M1I39M4D68M7S = 10159 324 CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:, AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33 NM:i:12 RG:Z:1
> genozip example.sam --reference hg19.p13.plusMT.full_analysis_set.ref.genozip --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)
> genocat example.sam.genozip
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701 99 chr1 9997 34 28M1I6M1I39M4D68M7S = 10159 324 CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:, AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33 NM:i:12 RG:Z:1
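For readers who want to see the idea in code, here is a rough Python sketch of this kind of contig renaming on tab-separated SAM lines. It is not Genozip's implementation. The function names match_contig and rename_sam_line are hypothetical, and the sketch assumes only the two naming differences mentioned above (the chr prefix and MT vs chrM).

```python
def match_contig(name, reference_contigs):
    """Return the reference's spelling of `name`, or `name` unchanged if none matches."""
    if name in reference_contigs:
        return name
    if name.startswith("chr"):
        # chr22 -> 22, chrM -> MT
        stripped = name[3:]
        candidate = "MT" if stripped == "M" else stripped
    else:
        # 22 -> chr22, MT -> chrM
        candidate = "chrM" if name == "MT" else "chr" + name
    return candidate if candidate in reference_contigs else name


def rename_sam_line(line, reference_contigs):
    """Rename contigs in one SAM line: SN: in @SQ headers, RNAME/RNEXT in data lines."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@SQ"):
        fields = [
            "SN:" + match_contig(f[3:], reference_contigs) if f.startswith("SN:") else f
            for f in fields
        ]
    elif not line.startswith("@"):
        for i in (2, 6):  # 0-based positions of RNAME and RNEXT
            if len(fields) > i and fields[i] not in ("*", "="):
                fields[i] = match_contig(fields[i], reference_contigs)
    return "\t".join(fields)


if __name__ == "__main__":
    reference = {"chr1", "chrM"}  # assumed contig names of the reference
    print(rename_sam_line("@SQ\tSN:1\tLN:249250621", reference))
```

Names that have no counterpart in the reference are left untouched in this sketch; a real tool may choose to warn or fail instead.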
Documentation: genozip.com/match-chrom.html
Installing: genozip.com/installing.html
Publication: www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor