Annoyed by contig name inconsistencies (e.g. chr22 vs 22)? A solution

Genozip is a software package (developed by yours truly) for compressing genomic files (BAM, FASTQ, VCF and others). It typically compresses 2x-5x better than .gz.

In my research projects, I constantly spend too much time searching for a reference file whose contig names (e.g. chr22 vs 22, MT vs chrM) match those of the BAM and VCF files I have on hand; when they don't match, various bioinformatics tools I use tend to break. So I decided to solve this issue once and for all, using Genozip.

Today, I released a simple new feature to handle this: with the command line option --match-chrom-to-reference, the contig names in your file are updated to match those of the provided reference.

Example (notice that contig 1 is converted to chr1 in both the header and the data line):

> cat example.sam

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701       99      1       9997    34      28M1I6M1I39M4D68M7S     =       10159   324     CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA  FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:,  AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33     NM:i:12 RG:Z:1

> genozip example.sam --reference hg19.p13.plusMT.full_analysis_set.ref.genozip --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)

> genocat example.sam.genozip

@HD VN:1.4  SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701       99      chr1    9997    34      28M1I6M1I39M4D68M7S     =       10159   324     CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA  FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:,  AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33     NM:i:12 RG:Z:1
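
For intuition only, the kind of renaming shown in the example above can be sketched in a few lines of Python. This is an illustrative sketch and not Genozip's actual code: the function names match_contig and rename_sam_line are hypothetical, and the matching rules (add or strip a "chr" prefix, and treat MT and chrM as equivalent) are assumptions chosen to cover the cases mentioned in this post.

```python
def match_contig(name, reference_contigs):
    """Return the reference's spelling of `name`, or `name` unchanged if none fits.

    Assumed rules (illustrative only): "22" <-> "chr22" and "MT" <-> "chrM".
    """
    ref = set(reference_contigs)
    if name in ref:
        return name
    if name.startswith("chr"):
        stripped = name[3:]
        candidate = "MT" if stripped == "M" else stripped
    else:
        candidate = "chrM" if name == "MT" else "chr" + name
    return candidate if candidate in ref else name


def rename_sam_line(line, reference_contigs):
    """Rename contigs in one tab-delimited SAM line (header or alignment)."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "@SQ":
        # Header: the contig name is stored in the SN: tag.
        fields = [
            "SN:" + match_contig(f[3:], reference_contigs) if f.startswith("SN:") else f
            for f in fields
        ]
    elif not line.startswith("@") and len(fields) >= 7:
        # Alignment: RNAME is column 3 and RNEXT is column 7 (1-based).
        for col in (2, 6):
            if fields[col] not in ("*", "="):
                fields[col] = match_contig(fields[col], reference_contigs)
    return "\t".join(fields)


if __name__ == "__main__":
    reference = ["chr1", "chr22", "chrM"]
    print(rename_sam_line("@SQ\tSN:1\tLN:249250621", reference))
    print(rename_sam_line("read1\t99\t1\t9997\t34\t4M\t=\t10159\t324\tACGT\tFFFF", reference))
```

Names that have no counterpart in the reference are left unchanged in this sketch, so it never invents a contig that the reference does not contain.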

Documentation: genozip.com/match-chrom.html

Installing: genozip.com/installing.html

Publication: www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor
