Paired-end reads reported without mates: how to play matchmaker?

Hi Everyone,

I am currently looking at Acute Myeloid Leukemia (AML) paired-end WGS samples from the TARGET project (ocg.cancer.gov/programs/target/target-methods#3241).

A bioinformatician in our group remapped the samples from hg19 to hg38. Unfortunately, we do not have any copies of the hg19 version anymore.

However, when I try to run anything with the BAMs, I get an error that the mates of the paired-end reads cannot be found. I have run Picard's ValidateSamFile and samblaster --addMateTags to try to fix it, to no avail.

samblaster: Found   1033191258 of 1038680280 (99.472%) total read ids are marked paired yet are unmated.

The output below is for just a subset of the data I am interested in (chrM reads); ValidateSamFile would not run on the full WGS BAM:

 HISTOGRAM    java.lang.String
    Error Type    Count
    ERROR:INVALID_VERSION_NUMBER    1
    ERROR:MATE_NOT_FOUND    304414

This is what the reads look like. As you can see, they lack the MC and MQ tags.

SRR1168035.161988987.1    83    MT    1178    40    100M    =    1035    -243        DDDDDEEDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEDCBFHJJIGEHFGIGBF@HD?IHCIIJJIIJJJJJJJJJGHJIIJHHHHHFDDA4#CC    NM:i:1    AS:i:98    XS:i:98    RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001


SRR1168035.161988987.2    163    MT    1035    40    100M    =    1178    243        @CCFFFFFHHHHHJGIIJJJEHIGIGIEGIJJGGIGEIIGIJJJD#####00<FHI#.;DEHGGHHHE@BC#############################    NM:i:14    AS:i:72    XS:i:72    RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001

My question is basically: how can I fix the mate information? I was wondering whether I could write a simple script that goes through the queryname-sorted file, takes the two lines that share a read name (xxxx.1 and xxxx.2), and uses them to create the MC and MQ tags. Or is there a program or script out there that already does this?
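To make the idea concrete, here is a rough sketch of the kind of script I have in mind, not a tested solution. It works on SAM text (for example the output of samtools view -h) and assumes the input is queryname-sorted, so that the two mates sit on adjacent lines. It also assumes that the mates differ only by a trailing .1/.2 (or /1, /2) in the read name, and it strips that suffix so both records end up with the same name; whether the differing names are the actual reason the tools report unmated reads is my guess. The function names (base_name, set_tag, annotate, fix_mates) are made up for illustration.

```python
import re

# Assumption: mates are named "<id>.1" / "<id>.2" (or "<id>/1", "<id>/2").
# A name that legitimately ends in ".1" or ".2" would be stripped too.
MATE_SUFFIX = re.compile(r"[./][12]$")


def base_name(qname):
    """Read name with the assumed mate suffix removed."""
    return MATE_SUFFIX.sub("", qname)


def set_tag(fields, tag, sam_type, value):
    """Return a copy of fields with any existing `tag` replaced."""
    optional = [f for f in fields[11:] if not f.startswith(tag + ":")]
    return fields[:11] + optional + [f"{tag}:{sam_type}:{value}"]


def annotate(read, mate):
    """Rename `read` and copy the mate's CIGAR (MC) and MAPQ (MQ) onto it."""
    out = [base_name(read[0])] + read[1:]
    if mate[5] != "*":  # an unmapped mate has no CIGAR to report
        out = set_tag(out, "MC", "Z", mate[5])
    return set_tag(out, "MQ", "i", mate[4])


def fix_mates(lines):
    """Yield SAM lines with MC/MQ added; input must be queryname-sorted."""
    pending = None  # first primary record of a name, waiting for its mate
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):  # header lines pass through untouched
            yield line
            continue
        fields = line.split("\t")
        if int(fields[1]) & 0x900:  # secondary/supplementary: rename only
            fields[0] = base_name(fields[0])
            yield "\t".join(fields)
        elif pending is not None and base_name(pending[0]) == base_name(fields[0]):
            yield "\t".join(annotate(pending, fields))
            yield "\t".join(annotate(fields, pending))
            pending = None
        else:
            if pending is not None:  # previous read never found a mate
                yield "\t".join(pending)
            pending = fields
    if pending is not None:  # trailing orphan is left unchanged
        yield "\t".join(pending)


# Demo on the two records above, with SEQ and QUAL replaced by "*" for brevity.
demo = [
    "@HD\tVN:1.6\tSO:queryname",
    "SRR1168035.161988987.1\t83\tMT\t1178\t40\t100M\t=\t1035\t-243\t*\t*\tNM:i:1",
    "SRR1168035.161988987.2\t163\tMT\t1035\t40\t100M\t=\t1178\t243\t*\t*\tNM:i:14",
]
for record in fix_mates(demo):
    print(record)
```

Reads that never find a partner are written out unchanged, so they can still be counted or filtered afterwards. The sketch only touches the read name and the MC/MQ tags; it does not recompute flags, mate positions, or template lengths.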

Kind regards,

Jip
