Hi Everyone,
I am currently looking at Acute Myeloid Leukemia (AML) paired-end WGS samples from the TARGET data set (ocg.cancer.gov/programs/target/target-methods#3241).
A bioinformatician in our group remapped the samples from hg19 to hg38. Unfortunately, we do not have any copies of the hg19 version anymore.
However, when I try to run anything with the BAMs, I get an error that the mates of the paired-end sequences are not found. I have tried Picard’s ValidateSamFile and samblaster --addMateTags to diagnose and fix it, to no avail.
samblaster: Found 1033191258 of 1038680280 (99.472%) total read ids are marked paired yet are unmated.
ValidateSamFile would not run on the full WGS BAM, so the output below is for just the subset of the data I am interested in (chrM reads):
HISTOGRAM java.lang.String
Error Type                    Count
ERROR:INVALID_VERSION_NUMBER  1
ERROR:MATE_NOT_FOUND          304414
This is what the reads look like. As you can see, they are missing the MC and MQ tags, and the two mates have different read names (ending in .1 and .2).
SRR1168035.161988987.1 83 MT 1178 40 100M = 1035 -243 DDDDDEEDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEDCBFHJJIGEHFGIGBF@HD?IHCIIJJIIJJJJJJJJJGHJIIJHHHHHFDDA4#CC NM:i:1 AS:i:98 XS:i:98 RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001
SRR1168035.161988987.2 163 MT 1035 40 100M = 1178 243 @CCFFFFFHHHHHJGIIJJJEHIGIGIEGIJJGGIGEIIGIJJJD#####00<FHI#.;DEHGGHHHE@BC############################# NM:i:14 AS:i:72 XS:i:72 RG:Z:PAMYMABMNF_BCCAGSC_S1_L001_001
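To spell out the FLAG values of the two records above (83 and 163), here is a small sketch that decodes them using the bit definitions from the SAM specification. The short labels and the function name decode_flag are my own. Both reads are flagged as paired and properly paired, as first and second in pair respectively.

```python
# SAM FLAG bits as defined in the SAM specification; the labels are my own shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(83))   # ['paired', 'proper_pair', 'reverse', 'read1']
print(decode_flag(163))  # ['paired', 'proper_pair', 'mate_reverse', 'read2']
```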
My question is basically: how can I fix the mate information? I was wondering whether I could write a simple script that goes through the queryname-sorted file, takes the lines with the same name (xxxx.1 and xxxx.2), and uses them to create the MC and MQ tags. Or is there a program or script out there that already does this?
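A minimal sketch of the script I have in mind, assuming the input is SAM text grouped by read name and that the mates differ only by a trailing .1 / .2 as in the reads above. The function names (strip_suffix, set_tag, add_mate_tags, fix_records) are my own. I am also assuming that giving both mates the same QNAME is what the downstream tools need, which I have not verified. It only sets MC (mate CIGAR) and MQ (mate mapping quality) and leaves FLAG, RNEXT, PNEXT and TLEN untouched.

```python
import sys


def strip_suffix(qname):
    """Drop a trailing ".1" or ".2" so both mates share one read name (assumed naming scheme)."""
    base, dot, end = qname.rpartition(".")
    return base if dot and end in ("1", "2") else qname


def set_tag(fields, tag, typ, value):
    """Return a copy of the SAM fields with the optional tag replaced or appended."""
    prefix = f"{tag}:{typ}:"
    kept = fields[:11] + [f for f in fields[11:] if not f.startswith(prefix)]
    kept.append(f"{prefix}{value}")
    return kept


def add_mate_tags(fields, mate):
    """Add MQ (mate MAPQ) and, if the mate has a CIGAR, MC (mate CIGAR)."""
    fields = set_tag(fields, "MQ", "i", mate[4])
    if mate[5] != "*":
        fields = set_tag(fields, "MC", "Z", mate[5])
    return fields


def flush(group):
    """Tag every record in one read-name group using the primary alignment of its mate."""
    primary = {}
    for f in group:
        flag = int(f[1])
        if flag & 0x900 == 0:  # neither secondary nor supplementary
            if flag & 0x40:
                primary[1] = f
            elif flag & 0x80:
                primary[2] = f
    out = []
    for f in group:
        flag = int(f[1])
        if flag & 0x40:
            mate = primary.get(2)
        elif flag & 0x80:
            mate = primary.get(1)
        else:
            mate = None
        if mate is not None:
            f = add_mate_tags(f, mate)
        out.append("\t".join(f))
    return out


def fix_records(lines):
    """Yield fixed SAM lines from a stream of SAM lines grouped by read name."""
    group, current = [], None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.split("\t")
        name = strip_suffix(fields[0])
        if name != current and group:
            yield from flush(group)
            group = []
        current = name
        fields[0] = name
        group.append(fields)
    if group:
        yield from flush(group)


if __name__ == "__main__" and not sys.stdin.isatty():
    for out_line in fix_records(sys.stdin):
        print(out_line)
```

I would run it on name-sorted input, for example samtools sort -n in.bam | samtools view -h - | python fix_mates.py | samtools view -b -o fixed.bam - (in.bam, fixed.bam and fix_mates.py are placeholder names).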
Kind regards,
Jip