UMI workflow resulting in bams with empty reads

Hello all,

In my NGS workflow for UMI based reads, I first tried identifying and removing sequence adapters using bbmerge and cutcadapt:

BBMERGE -Xmx1g -ignorejunk in1=SAMPLE_R1  in2=SAMPLE_R2 outa= adapters.fa itn
CUTADAPT -a forward_adapter -A reverse_adapter -o s_2_1_sequence_trimmed_UN.fastq.gz -p s_2_2_sequence_trimmed_UN.fastq.gz SAMPLE_R1 SAMPLE_R2

Then, I converted the trimmed fastq files to an unaligned bam:

java -Xmx4g -jar PICARD FastqToSam \
    O=SAMPLE_PFX_UN_unal.bam \
    F1=s_2_1_sequence_trimmed_UN.fastq.gz \
    F2=s_2_2_sequence_trimmed_UN.fastq.gz \
    SM=SAMPLE_PFX  \
    LB=null \
    PU=NA \
    PL=Illumina

and extracted the umi barcodes from these bams:

java -Xmx4g -jar FGBIO ExtractUmisFromBam \
    --input=SAMPLE_PFX_UN_unal.bam \
    --output=SAMPLE_PFX_UN_unal_umi_extracted.bam \
    --read-structure=5M2S+T 5M2S+T \
    --molecular-index-tags=ZA ZB \
    --single-tag=RX

When I use ValidateSamFile on the output produced above, I see that there are 8 empty reads in the bam:

ERROR:EMPTY_READ        8

which errors out when I try to perform the next step of converting the unaligned bam back to fastq:

  java -Xmx4g -jar PICARD SamToFastq \
        I=SAMPLE_PFX_UN_unal_umi_extracted.bam   \
        F=SAMPLE_PFX_UN_unal_umi_extracted.fastq \
        INTERLEAVE=true

The error seen:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record XXXXXXX, Read name XXXXXXX, Zero-length read without FZ, CS or CQ tag

It appears that removal of sequence adapters and UMI barcodes has resulted in empty reads that cause the error. Has anyone experienced this and are there any workarounds?

Thank you

Read more here: Source link