Count strand-specific 5′ / 3′ coverage across whole genome for paired-end fragments

I’d like to use bedtools to create strand-specific per-base coverage profiles for my paired-end bam alignments of stranded RNA-seq data. Existing Biostars answers here and here suggest using bedtools genomecov with the -5 option to extract only coverage from the 5′ ends of fragments. These are the commands that I’ve tried to implement with bedtools v2.29.2:

bedtools genomecov -pc -3 -d -strand + -ibam sample.bam > sample_plus_3p.txt
bedtools genomecov -pc -3 -d -strand - -ibam sample.bam > sample_minus_3p.txt
bedtools genomecov -pc -5 -d -strand + -ibam sample.bam > sample_plus_5p.txt
bedtools genomecov -pc -5 -d -strand - -ibam sample.bam > sample_minus_5p.txt

My expectation when running these commands is that I will get a per-base genome coverage reports (-d) for the ends (-3/-5) of my paired-end fragments (-pc) in a strand-specific manner (-strand), with zeros given at all positions where the end of a paired-end read does not fall. No error is given running the above commands, and I indeed get a report listing every position in my genome. However, checksum values for the different files indicate that the -3 and -5 options are giving the same results.

Input1

bedtools genomecov -pc -3 -d -strand + -ibam sample.bam | md5sum
bedtools genomecov -pc -3 -d -strand - -ibam sample.bam | md5sum
bedtools genomecov -pc -5 -d -strand + -ibam sample.bam | md5sum
bedtools genomecov -pc -5 -d -strand - -ibam sample.bam | md5sum

Output1

d93beb0747c29913877fc3aa20d2b9f9  -
8010aaaa685d55dd627f48e9cf5d9328  -
d93beb0747c29913877fc3aa20d2b9f9  -
8010aaaa685d55dd627f48e9cf5d9328  -

This is true for both per-base reports (-d) and BedGraph outputs (-bga). I get the same checksum values when I exclude the -3 and -5 options, indicating that these flags are ignored in the presence of the other parameters, and bedtools is giving coverage across full fragment intervals.

Input2

bedtools genomecov -pc -d -strand + -ibam sample.bam | md5sum
bedtools genomecov -pc -d -strand - -ibam sample.bam | md5sum

Output2

d93beb0747c29913877fc3aa20d2b9f9  -
8010aaaa685d55dd627f48e9cf5d9328  -

When I take away the -pc option, I get the desired behavior (i.e. different results for 3′ and 5′ ends).

Input3

bedtools genomecov -d -strand + -ibam sample.bam | md5sum
bedtools genomecov -d -3 -strand + -ibam sample.bam | md5sum
bedtools genomecov -d -5 -strand + -ibam sample.bam | md5sum
bedtools genomecov -d -strand - -ibam sample.bam | md5sum
bedtools genomecov -d -3 -strand - -ibam sample.bam | md5sum
bedtools genomecov -d -5 -strand - -ibam sample.bam | md5sum

Output3

92459f43d4a9a229545d93552d53eab3  -
3f03d8707c0cdfbc4ebec1f539a91541  -
9d8d6f3275a01781ff044a2fe5ac8368  -
68cd7d76828f1289b71fd934fd032487  -
85dbc711cb38961a6d6d253581544fd6  -
7cca5f34bfd95e296d8e3b3cb8b091a7  -

However, the paired-end nature of my data is not handled correctly without the -pc option. I can tell because paired reads are being mapped to opposite strands, where they should be mapping to the strand indicated by the first read in the pair.

My Questions

Am I misunderstanding how genomecov parses bam files? Do I need to need to include the -du option (which has been eliminated from the help page for v2.30.0)? What are some other strategies by which I can get the per-base coverage reports in the format that I need?

From bedtools genomecov -h:

-du Change strand af [sic] the mate read (so both reads from the same strand) useful for strand specific

Edit

I think the new definition of the -pc option may be enlightening.

bedtools version 2.29.2

-pc Calculate coverage of pair-end fragments.

bedtools version 2.30.0

-pc Calculates coverage of intervals from left point of a pair reads to the right point.

Read more here: Source link