How to plot coverage and depth statistics of a bam file

This is the same problem I recently had, but I was not able to find a good solution. I was finally able to work a method which I am sharing here. While this question is 2 years old but I am hoping there are many individuals trying to work out the solution to this problem, so I hope this will be helpful for others.

Identification of the depth of coverage is quite useful in 1) identifying the regions that might have potential paralogous alignment 2) finding the coverage at your regions of interest.

Step #1) First identify the depth at each locus from a bam file.

I have found samtools depth option more useful in this regard, when coverage at each locus is desired.

samtools depth deduped_MA605.bam > deduped_MA605.coverage

The output file ‘deduped_MA605.coverage‘ file will have 3 columns (Chr#, position and depth at that position) like below.

Chr  position depth (this header will be absent though)
1       3980        66
1       3981        68
1      3982     67
1      3983        67
1      3984     68

Step #2) Now, select the coverage (depth) by locus for each chromosome and/or regions

We can use the coverage file to plot it in R. But, the file is so large that it will suck up almost all the memory.
It better to split the coverage by chromosome (or region of the chromosome if required).

To select the coverage for a particular chromosome (Chr#1 in my case)

awk '$1 == 1 {print $0}' deduped_MA605.coverage > chr1_MA605.coverage

To select coverage from chr #2

awk '$1 == 2 {print $0}' deduped_MA605.coverage > chr2_MA605.coverage

If the chrosomosome has string characters it can be adjusted as

awk '$1 == "chr2" {print $0}' deduped_MA605.coverage > chr2_MA605.coverage

Step #3) To plot the data in R this coverage file will need to be imported and the headers need to be added.

MA605.chr2 <- read.table("/media/everestial007/SeagateBackup4.0TB/New_Alignment_Set/05-B-deDupedReads-forScanIndel/chr2_MA605.coverage",
         header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)`

Note: The header of the column are automatically set as V1, V2 and V3.

To rename the headers

library(reshape) # loads the library to rename the column names

MA605.chr2<-rename(MA605.chr2,c(V1="Chr", V2="locus", V3="depth")) # renames the header

Now, plot the coverage by depth:

plot(MA605.chr2$locus, MA605.chr2$depth)

to get the wider/cleaner view of the plot use

library(lattice, pos=10) xyplot(depth ~ locus, type="p", pch=16, auto.key=list(border=TRUE), par.settings=simpleTheme(pch=16), scales=list(x=list(relation='same'), y=list(relation='same')), data=MA605.chr2, main="depth by locus - Chr2 (SampleMA605)")

The output plot looks like this:
coverage depth by locus

Note:

  • We can see that there is a region around the middle and next to it that has a very high coverage. This likely suggests paralogous alignment.

  • Also, there is a gap in alignment. The region with no alignment is near (at) centromere where no consensus sequence has been found.

  • Reads may be later filtered by coverage.

Thanks,

Read more here: Source link