Tools To Calculate Average Coverage For A Bam File?
I would like to get the average coverage of all the captured bases in a BAM file. What would be the best way to do this? What I am looking for is a single number, like 40X. Given that there may be millions of bases sequenced in a next-gen study, I would like the overall average coverage across all of these bases as one number.
Thanks
If you have samtools and awk, you can keep it simple with this:
samtools depth *bamfile* | awk '{sum+=$3} END { print "Average = ",sum/NR}'
and with stdev:
samtools depth *bamfile* | awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'
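One caveat worth knowing: by default samtools depth only reports positions that have at least one read, so uncovered bases are silently excluded from the average; the -a flag includes zero-coverage positions as well, which can lower the number considerably. The awk averaging step itself can be sanity-checked on simulated three-column depth output (chrom, pos, depth, with made-up values):

```shell
# samtools depth skips zero-coverage positions unless you pass -a:
#   samtools depth -a bamfile | awk '{sum+=$3} END { print "Average = ",sum/NR }'
# Check the awk logic on assumed sample depths 10, 20 and 30:
printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t30\n' \
  | awk '{sum+=$3} END { print "Average = ",sum/NR }'
```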
Try the genomeCoverageBed tool in the BEDtools package, which takes a BAM or BED file as input and "computes a histogram of feature coverage (e.g., aligned sequences) for a given genome. Optionally, by using the -d option, it will report the depth of coverage at each base on each chromosome in the genome file (-g)."
The way we run this is to first make a tab-delimited genome.txt file with “contig_name contig_size” information for all contigs in your genome and then run genomeCoverageBed on a sorted BAM file:
$ genomeCoverageBed -ibam sortedBamFile.bam -g genome.txt > coverage.txt
This will give you a summary histogram of coverage across each contig and for the entire genome, from which you can obtain the modal or mean value (the single fold-coverage value you are looking for).
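To pull the mean out of that histogram with awk: assuming the standard five-column histogram output (chrom, depth, number of bases at that depth, chrom/genome size, fraction of bases at that depth), the genome-wide rows are labelled "genome" and the mean is the depth-weighted sum of the fractions:

```shell
# Mean coverage from the genome-wide rows of the genomeCoverageBed histogram
# produced by the command above (columns: chrom, depth, bases at that depth,
# chrom/genome size, fraction of bases at that depth):
awk '$1 == "genome" { sum += $2 * $5 } END { printf "Mean coverage = %.2f\n", sum }' coverage.txt
```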
And the mandatory R solution. The Rsamtools package can be used to read in the BAM file.
Here is a code example:
library(Rsamtools)
bamcoverage <- function (bamfile) {
  # read in the bam file
  bam <- scanBam(bamfile)[[1]] # the result comes in nested lists
  # filter reads without match position
  ind <- ! is.na(bam$pos)
  ## remove non-matches, they are not relevant to us
  bam <- lapply(bam, function(x) x[ind])
  ranges <- IRanges(start=bam$pos, width=bam$qwidth, names=make.names(bam$qname, unique=TRUE))
  ## names of the bam data frame:
  ## "qname"  "flag"   "rname"  "strand" "pos"    "qwidth"
  ## "mapq"   "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"
  ## construct a genomic ranges object containing all reads
  ranges <- GRanges(seqnames=Rle(bam$rname), ranges=ranges, strand=Rle(bam$strand), flag=bam$flag, readid=bam$rname)
  ## returns the mean coverage for each reference sequence (aka chromosome) in the bam file
  return (mean(coverage(ranges)))
}
And the output:
> bamcoverage("~/test.bam")
gi|225184640|emb|AL009126.3|
15.35720
The good thing about this is that you can do much more with the genomic ranges object than computing the coverage; the downside is that the code is not that efficient.
Here’s a very simple Perl script I wrote that does exactly this; you’re free to use it:
($num, $den) = (0, 0);
while ($cov = <STDIN>) {
    chomp $cov;
    $num += $cov;
    $den++;
}
$cov = $num / $den;
print "Mean Coverage = $cov\n";
Just pipe samtools pileup to it like so:
/path/to/samtools pileup in.bam | awk '{print $4}' | perl mean_coverage.pl
It will also work easily for genomic regions. Just index the BAM file and use samtools view like so:
/path/to/samtools view -b in.bam <genomic region> | /path/to/samtools pileup - | awk '{print $4}' | perl mean_coverage.pl
Hope someone finds this useful.
@michael.james.clark, thank you. I wrote for myself a script that I can add to when the need arises:
#!/usr/bin/env perl
use strict;
use warnings;
# Usage: samtools pileup file.bam | bam_coverage.pl
my $num; # per residue coverage
my $len; # sequence length counter
my $min = 1000; # minimum coverage
my $max = 0; # maximum coverage
while (<>) {
    my @a = split /\t/;
    $num += $a[3];
    $min = $a[3] if $min > $a[3];
    $max = $a[3] if $max < $a[3];
    $len++;
}
printf "Mean coverage : %.1f\n", $num/$len;
printf "Coverage range : %d - %d\n", $min, $max;
How about just running qualimap in GUI mode, or using it to create a QC PDF? You can run it on the whole genome or on one or more regions using a BED file.
This is for people who have a SAM/BAM file consisting of reads mapped to multiple contigs (samtools needs to be in $PATH):
BBMap: sourceforge.net/projects/bbmap/
bbmap/pileup.sh -h
Written by Brian Bushnell
Last modified May 19, 2015
Description: Calculates per-scaffold coverage information from an unsorted sam file.
Usage: pileup.sh in=<input> out=<output>
Input may be stdin or a SAM file, compressed or uncompressed.
Output may be stdout or a file.
Input Parameters:
in=<file> The input sam file; this is the only required parameter.
ref=<file> Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file> An optional fasta file with ORF header information in PRODIGAL's output format. Must also specify 'outorf'.
unpigz=t Decompress with pigz for faster decompression.
...
BioPerl has a library named Bio-Samtools which provides familiar Perl/BioPerl access to samtools. I heard the author, Lincoln Stein, give a talk on how he wove the two together using XS. As I recall, one of his examples addressed exactly the question you posed.