Tools To Calculate Average Coverage For A Bam File?

Tools To Calculate Average Coverage For A Bam File?

12

I would like to get the average coverage of all the captured bases in a bam file. What would be the best way to do this? What I am looking is a simple one number like 40X. Given that there may be millions of bases sequenced in a next gen study I would like to get the overall average coverage for all these bases as a single number.

Thanks


coverage


bam


sequencing

• 180k views

if you have samtools and awk, you can keep it simple with this:

samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'

and with stdev:

samtools depth *bamfile*  |  awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'

updated 3.1 years ago by

34k

written 8.5 years ago by

▴

460

Try the genomeCoverageBed tool in the BEDtools package, which takes a BAM or BED file as input and:

computes a histogram
of feature coverage (e.g., aligned
sequences) for a given genome.
Optionally, by using the –d option, it
will report the depth of coverage at
each base on each chromosome in the
genome file (-g).

The way we run this is to first make a tab-delimited genome.txt file with “contig_name contig_size” information for all contigs in your genome and then run genomeCoverageBed on a sorted BAM file:

$ genomeCoverageBed -ibam sortedBamFile.bam -g genome.txt > coverage.txt

This will give you a summary histogram of coverage across each contig and for the entire genome, from which you can obtain the modal or mean value (the single fold-coverage value you are looking for).

And the mandatory R solution. The Rsamtools package can be used to read in the BAM file.

Here is a code example:

library(Rsamtools)
bamcoverage <- function (bamfile) {
  # read in the bam file
  bam <- scanBam(bamfile)[[1]] # the result comes in nested lists
  # filter reads without match position
  ind <- ! is.na(bam$pos)
  ## remove non-matches, they are not relevant to us
  bam <- lapply(bam, function(x) x[ind])
  ranges <- IRanges(start=bam$pos, width=bam$qwidth, names=make.names(bam$qname, unique=TRUE))
  ## names of the bam data frame:
  ## "qname"  "flag"   "rname"  "strand" "pos"    "qwidth"
  ## "mapq"   "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"
  ## construc: genomic ranges object containing all reads
  ranges <- GRanges(seqnames=Rle(bam$rname), ranges=ranges, strand=Rle(bam$strand), flag=bam$flag, readid=bam$rname )
  ## returns a coverage for each reference sequence (aka. chromosome) in the bam file
  return (mean(coverage(ranges)))      
}

And the output:

> bamcoverage("~/test.bam")
gi|225184640|emb|AL009126.3| 
                    15.35720

The good thing about this is you can do much more with the genomic ranges object than computing the coverage, the downside is, that the code is not that efficient.

Here’s a very simple perl script I wrote that does exactly this that you’re free to use:

($num,$den)=(0,0);
while ($cov=<STDIN>) {
    $num=$num+$cov;
    $den++;
}
$cov=$num/$den;
print "Mean Coverage = $covn";

Just pipe samtools pileup to it like so:

/path/to/samtools pileup in.bam | awk '{print $4}' | perl mean_coverage.pl

It will also work easily for genomic regions. Just index the BAM file and use samtools view like so:

/path/to/samtools view -b in.bam <genomic region> | /path/to/samtools pileup - | awk '{print $4}' | perl mean_coverage.pl

Hope someone finds this useful.

A modern, very fast, solution for this would be mosdepth.

@michael.james.clark, thank you. I wrote for myself a script that I can add to when the need arises:

#!/usr/bin/env perl

use strict;
use warnings;

# Usage: samtools pileup file.bam | bam_coverage.pl

my $num;        # per residue coverage
my $len;        # sequence length counter 
my $min = 1000; # minimum coverage
my $max = 0;    # maximum coverage

while (<>) {
    my @a = split /t/;
    $num += $a[3];
    $min = $a[3] if $min > $a[3];
    $max = $a[3] if $max < $a[3];
    $len++;
}

printf  "Mean coverage  : %.1fn", $num/$len;
printf  "Coverage range : %d - %dn", $min, $max;

updated 3.1 years ago by

34k

written 10.7 years ago by

&utrif;

360

How about just running qualimap in GUI mode or to create a QC PDF? Your can run it on the whole genome or on 1 or multiple regions using a bed file.

global coverage stats

coverage histogram

percentage of genome covered x times plot

qualimap.bioinfo.cipf.es/

use

samtools pileup accepted_hits.bam | awk '{ count++ ; SUM += $4 } END { print "Total: " SUM "t" "Nucleotides: " count "t" "Average_coverage: " SUM/count }'

(use $8 instead; if pileup with -cf option)

updated 3.1 years ago by

34k

written 10.2 years ago by

8.1k

This is for people who have a sam/bam file consisting of reads mapped to multiple contigs (samtools needs to be in $PATH):

BBMap: sourceforge.net/projects/bbmap/

bbmap/pileup.sh -h

Written by Brian Bushnell
Last modified May 19, 2015

Description:  Calculates per-scaffold coverage information from an unsorted sam file.

Usage:  pileup.sh in=<input> out=<output>

Input may be stdin or a SAM file, compressed or uncompressed.
Output may be stdout or a file.

Input Parameters:
in=<file>           The input sam file; this is the only required parameter.
ref=<file>          Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file>     An optional fasta file with ORF header information in PRODIGAL's output format.  Must also specify 'outorf'.
unpigz=t            Decompress with pigz for faster decompression.

..

updated 3.1 years ago by

34k

written 6.3 years ago by

9.9k

BioPerl has a library named Bio-Samtools which provides a familiar Perl/BioPerl access to samtools. I heard the author, Lincoln Stein, give a talk on on how he wove the two together use XS. As I recall, one of his examples was exactly the question you posed

Now the new release of Samtools for coverage analysis.

samtools coverage input.bam

22 days ago by


MSRS

&utrif;

480


Login
before adding your answer.

Traffic: 2438 users visited in the last hour

Read more here: Source link