Count 5’End Mapped To A Specific Genomic Position

Count 5’End Mapped To A Specific Genomic Position

7

I got several SAM/BAM files, and I am interested in 5’ends of the mapped reads. Is there any tools or scripts to count how many 5’ends are mapped at a specific genomic position?

N.B. I am not try to count the total reads number mapped to the specific genomic position.

For example, in the following window, I get 5 reads, 3 of them have a 5’ends at 14488, the other 2 have a 5’ends at 14487.

enter image description here

And I want produce a table like this.

enter image description here.


counts


bam


sam


reads

• 7.3k views

Here is the simplest solution, using R.

library(Rsamtools)
allReads <- scanBam("your.bam")
firstBases <- resize(allReads, 1)
startCounts <- coverage(firstBases)

The resize function takes into account the strand of the reads.

My initial thought is to use bedtools bamtobed to make a BED file from your BAMs. Then, it’s easy to use coding strand data and the start/end positions to get what you want.

Here’s my reasoning:

Let’s say you have the following segment of a BED file:

chr7    127471196  127472363  Pos1  0  +
chr7    127472363  127473530  Pos2  0  +
chr7    127473530  127474697  Pos3  0  +
chr7    127474697  127475864  Pos4  0  +
chr7    127475864  127477031  Neg1  0  -
chr7    127477031  127478198  Neg2  0  -
chr7    127478198  127479365  Neg3  0  -
chr7    127479365  127480532  Pos5  0  +
chr7    127480532  127481699  Neg4  0  -

Pos1 has its 5′ end at the number in column 2: 127471196

Neg1 has its 5′ end at the number in column 3: 127477031

In that way it’s pretty easy to create a table of 5′ starting points based on the BED info. Let me know if you need coding help.

EDIT: Here’s a script I busted out which will give you what you need. It takes a BED file through STDIN and spits a two-column tab-delimited table to STDOUT. Call it like this: perl scriptName.pl < [yourBed] > [yourTableFile]

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

my $lineCounter = 1;
my %startPoints;

while(<>){
        my @columns = split /t/, $_;
        my $startPoint;
        if($columns[5] eq '+'){
                $startPoint = $columns[1];
                ++$startPoints{$startPoint};
        } elsif ($columns[5] eq '-') {
                $startPoint = $columns[2];
                ++$startPoints{$startPoint};
        } else {
                die "Goofy BED Data on line $lineCounter:n$_";
        }
        ++$lineCounter;
}

for my $key (sort {$a <=> $b} keys %startPoints) {
        print "$keyt$startPoints{$key}n";
}

<3

as the location of the read is the 5′ position of the clipped read, you could use the following command:

samtools view -F 4  sorted.bam  | cut -d '   ' -f3,4 | uniq -c

.

$ samtools view -F 4 examples/ex1_sorted.bam  | cut -d '   ' -f3,4 | uniq -c |head
      1 seq1    1
      1 seq1    3
      1 seq1    5
      1 seq1    6
      1 seq1    9
      2 seq1    13
      1 seq1    15
      1 seq1    18
      2 seq1    22
      1 seq1    24

Have a look to the python script level1.py in the PromoterPipeline written by Michiel de Hoon and released in the supporting data of our article in BMC Genomics (Harbers et al., 2014). It can take mutiple BAM files directly as input and make an expression table of 5′ ends. It is written in Python and needs Pysam, that you can find for instance as a package on Debian systems.

Pileup from the BBMap package can do this:

pileup.sh in=file.sam basecov=coverage.txt startcov=t

Without the “startcov=t” flag, it will give the coverage at each position; with that flag, it just gives the count of reads starting at each position.

I add a block of opendir code to @Deedee’s code. I post it here in case I need it someday.

#! /usr/bin/perl
# cd to your BED files directory before run this script

use strict;
use warnings;

# It will read all files ended with bed under current directory.
opendir(DIR, ".");
my @files = sort(grep(/bed$/, readdir(DIR)));
closedir(DIR);

my $i = 0;

foreach (@files) {
      open(IN, "<$_") or die "fail to read file  $_n";
      open(OUT, ">$_.txt") or die "fail to write file $_.outn";

      print "Start working on $files[$i]...n"; $i++;

      my $lineCounter = 1;
      my %startPoints;

      while(<IN>) {
              chomp;
              my @columns = split("t", $_);
              my $startPoint;
              if($columns[5] eq '+') {
                      $startPoint = $columns[1];
                      ++$startPoints{$startPoint};
              } elsif($columns[5] eq '-') {
                      $startPoint = $columns[2];
                      ++$startPoints{$startPoint};
              } else{
                      die "Error $files[$i] on line $lineCounter:n$_";
              }   
              ++$lineCounter;
      }   

      for my $key (sort {$a <=> $b} keys %startPoints) {
              print OUT "$keyt$startPoints{$key}n";
      }   

      close IN; 
      close OUT;
}


Login
before adding your answer.

Traffic: 2287 users visited in the last hour

Read more here: Source link