Count 5’End Mapped To A Specific Genomic Position
I got several SAM/BAM files, and I am interested in 5’ends of the mapped reads. Is there any tools or scripts to count how many 5’ends are mapped at a specific genomic position?
N.B. I am not try to count the total reads number mapped to the specific genomic position.
For example, in the following window, I get 5 reads, 3 of them have a 5’ends at 14488, the other 2 have a 5’ends at 14487.
And I want produce a table like this.
.
• 7.3k views
My initial thought is to use bedtools bamtobed
to make a BED file from your BAMs. Then, it’s easy to use coding strand data and the start/end positions to get what you want.
Here’s my reasoning:
Let’s say you have the following segment of a BED file:
chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +
chr7 127473530 127474697 Pos3 0 +
chr7 127474697 127475864 Pos4 0 +
chr7 127475864 127477031 Neg1 0 -
chr7 127477031 127478198 Neg2 0 -
chr7 127478198 127479365 Neg3 0 -
chr7 127479365 127480532 Pos5 0 +
chr7 127480532 127481699 Neg4 0 -
Pos1 has its 5′ end at the number in column 2: 127471196
Neg1 has its 5′ end at the number in column 3: 127477031
In that way it’s pretty easy to create a table of 5′ starting points based on the BED info. Let me know if you need coding help.
EDIT: Here’s a script I busted out which will give you what you need. It takes a BED file through STDIN and spits a two-column tab-delimited table to STDOUT. Call it like this: perl scriptName.pl < [yourBed] > [yourTableFile]
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $lineCounter = 1;
my %startPoints;
while(<>){
my @columns = split /t/, $_;
my $startPoint;
if($columns[5] eq '+'){
$startPoint = $columns[1];
++$startPoints{$startPoint};
} elsif ($columns[5] eq '-') {
$startPoint = $columns[2];
++$startPoints{$startPoint};
} else {
die "Goofy BED Data on line $lineCounter:n$_";
}
++$lineCounter;
}
for my $key (sort {$a <=> $b} keys %startPoints) {
print "$keyt$startPoints{$key}n";
}
<3
as the location of the read is the 5′ position of the clipped read, you could use the following command:
samtools view -F 4 sorted.bam | cut -d ' ' -f3,4 | uniq -c
.
$ samtools view -F 4 examples/ex1_sorted.bam | cut -d ' ' -f3,4 | uniq -c |head
1 seq1 1
1 seq1 3
1 seq1 5
1 seq1 6
1 seq1 9
2 seq1 13
1 seq1 15
1 seq1 18
2 seq1 22
1 seq1 24
Have a look to the python script level1.py in the PromoterPipeline written by Michiel de Hoon and released in the supporting data of our article in BMC Genomics (Harbers et al., 2014). It can take mutiple BAM files directly as input and make an expression table of 5′ ends. It is written in Python and needs Pysam, that you can find for instance as a package on Debian systems.
I add a block of opendir code to @Deedee’s code. I post it here in case I need it someday.
#! /usr/bin/perl
# cd to your BED files directory before run this script
use strict;
use warnings;
# It will read all files ended with bed under current directory.
opendir(DIR, ".");
my @files = sort(grep(/bed$/, readdir(DIR)));
closedir(DIR);
my $i = 0;
foreach (@files) {
open(IN, "<$_") or die "fail to read file $_n";
open(OUT, ">$_.txt") or die "fail to write file $_.outn";
print "Start working on $files[$i]...n"; $i++;
my $lineCounter = 1;
my %startPoints;
while(<IN>) {
chomp;
my @columns = split("t", $_);
my $startPoint;
if($columns[5] eq '+') {
$startPoint = $columns[1];
++$startPoints{$startPoint};
} elsif($columns[5] eq '-') {
$startPoint = $columns[2];
++$startPoints{$startPoint};
} else{
die "Error $files[$i] on line $lineCounter:n$_";
}
++$lineCounter;
}
for my $key (sort {$a <=> $b} keys %startPoints) {
print OUT "$keyt$startPoints{$key}n";
}
close IN;
close OUT;
}
Traffic: 2287 users visited in the last hour
Read more here: Source link