I did it like that:
1- Download refGene.txt.gz and hg19.fasta from the UCSC goldenpath. ( note: convert hg19.2bit to hg19.fa using twoBitToFa )
2- Create a bed file with exon coordiniate using my awk script
// to_transcript.awk
BEGIN
{
OFS ="t"
}
{
name=$2
name2=$13
sens = $4 =="+" ? "forward" : "reverse"
chrom=$3
split($10,exon_starts,",")
split($11,exon_ends,",")
for (i=1; i<length(exon_starts); i++)
{
start = exon_starts[i];
end = exon_ends[i];
print(chrom, start, end, name2,sens, name)
}
}
And run it :
zcat refGene.txt.gz|awk -f to_transcript.awk > exon.bed
3 – Extract sequence using bedtools
bedtools getfasta -fi hg19.fa -bed exon.bed -fo exon.fa -name
4 – You may want to reverse sequence for gene placed on the reverse strand. Split exon.bed into and exon_minus.bed and exon_plus.bed using grep on column 5th. And do same as I described above to have exon_minus.fa and exon_plus.fa
You can then revert your sequence using :
seqtk seq -r exon_minus.fasta> exon_minus.rev.fasta
and concat both file :
cat exon_plus.fasta exon_minus.rev.fasta > all_exons.fa
Read more here: Source link