Exon coordinates and sequence

I did it like that:

1- Download refGene.txt.gz and hg19.fasta from the UCSC goldenpath. ( note: convert hg19.2bit to hg19.fa using twoBitToFa )
2- Create a bed file with exon coordiniate using my awk script

 // to_transcript.awk 

BEGIN 
{
 OFS ="t"
}
{
    name=$2
    name2=$13
    sens = $4 =="+" ? "forward" : "reverse"
    chrom=$3

    split($10,exon_starts,",")
    split($11,exon_ends,",")


    for (i=1; i<length(exon_starts); i++)
    {
        start = exon_starts[i];
        end   = exon_ends[i];
        print(chrom, start, end, name2,sens, name)
    }

}

And run it :

  zcat refGene.txt.gz|awk -f to_transcript.awk > exon.bed

3 – Extract sequence using bedtools

  bedtools getfasta -fi hg19.fa -bed exon.bed -fo exon.fa -name

4 – You may want to reverse sequence for gene placed on the reverse strand. Split exon.bed into and exon_minus.bed and exon_plus.bed using grep on column 5th. And do same as I described above to have exon_minus.fa and exon_plus.fa
You can then revert your sequence using :

   seqtk seq -r exon_minus.fasta> exon_minus.rev.fasta

and concat both file :

   cat exon_plus.fasta exon_minus.rev.fasta > all_exons.fa

Read more here: Source link