[SOLVED] Special .bed to .fa conversion (GenomicCoordinates/DNAsequence) ~ Linux Fixes

My aim is to create a custom protein sequence reference file (protein.fa) from genomic coordinates (origin.bed).

(origin.bed; with Chromosome, start, end, TranscriptID, strand, GeneID)

chr1    109202569       109202584       ENST00000370031.1_uORF_0        -       ENSG00000162639.11
chr1    109203584       109203617       ENST00000370031.1_uORF_0        -       ENSG00000162639.11
chr11   102188276       102188302       ENST00000263464.3_uORF_0        +       ENSG00000023445.9
chr11   10830291        10830306        ENST00000530211.1_uORF_1        -       ENSG00000110321.11
chr11   10830400        10830451        ENST00000530211.1_uORF_0        -       ENSG00000110321.11

Desired two step output: genomic nucleotide sequence file (nucleotide.fa; 0-based) and protein sequence file (protein.fa).

(Illustration; Sequences do not fit to the coordinates in the snippet)

nucleotide sequence xyz
nucleotide sequence xyz
nucleotide sequence xyz
protein sequence xyz

An important point is that entries with the same TranscriptID should be concatenated prior to protein sequence translation. As illustrated for chr1 TranscriptID entries, there are two coordinate pairs (origin.bed; length: 15 and 33), which should form one sequence. A GeneID based concatenation is not possible, as TranscriptIDs can be different for the same GeneID (Last two entries for chr11), which would lead to faulty sequences.
Furthermore, the sequences should be strand-specific, by which (-) stranded sequences need to be reverse complemented for the nucleotide.fa.

My attempt to obtain the nucleotide sequences was based on bedtools getfasta (bedtools.readthedocs.io/en/latest/content/tools/getfasta.html). However, as seen in the doc, the desired output is not possible, but is really necessary for my downstream quality control.

Conversion of a sample nucleotide.fa does work straightforward with faTrans. Therefore, I really appreciate any ideas on how to convert the origin.bed to nucleotide.fa. Are you aware of any other packages or conversion tools that would be suitable?

Thank you!

Based on the documentation for bedtools getfasta that you linked, it would appear that it would do what you want, with the -s option to reverse complement the features on the (-) strand and the -split option to concatenate the exons of the same transcript.

You would first need to convert your BED file to the “BED12” format with one line per transcript, that bedtools getfasta expects when running with the -split option.

You could write your own code to do this transformation or use a script like bed6ToBed12. Here’s an example with it:

awk 'BEGIN{FS=OFS="\t"} {print $1, $2, $3, gensub(/_.*/, "", 1, $4), 0, $5}' origin.bed |\
bed6Tobed12.sh -i stdin > origin.bed12

(The AWK invocation in front is just to convert your input file to proper BED6 format, where the strand is the 6th field, and to extract the transcript name into the 4th field.)

After this you could run bedtools getfasta to get the nucleotide sequences:

bedtools getfasta -fi reference.fa -bed origin.bed12 -s -name -split > nucleotide.fa

Answered By – Jukka Matilainen

Answer Checked By – Senaida (WPSolving Volunteer)

Read more here: Source link