My aim is to create a custom protein sequence reference file (protein.fa) from genomic coordinates (origin.bed).
(origin.bed; with Chromosome, start, end, TranscriptID, strand, GeneID)
chr1 109202569 109202584 ENST00000370031.1_uORF_0 - ENSG00000162639.11
chr1 109203584 109203617 ENST00000370031.1_uORF_0 - ENSG00000162639.11
chr11 102188276 102188302 ENST00000263464.3_uORF_0 + ENSG00000023445.9
chr11 10830291 10830306 ENST00000530211.1_uORF_1 - ENSG00000110321.11
chr11 10830400 10830451 ENST00000530211.1_uORF_0 - ENSG00000110321.11
Desired two step output: genomic nucleotide sequence file (nucleotide.fa; 0-based) and protein sequence file (protein.fa).
(Illustration; Sequences do not fit to the coordinates in the snippet)
>chr1:109202569-109203617(ENST00000370031.1_uORF_0)
TCACCCGGAACAGATTTTTTCCTCCTTGGATGACTCCGTCATGGACAC
>chr11:102188276-102188302(ENST00000263464.3_uORF_0)
nucleotide sequence xyz
>chr11:10830291-10830306(ENST00000263464.3_uORF_0)
nucleotide sequence xyz
>chr11:10830400-10830451(ENST00000263464.3_uORF_0)
nucleotide sequence xyz
>chr1:109202569-109203617(ENST00000370031.1_uORF_0)
SPGTDFFLLGULRHGH
>chr11:102188276-102188302(ENST00000263464.3_uORF_0)
protein sequence xyz
...
An important point is that entries with the same TranscriptID should be concatenated prior to protein sequence translation. As illustrated for chr1 TranscriptID entries, there are two coordinate pairs (origin.bed; length: 15 and 33), which should form one sequence. A GeneID based concatenation is not possible, as TranscriptIDs can be different for the same GeneID (Last two entries for chr11), which would lead to faulty sequences.
Furthermore, the sequences should be strand-specific, by which (-) stranded sequences need to be reverse complemented for the nucleotide.fa.
My attempt to obtain the nucleotide sequences was based on bedtools getfasta (bedtools.readthedocs.io/en/latest/content/tools/getfasta.html). However, as seen in the doc, the desired output is not possible, but is really necessary for my downstream quality control.
Conversion of a sample nucleotide.fa does work straightforward with faTrans. Therefore, I really appreciate any ideas on how to convert the origin.bed to nucleotide.fa. Are you aware of any other packages or conversion tools that would be suitable?
Thank you!
Based on the documentation for bedtools getfasta that you linked, it would appear that it would do what you want, with the -s
option to reverse complement the features on the (-) strand and the -split
option to concatenate the exons of the same transcript.
You would first need to convert your BED file to the “BED12” format with one line per transcript, that bedtools getfasta expects when running with the -split
option.
You could write your own code to do this transformation or use a script like bed6ToBed12. Here’s an example with it:
awk 'BEGIN{FS=OFS="\t"} {print $1, $2, $3, gensub(/_.*/, "", 1, $4), 0, $5}' origin.bed |\
bed6Tobed12.sh -i stdin > origin.bed12
(The AWK invocation in front is just to convert your input file to proper BED6 format, where the strand is the 6th field, and to extract the transcript name into the 4th field.)
After this you could run bedtools getfasta to get the nucleotide sequences:
bedtools getfasta -fi reference.fa -bed origin.bed12 -s -name -split > nucleotide.fa
Answered By – Jukka Matilainen
Answer Checked By – Senaida (WPSolving Volunteer)
Read more here: Source link