Help extracting FASTA sequences for StringTie-assigned gene IDs

Hello everyone, I hope you are well.

I am writing this post because I have a question, or rather a problem, with my workflow.

I ran an RNA-seq processing workflow as follows:

quality control – HISAT2 – StringTie – DESeq2

It is a simple, standard workflow, and it gave me meaningful differential expression results. HISAT2 produces .sam files, which I convert to .bam with SAMtools so that StringTie can work with them. StringTie then generates GTF output files.

The GTF annotation file that StringTie produces contains only coordinates, not the sequences of the genes it annotates. StringTie assigns IDs to these genes, and DESeq2 keeps using those IDs further down the workflow.

Unfortunately, reference annotation files can be incomplete, and where no annotated gene matches, StringTie simply assigns its own ID to a possible gene.

In DESeq2 I can run the differential expression analysis, and it tells me which genes are overexpressed and which are not. But when I look at the genes with the most activity, I see that some of them are identified only by the IDs assigned by StringTie.
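As a first step, the StringTie-assigned IDs can be pulled out of the DESeq2 results. The snippet below is only a minimal sketch: it assumes the results were exported to CSV with columns named `gene_id`, `log2FoldChange` and `padj` (the `gene_id` column name is an assumption, since DESeq2 row names are usually written to an unnamed first column), and that the IDs use StringTie's default `MSTRG.` prefix. The function name and the example table are hypothetical.

```python
import csv
import io

def significant_stringtie_ids(csv_text, prefix="MSTRG.", alpha=0.05):
    """Return IDs starting with `prefix` whose adjusted p-value is below `alpha`."""
    ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        padj = row["padj"]
        if padj in ("", "NA"):  # skip genes without an adjusted p-value
            continue
        if row["gene_id"].startswith(prefix) and float(padj) < alpha:
            ids.append(row["gene_id"])
    return ids

# Made-up example table, for illustration only.
example = """gene_id,log2FoldChange,padj
MSTRG.12,3.1,0.001
ENSG00000141510,-2.0,0.0004
MSTRG.40,0.2,0.8
MSTRG.77,-1.5,NA
"""

print(significant_stringtie_ids(example))
```

The resulting list of IDs is what would then be looked up in the StringTie GTF.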

I would like to extract the FASTA sequences for those IDs so that I can run an alignment (for example with BLAST) that tells me which gene each of them might be.
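One way this could work: the GTF holds the exon coordinates for each StringTie ID, so the sequences can be cut out of the same genome FASTA that was used for the HISAT2 index. The sketch below assumes a standard 9-column GTF with `gene_id` and `transcript_id` attributes and a genome small enough to hold in memory; all function names and the toy data are hypothetical. Dedicated tools such as gffread (`gffread -w transcripts.fa -g genome.fa stringtie.gtf`) do the same job and are likely the more robust choice on real data.

```python
import re
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def read_fasta(lines):
    """Parse FASTA lines into {sequence_name: sequence}."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif line and name is not None:
            chunks[name].append(line)
    return {k: "".join(v) for k, v in chunks.items()}

def transcript_sequences(gtf_lines, genome, wanted_gene_ids):
    """Return {transcript_id: spliced sequence} for the requested gene IDs."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', f[8]))
        if attrs.get("gene_id") in wanted_gene_ids:
            key = (attrs["transcript_id"], f[0], f[6])
            exons[key].append((int(f[3]), int(f[4])))
    seqs = {}
    for (tid, chrom, strand), coords in exons.items():
        # GTF coordinates are 1-based and inclusive.
        seq = "".join(genome[chrom][s - 1:e] for s, e in sorted(coords))
        # A strand of "." (unknown) is treated like "+" here.
        seqs[tid] = reverse_complement(seq) if strand == "-" else seq
    return seqs

# Toy genome and GTF, for illustration only.
genome = read_fasta([">chr1 toy", "AAACCCGGG", "TTTACGT"])

def gtf(chrom, feature, start, end, strand, gene, tx):
    attrs = f'gene_id "{gene}"; transcript_id "{tx}";'
    return "\t".join([chrom, "StringTie", feature, str(start), str(end),
                      "1000", strand, ".", attrs])

gtf_lines = [
    gtf("chr1", "transcript", 1, 9, "+", "MSTRG.1", "MSTRG.1.1"),
    gtf("chr1", "exon", 1, 3, "+", "MSTRG.1", "MSTRG.1.1"),
    gtf("chr1", "exon", 7, 9, "+", "MSTRG.1", "MSTRG.1.1"),
    gtf("chr1", "exon", 10, 14, "-", "MSTRG.2", "MSTRG.2.1"),
]

seqs = transcript_sequences(gtf_lines, genome, {"MSTRG.1", "MSTRG.2"})
for tid, seq in seqs.items():
    print(f">{tid}\n{seq}")  # FASTA records that could be pasted into BLAST
```

The output is one spliced sequence per transcript, which is usually what a BLAST search against nucleotide databases expects; minus-strand transcripts are reverse-complemented so that they read 5' to 3'.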

I hope I’m not crazy for thinking that this can be done.
