Running htseq-count to “grab” long non-coding gene_id names

Hi all,

New to bioinformatics, so bear with me. I am trying to find long non-coding RNAs in RNA-seq data. When I checked the human GTF file, I found two different labels for long non-coding RNA, “lnc_RNA” and “lncRNA”, like so:

NC_000001.11 Gnomon transcript 29926 31295 . + . gene_id
"MIR1302-2HG"; transcript_id "XR_001737835.1"; db_xref
"GeneID:107985730"; gbkey "ncRNA"; gene "MIR1302-2HG"; model_evidence
"Supporting evidence includes similarity to: 100% coverage of the
annotated genomic feature by RNAseq alignments, including 8 samples
with support for all annotated introns"; product "MIR1302-2 host gene,
transcript variant X2"; transcript_biotype "lnc_RNA";

NC_000001.11 BestRefSeq gene 34611 36081 . - . gene_id "FAM138A";
transcript_id ""; db_xref "GeneID:645520"; db_xref "HGNC:HGNC:32334";
description "family with sequence similarity 138 member A"; gbkey
"Gene"; gene "FAM138A"; gene_biotype "lncRNA"; gene_synonym "F379";
gene_synonym "FAM138F";

“lnc_RNA” appears on the “transcript” line (as transcript_biotype), and “lncRNA” appears on the “gene” line (as gene_biotype). My first question: should I choose “lncRNA”?

And most importantly, how do I get only the “gene_id” names of the records that have “lncRNA”?

Edit: for the second question I ran: grep 'lncRNA' GRCh38.p13_genomic.gtf > GRCh38.p13_genomic_lnc.gtf
and proceeded as usual.
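
To make the second question concrete, here is a minimal Python sketch of one way to pull only the gene_id values from "gene" records whose gene_biotype is "lncRNA". It assumes a tab-separated GTF with attributes written like the two records shown above; the function name lncrna_gene_ids and the sample lines are hypothetical, made up for illustration only.

```python
import re

# Attribute patterns as they appear in the records quoted above
# (assumption: the whole file uses the same key "value"; layout).
GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')


def lncrna_gene_ids(gtf_lines, biotype="lncRNA"):
    """Return unique gene_id values from 'gene' records with the given gene_biotype."""
    ids = []
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attributes = fields[8]
        biotype_match = GENE_BIOTYPE.search(attributes)
        id_match = GENE_ID.search(attributes)
        if not biotype_match or not id_match:
            continue
        gene_id = id_match.group(1)
        if biotype_match.group(1) == biotype and gene_id not in seen:
            seen.add(gene_id)
            ids.append(gene_id)
    return ids


# Hypothetical sample lines modelled on the two records above.
sample = [
    'NC_000001.11\tGnomon\ttranscript\t29926\t31295\t.\t+\t.\t'
    'gene_id "MIR1302-2HG"; transcript_id "XR_001737835.1"; transcript_biotype "lnc_RNA";\n',
    'NC_000001.11\tBestRefSeq\tgene\t34611\t36081\t.\t-\t.\t'
    'gene_id "FAM138A"; transcript_id ""; gene_biotype "lncRNA";\n',
]
print(lncrna_gene_ids(sample))
```

With a real file, the same function can be given an open file handle, for example `with open("GRCh38.p13_genomic.gtf") as handle: ids = lncrna_gene_ids(handle)`. The resulting list could then be used to subset the htseq-count output instead of filtering the GTF beforehand, although whether that fits the rest of the workflow is a judgment call.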

But is my choice of “lncRNA” correct?
