Count protein-coding genes per contig

If you have an annotation file, for example the following GTF from human:

1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1   havana  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; tag "basic"; transcript_support_level "1";

You can count the protein-coding genes per chromosome (contig) by keeping only the gene records, filtering for the protein_coding biotype, and tallying the first column:

awk '$3=="gene"' Homo_sapiens.GRCh38.98.chr.gtf | grep protein_coding | cut -f1 | sort | uniq -c
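If you would rather not depend on grep matching "protein_coding" anywhere on the line, the same count can be done in a single awk call that checks the attribute column directly. This is only a sketch: it assumes a tab-separated, Ensembl-style GTF where the biotype is stored as gene_biotype (GENCODE files use gene_type instead), and the records in the temporary file below are made up for illustration.

```shell
# Build a tiny tab-separated GTF with made-up records (illustration only).
gtf=$(mktemp)
{
  printf '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
  printf '1\thavana\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
  printf '1\thavana\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";\n'
  printf '1\thavana\tgene\t500\t600\t.\t+\t.\tgene_id "G3"; gene_biotype "protein_coding";\n'
  printf '2\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G4"; gene_biotype "protein_coding";\n'
} > "$gtf"

# Keep gene records whose attribute column (field 9) carries the
# protein_coding biotype, then count them per contig (field 1).
counts=$(awk -F'\t' '
  $3 == "gene" && $9 ~ /gene_biotype "protein_coding"/ { n[$1]++ }
  END { for (c in n) print c "\t" n[c] }
' "$gtf" | sort -k1,1)

printf '%s\n' "$counts"
rm -f "$gtf"
```

On the toy file this reports two protein-coding genes on contig 1 and one on contig 2; the transcript record and the lncRNA gene are skipped. To run it on real data, replace "$gtf" with the path to your annotation file.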

