GRCh37 GFF filter transcript isoforms by RefSeq Select tag or longest

GRCh37 GFF filter transcript isoforms by RefSeq Select tag or longest

0

Dear all,

I tried to filter the “RefSeq Select” transcript isoforms in the GRCh37.p13 human genome annotation gff (GCF_000001405.25_GRCh37.p13_genomic.gff.gz). Specifically my goal is to retain for each gene a transcript isoform with a tag=RefSeq Select attribute if exists, and in case none is present, the longest isoform.

My idea was to use @Juke34’s AGAT tools. I planned to

  1. filter the longest isoforms
  2. filter a second gff for the tag using agat_sp_filter_feature_by_attribute_presence.pl ->docu,
  3. create an ID list of the genes contained in that gff
  4. remove those from the longest isoform gff
  5. merge the resulting gff with the filtered

On the one hand, the number of steps in this strategy makes it a bit delicate, on the other hand, it falls flat at step 2 anyway:

# get gff (~20mb)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
# build toy set
zgrep "gene=KRAS;" GCF_000001405.25_GRCh37.p13_genomic.gff.gz 
    >kras_excerpt.gff
agat_sp_filter_feature_by_attribute_presence.pl 
    --gff kras_excerpt.gff 
    --attribute tag 
    --flip 
    --output test_flip
# empty result
cat test_flip
##gff-version 3

It’s quite likely my googlefoo failed me this time: Does anybody have some recommendation how to approach this? Or is the only solution writing yet another gff processor tool?


GFF

• 20 views

Read more here: Source link