GRCh37 GFF filter transcript isoforms by RefSeq Select tag or longest
Dear all,
I tried to filter the “RefSeq Select” transcript isoforms in the GRCh37.p13 human genome annotation gff (GCF_000001405.25_GRCh37.p13_genomic.gff.gz). Specifically my goal is to retain for each gene a transcript isoform with a tag=RefSeq Select
attribute if exists, and in case none is present, the longest isoform.
My idea was to use @Juke34’s AGAT tools. I planned to
- filter the longest isoforms
- filter a second gff for the tag using
agat_sp_filter_feature_by_attribute_presence.pl
->docu, - create an ID list of the genes contained in that gff
- remove those from the longest isoform gff
- merge the resulting gff with the filtered
On the one hand, the number of steps in this strategy makes it a bit delicate, on the other hand, it falls flat at step 2 anyway:
# get gff (~20mb)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
# build toy set
zgrep "gene=KRAS;" GCF_000001405.25_GRCh37.p13_genomic.gff.gz
>kras_excerpt.gff
agat_sp_filter_feature_by_attribute_presence.pl
--gff kras_excerpt.gff
--attribute tag
--flip
--output test_flip
# empty result
cat test_flip
##gff-version 3
It’s quite likely my googlefoo failed me this time: Does anybody have some recommendation how to approach this? Or is the only solution writing yet another gff processor tool?
• 20 views
Read more here: Source link