Hello, I have two types of VEP-annotated files: a regular VCF and a gnomAD genomes file.
I would like to extract the counts of missense, synonymous, upstream and intron variants for each gene in each file. The output should look something like this:
MTHFR: missense 23, intron 100, synonymous 300
or
missense: MTHFR: 23, BRCA1: 50, etc.
I have looked for similar questions here, but either no suitable solution was offered or the issue remained unresolved.
The VEP summary file only gives general information about the whole VCF file. SnpEff Count reports read counts rather than variant counts.
I tried to write a Python script, but the gnomAD genomes file has a varying number of VEP fields per record and is 12 GB in size, so this is too difficult for me in Python, and running it could take ages.
Here is an example from gnomAD:
> chrY 2893551 . TTTTA T . AC0
> AC=0;AN=33443;AF=0.00000;AC_non_neuro_nfe=0;many_frequency_data_here;many_VEP_field_here|upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|ENST00000652562|processed_transcript||||||||||1|3001|-1|deletion||HGNC|HGNC:37119|||||||||||||||||||||
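One possible approach to a record like the one above is to stream the file line by line and keep only a small table of counts in memory, so the 12 GB size matters less. Below is a minimal Python sketch, not a tested pipeline. It assumes the VEP annotation lives in an INFO key named `vep` (plain VEP output normally uses `CSQ`), that the header declares the sub-field order after `Format:`, and that a variant should be counted once per gene and consequence even when several transcripts carry it. The function name `count_vcf_consequences` and the `WANTED` set are hypothetical names made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Consequence terms of interest (Sequence Ontology names as written by VEP).
WANTED = {"missense_variant", "synonymous_variant",
          "upstream_gene_variant", "intron_variant"}


def count_vcf_consequences(lines, info_key="vep", wanted=WANTED):
    """Return {gene: Counter({consequence: n_variants})} from VCF lines.

    info_key is an assumption: "vep" for gnomAD-style files, "CSQ" for
    default VEP output.
    """
    counts = defaultdict(Counter)
    sym_idx = csq_idx = None
    prefix = info_key + "="
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=" + info_key + ","):
            # The header lists the pipe-separated sub-fields after "Format: ".
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                fields = match.group(1).strip().split("|")
                sym_idx = fields.index("SYMBOL")
                csq_idx = fields.index("Consequence")
            continue
        if not line or line.startswith("#"):
            continue
        if sym_idx is None:
            raise ValueError("no Format description found for " + info_key)
        info = line.split("\t", 8)[7]  # INFO is the 8th VCF column
        ann = next((f[len(prefix):] for f in info.split(";")
                    if f.startswith(prefix)), None)
        if ann is None:
            continue
        seen = set()  # (gene, consequence) pairs seen for this variant
        for entry in ann.split(","):  # one entry per transcript
            parts = entry.split("|")
            if len(parts) <= max(sym_idx, csq_idx):
                continue
            gene = parts[sym_idx]
            if not gene:
                continue
            for csq in parts[csq_idx].split("&"):  # combined terms
                if csq in wanted:
                    seen.add((gene, csq))
        for gene, csq in seen:
            counts[gene][csq] += 1
    return counts
```

For a compressed file, the lines could be fed from `gzip.open(path, "rt")`, where `path` is whatever the file is called locally, and the result printed gene by gene, for example with `for gene, c in counts.items(): print(gene, dict(c))`.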
Here is a fragment of the sample VCF:
chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript synonymous_variant 216 180 60 S tcA/tcG - IMPACT=LOW;STRAND=1
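The fragment above looks like VEP's default tab-delimited output, so the same kind of count could be sketched for it as well. The Python below is only an illustration under several assumptions: the columns are tab-separated in the default order (variant name first, gene ID fourth, consequence seventh, Extra fourteenth), rows belonging to one variant are adjacent, and a gene symbol is only available if the Extra column contains `SYMBOL=...`; otherwise the Ensembl gene ID is used as the key. `count_tab_consequences` and `WANTED` are hypothetical names.

```python
from collections import Counter, defaultdict

# Consequence terms of interest (Sequence Ontology names as written by VEP).
WANTED = {"missense_variant", "synonymous_variant",
          "upstream_gene_variant", "intron_variant"}


def count_tab_consequences(lines, wanted=WANTED):
    """Return {gene: Counter({consequence: n_variants})} from VEP tab output."""
    counts = defaultdict(Counter)
    current_variant = None
    seen = set()  # (gene, consequence) pairs already counted for this variant
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 14:
            continue  # not a full default-format row
        variant, gene_id, consequence, extra = cols[0], cols[3], cols[6], cols[13]
        if variant != current_variant:
            # Assumes all rows of one variant are contiguous in the file.
            current_variant = variant
            seen = set()
        extra_map = dict(kv.split("=", 1) for kv in extra.split(";") if "=" in kv)
        gene = extra_map.get("SYMBOL", gene_id)  # fall back to the Ensembl ID
        for csq in consequence.split(","):  # several terms may be listed
            if csq in wanted and (gene, csq) not in seen:
                seen.add((gene, csq))
                counts[gene][csq] += 1
    return counts
```

Counting once per variant, gene and consequence keeps a variant that hits several transcripts of the same gene from being counted repeatedly; if per-transcript counts are wanted instead, the `seen` check can be dropped.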
I am really stuck with this, so any help will be appreciated.