Standard way to convert array variants to reference-based REF/ALT

Dear community members,

I have an Illumina array, and after conversion to VCF it looks like this (one line as an example):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NAME    
1   752721  rs3131972   C   T   .   .   PR  GT  0/1

Now I need to extract information about these variants from a large cohort of WGS samples.

The problem is that C is not actually the REF allele for this variant ( www.ncbi.nlm.nih.gov/snp/rs3131972?horizontal_tab=true ). For some variants the REF in my VCF really is the reference allele, but for half of them REF and ALT are switched.

When I look up this variant in the array specs, I see this line:

rs3131972-138_T_R_2263598533,rs3131972,TOP,[A/G],0060710106,AACGTTCACTTTCTGTCTGTGTTCACGTCACCAAGAGAATAGAAAGGAAA,,,37,1,752721,diploid,Homo sapiens,dbSNP,138,BOT,GCCTGGACTGGAGGGCTGTCTCAAGGAGGGTGACGTGTCTTTGACTTTTGCATTCTTCCC[T/C]TTTCCTTTCTATTCTCTTGGTGACGTGAACACAGACAGAAAGTGAACGTTTTTTGCATAA,TTATGCAAAAAACGTTCACTTTCTGTCTGTGTTCACGTCACCAAGAGAATAGAAAGGAAA[A/G]GGGAAGAATGCAAAAGTCAAAGACACGTCACCCTCCTTGAGACAGCCCTCCAGTCCAGGC,1897,3,0,+

So here the variant is even listed as A/G.

Is there a way to normalize a VCF against the reference so that REF/ALT are fixed? I am absolutely lost: I expected this to be a very simple procedure, but it seems very complex. I can't even rely on rs-IDs, since they are missing for many array variants.
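To make the question concrete, here is a minimal Python sketch of the per-variant check I imagine such a normalization would perform. It is only an illustration under several assumptions: the function names `fix_ref_alt` and `swap_gt` are hypothetical, it handles biallelic SNPs only, and the reference base is passed in by hand (in practice it would be read from the reference FASTA at CHROM:POS). Palindromic A/T and C/G sites cannot be resolved by comparing bases alone, so the sketch just reports them.

```python
# Hypothetical helpers, not part of any existing tool.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(allele):
    """Return the complementary base (single-base alleles only)."""
    return COMPLEMENT[allele.upper()]


def fix_ref_alt(ref, alt, genome_base):
    """Classify a biallelic SNP against the reference base at its position.

    Returns (new_ref, new_alt, action), where action is one of
    'ok', 'swap', 'flip', 'flip+swap', 'ambiguous', 'unresolved'.
    """
    ref, alt, genome_base = ref.upper(), alt.upper(), genome_base.upper()

    # A/T and C/G SNPs look the same on both strands: cannot be decided here.
    if complement(ref) == alt:
        return ref, alt, "ambiguous"
    if ref == genome_base:
        return ref, alt, "ok"
    if alt == genome_base:
        return alt, ref, "swap"

    # Try the opposite strand.
    c_ref, c_alt = complement(ref), complement(alt)
    if c_ref == genome_base:
        return c_ref, c_alt, "flip"
    if c_alt == genome_base:
        return c_alt, c_ref, "flip+swap"
    return ref, alt, "unresolved"


def swap_gt(gt):
    """Exchange allele indices 0 and 1 in a GT string after a REF/ALT swap."""
    return gt.translate(str.maketrans("01", "10"))


# Example with the record above; the genome base "A" is an assumption
# for illustration and should really come from the reference FASTA.
new_ref, new_alt, action = fix_ref_alt("C", "T", "A")
print(new_ref, new_alt, action)
```

Whenever the action includes a swap, the genotypes have to be recoded as well (for example with `swap_gt`), otherwise a 0/0 call would silently keep pointing at the wrong allele.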
