I am going to be working with VCF files a lot in the near future so I thought I would brush up on the practice.
After much reading and research, there’s something that I just can’t wrap my head around.
1) In a diploid organism, you have two alleles for a particular gene. My question is: how is this captured in the reference sequence, when the reference is “single stranded” in the sense that it holds only one base per position and so cannot represent an individual who is heterozygous at that position? For example, the reference genome at a particular locus will have only one nucleotide present.
In the following example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480
The sample NA12878 is called heterozygous at this position: he/she carries one T allele and one G allele at this locus on chromosome 20. My question above concerns the reference, which has only one base here. Shouldn’t there actually be two alleles if the reference individual was heterozygous? If the reference individual was also heterozygous at this position and, let’s say, also had a G allele at the same locus, then shouldn’t the genotype be reported as 0/0, so that there would in fact be no variant at all?
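To make sure I am reading the record correctly, here is a minimal Python sketch of how I understand the GT field: each number is an index into the allele list, with 0 meaning REF and 1 meaning the first ALT. The function name `decode_genotype` is my own (hypothetical), and the sketch assumes whitespace-separated fields and a single sample column, as in the line above.

```python
def decode_genotype(record_line):
    """Return the called alleles and per-allele read depths for one VCF record.

    Assumes a single sample column and whitespace-separated fields.
    """
    fields = record_line.split()
    ref, alt = fields[3], fields[4]

    # FORMAT keys (column 9) describe the colon-separated sample values (column 10).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    # Allele index 0 is REF; indices 1, 2, ... are the comma-separated ALT alleles.
    alleles = [ref] + alt.split(",")

    # GT uses "/" for unphased and "|" for phased calls; "." marks a missing allele.
    gt_indices = sample["GT"].replace("|", "/").split("/")
    called = [alleles[int(i)] if i != "." else "." for i in gt_indices]

    # AD lists read depths in the same order as the allele list.
    depths = dict(zip(alleles, (int(d) for d in sample["AD"].split(","))))
    return called, depths


record = "20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480"
called, depths = decode_genotype(record)
print(called)  # ['T', 'G']
print(depths)  # {'T': 18, 'G': 15}
```

So, as far as I can tell, 0/1 here means one copy of the REF base (T, 18 reads) and one copy of the ALT base (G, 15 reads) in the sample, and says nothing about the genotype of whoever the reference was built from.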
Flipping this around, let’s say the NA12878 individual was used as the reference. At position 10001019 on chromosome 20, which allele would be the REF? Would it be the T or the G, since the person is heterozygous at this site?