How to get sample names and genotype for SNP in multi-sample VCF file

Last update: May 9, 2021

You must normalise your VCF / BCF first; otherwise, this script will not work as expected. You can do this with:

bcftools norm -m-any MyVariants.vcf -Ov > MyVariants.Norm.vcf

I probably should explain what’s going on here, too: It is divided into 4 parts (each part indicated by the starting <(bcftools on each line):

  1. First awk command: outputs the first few columns from the VCF/BCF

  2. Three BCFtools query commands: tabulate the number of samples having each variant type (as you
    requested). This will work for phased and/or un-phased variants. The output is the 3 columns named nHet, nHomAlt, nHomRef

  3. Three BCFtools view commands: look through the file again, saving sample names into an
    array called ‘header‘, and then printing the indices of the array
    where a particular field (i.e. sample) has a particular type of
    variant. This is repeated for: a, heterozygous (HetSamples), b, homozygous (alt) (HomSamplesAlt), and
    c, homozygouse (ref) (HomSamplesRef).

I’ve tested it and verified results on a handful of 1000 Genome variants. I strongly encourage you to do some rigorous testing.

paste <(bcftools view MyVariants.Norm.vcf |
    awk -F"t" 'BEGIN {print "CHRtPOStIDtREFtALT"} 
      !/^#/ {print $1"t"$2"t"$3"t"$4"t"$5}') 
    
  <(bcftools query -f '[t%SAMPLE=%GT]n' MyVariants.Norm.vcf |
    awk 'BEGIN {print "nHet"} {print gsub(/0|1|1|0|0/1|1/0/, "")}') 
    
  <(bcftools query -f '[t%SAMPLE=%GT]n' MyVariants.Norm.vcf |
    awk 'BEGIN {print "nHomAlt"} {print gsub(/1|1|1/1/, "")}') 
    
  <(bcftools query -f '[t%SAMPLE=%GT]n' MyVariants.Norm.vcf |
    awk 'BEGIN {print "nHomRef"} {print gsub(/0|0|0/0/, "")}') 
    
  <(bcftools view MyVariants.Norm.vcf | awk -F"t" '/^#CHROM/ {split($0, header, "t"); print "HetSamples"} 
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0|1|1|0|0/1|1/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "n"}}}') 
    
  <(bcftools view MyVariants.Norm.vcf | awk -F"t" '/^#CHROM/ {split($0, header, "t"); print "HomSamplesAlt"} 
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1|1|1/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "n"}}}') 
    
  <(bcftools view MyVariants.Norm.vcf | awk -F"t" '/^#CHROM/ {split($0, header, "t"); print "HomSamplesRef"} 
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0|0|0/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "n"}}}') 
    
  | sed 's/,t/t/g' | sed 's/,$//g'


CHR POS     ID                  REF ALT nHet nHomAlt nHomRef    HetSamples  HomSamplesAlt   HomSamplesRef
1   10177   1:10177:10177:A:AC  A   AC  4   0   1   HG00096,HG00097,HG00099,HG00100     HG00101
1   10235   1:10235:10235:T:TA  T   TA  0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   10352   1:10352:10352:T:TA  T   TA  5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   10616   1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C  CCGCCGTTGCAAAGGCGCGCCG  C   0   5   0       HG00096,HG00097,HG00099,HG00100,HG00101 
1   10642   1:10642:10642:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11008   1:11008:11008:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11012   1:11012:11012:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11063   1:11063:11063:T:G   T   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13110   1:13110:13110:G:A   G   A   1   0   4   HG00097     HG00096,HG00099,HG00100,HG00101
1   13116   1:13116:13116:T:G   T   G   2   0   3   HG00097,HG00101     HG00096,HG00099,HG00100
1   13118   1:13118:13118:A:G   A   G   2   0   3   HG00097,HG00101     HG00096,HG00099,HG00100
1   13273   1:13273:13273:G:C   G   C   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13284   1:13284:13284:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13380   1:13380:13380:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13483   1:13483:13483:G:C   G   C   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13494   1:13494:13494:A:G   A   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13550   1:13550:13550:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   14464   1:14464:14464:A:T   A   T   1   1   3   HG00099 HG00096 HG00097,HG00100,HG00101
1   14599   1:14599:14599:T:A   T   A   2   0   3   HG00097,HG00099     HG00096,HG00100,HG00101
1   14604   1:14604:14604:A:G   A   G   2   0   3   HG00097,HG00099     HG00096,HG00100,HG00101
1   14930   1:14930:14930:A:G   A   G   5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   14933   1:14933:14933:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15211   1:15211:15211:T:G   T   G   5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   15245   1:15245:15245:C:T   C   T   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15274   1:15274:15274:A:G   A   G   3   0   2   HG00096,HG00100,HG00101     HG00097,HG00099
1   15274   1:15274:15274:A:T   A   T   3   2   0   HG00096,HG00100,HG00101 HG00097,HG00099 

Read more here: Source link