Last update: May 9, 2021
You must normalise your VCF / BCF first; otherwise, this script will not work as expected. You can do this with:
bcftools norm -m-any MyVariants.vcf -Ov > MyVariants.Norm.vcf
I probably should explain what’s going on here, too: It is divided into 4 parts (each part indicated by the starting <(bcftools
on each line):
-
First
awk
command: outputs the first few columns from the VCF/BCF -
Three
BCFtools query
commands: tabulate the number of samples having each variant type (as you
requested). This will work for phased and/or un-phased variants. The output is the 3 columns namednHet
,nHomAlt
,nHomRef
-
Three
BCFtools view
commands: look through the file again, saving sample names into an
array called ‘header‘, and then printing the indices of the array
where a particular field (i.e. sample) has a particular type of
variant. This is repeated for: a, heterozygous (HetSamples
), b, homozygous (alt) (HomSamplesAlt
), and
c, homozygouse (ref) (HomSamplesRef
).
I’ve tested it and verified results on a handful of 1000 Genome variants. I strongly encourage you to do some rigorous testing.
paste <(bcftools view MyVariants.Norm.vcf |
awk -F"t" 'BEGIN {print "CHRtPOStIDtREFtALT"}
!/^#/ {print $1"t"$2"t"$3"t"$4"t"$5}')
<(bcftools query -f '[t%SAMPLE=%GT]n' MyVariants.Norm.vcf |
awk 'BEGIN {print "nHet"} {print gsub(/0|1|1|0|0/1|1/0/, "")}')
<(bcftools query -f '[t%SAMPLE=%GT]n' MyVariants.Norm.vcf |
awk 'BEGIN {print "nHomAlt"} {print gsub(/1|1|1/1/, "")}')
<(bcftools query -f '[t%SAMPLE=%GT]n' MyVariants.Norm.vcf |
awk 'BEGIN {print "nHomRef"} {print gsub(/0|0|0/0/, "")}')
<(bcftools view MyVariants.Norm.vcf | awk -F"t" '/^#CHROM/ {split($0, header, "t"); print "HetSamples"}
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0|1|1|0|0/1|1/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "n"}}}')
<(bcftools view MyVariants.Norm.vcf | awk -F"t" '/^#CHROM/ {split($0, header, "t"); print "HomSamplesAlt"}
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1|1|1/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "n"}}}')
<(bcftools view MyVariants.Norm.vcf | awk -F"t" '/^#CHROM/ {split($0, header, "t"); print "HomSamplesRef"}
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0|0|0/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "n"}}}')
| sed 's/,t/t/g' | sed 's/,$//g'
CHR POS ID REF ALT nHet nHomAlt nHomRef HetSamples HomSamplesAlt HomSamplesRef
1 10177 1:10177:10177:A:AC A AC 4 0 1 HG00096,HG00097,HG00099,HG00100 HG00101
1 10235 1:10235:10235:T:TA T TA 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 10352 1:10352:10352:T:TA T TA 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 10616 1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0 5 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 10642 1:10642:10642:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11008 1:11008:11008:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11012 1:11012:11012:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11063 1:11063:11063:T:G T G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13110 1:13110:13110:G:A G A 1 0 4 HG00097 HG00096,HG00099,HG00100,HG00101
1 13116 1:13116:13116:T:G T G 2 0 3 HG00097,HG00101 HG00096,HG00099,HG00100
1 13118 1:13118:13118:A:G A G 2 0 3 HG00097,HG00101 HG00096,HG00099,HG00100
1 13273 1:13273:13273:G:C G C 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13284 1:13284:13284:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13380 1:13380:13380:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13483 1:13483:13483:G:C G C 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13494 1:13494:13494:A:G A G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13550 1:13550:13550:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 14464 1:14464:14464:A:T A T 1 1 3 HG00099 HG00096 HG00097,HG00100,HG00101
1 14599 1:14599:14599:T:A T A 2 0 3 HG00097,HG00099 HG00096,HG00100,HG00101
1 14604 1:14604:14604:A:G A G 2 0 3 HG00097,HG00099 HG00096,HG00100,HG00101
1 14930 1:14930:14930:A:G A G 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 14933 1:14933:14933:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 15211 1:15211:15211:T:G T G 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 15245 1:15245:15245:C:T C T 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 15274 1:15274:15274:A:G A G 3 0 2 HG00096,HG00100,HG00101 HG00097,HG00099
1 15274 1:15274:15274:A:T A T 3 2 0 HG00096,HG00100,HG00101 HG00097,HG00099
Read more here: Source link