Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

NB – Update July 29, 2020 – this thread will no longer be watched and, for all intents and purposes, will now be archived

NB – Version 2 of tutorial can be found here and should be used going forward –> Produce PCA bi-plot for 1000 Genomes Phase III – Version 2


My own process for producing a PCA bi-plot from the 1000 Genomes Phase III is below. In R, you’ll have to sort out the plot colours yourself. The PED file contains sample ID to ethnicity mappings.

The protocol changes if you want to merge your own data to the 1000 Genomes. I’ve done this many times and have developed my own predictive ethnic model (>99% sensitivity).

NB – requires at least plink 1.9

data no longer available

wget ;
  • Ensure that multi-allelic calls are split (1st part)
  • Left-align indels (2nd part)
  • Set the ID field to a unique value: CHROM:POS:REF:ALT (3rd part)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

 for chr in {1..22}; do
    bcftools norm -m-any ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | 
    bcftools norm --check-ref w -f human_g1k_v37.fasta | 
    bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
for chr in {1..22}; do
    plink --noweb --bcf ALL.chr$chr.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf 
    --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed 
    --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

NB – This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

If you kept rs IDs in your input files (step 3), then duplicates (yes, they exist in 1000 Genomes and dbSNP!) will be found. If you change the ID to CHROM:POS:REF:ALT, then no duplicates will be found.

mkdir DupsRemoved ;

for chr in {1..22}; do
    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes 
    --list-duplicate-vars ids-only ;

    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes 
    --exclude plink.dupvar --make-bed 
    --out DupsRemoved/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    rm plink.dupvar ;
--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium – takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

mkdir Pruned ;

for chr in {1..22}; do
    plink --noweb --bfile DupsRemoved/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes 
    --maf 0.10 --indep 50 5 1.5 
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    plink --noweb --bfile DupsRemoved/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes 
    --extract Pruned/ALL.chr"${chr}" --make-bed 
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/.bim//g' ForMerge.list ;
plink --merge-list ForMerge.list --out Merge ;
plink --bfile Merge --cluster --mds-plot 10

plink --bfile Merge --pca

options(scipen=100, digits=3)

#Read in the eigenvectors
eigen <- data.frame(read.table("plink.eigenvec", header=FALSE, skip=0, sep=" "))
rownames(eigen) <- eigen[,2]
eigen <- eigen[,3:ncol(eigen)]


#Determine the proportion of variance of each component
proportionvariances <- ((apply(eigen, 1, sd)^2) / (sum(apply(eigen, 1, sd)^2)))*100

plot(eigen[,1], eigen[,2])

[need to manage colours and layout yourself]

Read more here: Source link