Produce PCA bi-plot for 1000 Genomes Phase III

Note1 – Previous version: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

Note2 – this data is for hg19 / GRCh37

Note3 – GRCh38 data is available HERE

This tutorial has been updated to use the 1000 Genomes Phase III imputed genotypes. The original tutorial used non-imputed data held at the University of Washington, which is no longer accessible.

Other changes:

  • tutorial now entirely streamlined – all commands, including in R, are now included
  • duplicate variants are now removed with BCFtools, not PLINK (previous
    Step 6 removed)
  • now only performs PCA (originally, MDS was also performed but never used)
  • no longer using chrX variants (only autosomal variants)
  • new Step 3, indicating how to download the 1000 Genomes GRCh37
    reference build

Program requirements:

  • PLINK (v1.9 or later)
  • BCFtools (tested on v 1.3)

Disk space requirements:

  • downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
  • converted BCF files and their indices, ~14 GB
  • binary PLINK files, ~53 GB
  • pruned PLINK binary files, < 1 GB
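
Before starting, it may be worth confirming that both programs are on the PATH and that the working directory has enough free space (the figures above sum to roughly 84 GB). The following is only a minimal sketch, assuming a POSIX shell; have_tool is a helper name invented here and is not part of PLINK or BCFtools.

```shell
# Hypothetical helper: succeeds if the named program is on the PATH
have_tool() {
    command -v "$1" > /dev/null 2>&1
}

# Report any required program that cannot be found
for tool in plink bcftools; do
    have_tool "$tool" || echo "missing: $tool" >&2
done

# Free space (in GB) on the filesystem holding the current directory
df -Pk . | awk 'NR == 2 { printf "free: %.1f GB\n", $4 / 1024 / 1024 }'
```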
prefix="" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
    wget "${prefix}${chr}${suffix}" "${prefix}${chr}${suffix}.tbi" ;
done
wget ;
wget ;

wget ;

gunzip human_g1k_v37.fasta.gz ;

NB – if wget is not working, try curl:

curl -O human_g1k_v37.fasta.gz
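
Because the downloads are large, a partial or failed transfer is easy to miss. As a rough check, the sketch below lists any per-chromosome VCF or index file that is absent or empty. It assumes the files are named ALL.chr[1-22] plus the suffix defined above (as in the later commands); missing_downloads is a made-up helper name.

```shell
suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"

# Hypothetical helper: print each expected file that is missing or empty in directory $1
missing_downloads() {
    dir="${1:-.}"
    for chr in $(seq 1 22); do
        for f in "ALL.chr${chr}${suffix}" "ALL.chr${chr}${suffix}.tbi"; do
            [ -s "${dir}/${f}" ] || echo "${f}"
        done
    done
}

# No output means all 44 files are present and non-empty
missing_downloads .
```

This only confirms that the files exist; it does not validate their contents.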
  • Ensures that multi-allelic calls are split and that indels are left-aligned against the reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT
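
To make the ID scheme concrete, here is a toy illustration that mimics the effect on a single record using awk. It is not how BCFtools works internally, and the record is made up; make_id is a hypothetical helper name.

```shell
# Hypothetical helper: overwrite column 3 (ID) of tab-separated VCF body lines
# with CHROM:POS:REF:ALT; header lines (starting with #) pass through unchanged
make_id() {
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { $3 = $1 ":" $2 ":" $4 ":" $5 } { print }'
}

# Made-up record with columns CHROM, POS, ID, REF, ALT
printf '1\t1000\trs0000\tA\tG\n' | make_id
# prints (tab-separated): 1  1000  1:1000:A:G  A  G
```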

for chr in {1..22}; do
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
      ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz | \
      bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
        bcftools norm -Ob --rm-dup both \
          > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf ;
done
# --make-bed writes the binary fileset and is required alongside --split-x
for chr in {1..22}; do
    plink --noweb \
      --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf \
      --vcf-idspace-to _ \
      --allow-extra-chr 0 \
      --split-x b37 no-fail \
      --make-bed \
      --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;
done

NB – This step (Step 6, filtering for coding-strand variants) is only for microarray studies, where the probes may target only one strand or the other (sense or antisense)

--maf 0.10, only retain SNPs with MAF of at least 10%
--indep [window size] [step size (variant count)] [variance inflation factor (VIF) threshold]

e.g. --indep 50 5 1.5 generates a list of markers in approximate linkage equilibrium: it takes a window of 50 SNPs, prunes within it, and then shifts the window by 5 SNPs. The VIF, 1/(1-r^2), is the cut-off for linkage disequilibrium; variants whose VIF exceeds the threshold are pruned.
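
To get a feel for the VIF threshold: since VIF = 1/(1-r^2), a threshold converts back to the corresponding r^2 as 1 - 1/VIF. Here r^2 is the multiple correlation coefficient from regressing one variant on the others in the window. The arithmetic can be sketched with awk (vif_to_r2 is a helper name invented for this example):

```shell
# Hypothetical helper: convert a VIF threshold to the equivalent r^2 value
vif_to_r2() {
    awk -v vif="$1" 'BEGIN { printf "%.3f\n", 1 - 1 / vif }'
}

vif_to_r2 1.5   # prints 0.333
vif_to_r2 2     # prints 0.500
```

So the threshold of 1.5 used here prunes a variant once about a third of its variance is explained by the other variants in the window.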

mkdir Pruned ;

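for chr in {1..22}; do
    # identify variants in approximate linkage equilibrium; writes <out>.prune.in and <out>.prune.out
    plink --noweb \
      --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes \
      --maf 0.10 --indep 50 5 1.5 \
      --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;

    # keep only the retained variants and write a new binary fileset
    plink --noweb \
      --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes \
      --extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.prune.in \
      --make-bed \
      --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;
done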
find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/\.bim$//' ForMerge.list ;
plink --merge-list ForMerge.list --out Merge ;

NB - if you have your own data that you want to merge with 1000 Genomes

Process the 1000 Genomes data as per this tutorial from Steps 1-9. In this way, you will have already identified the population-specific variants / markers. Then, after Step 9:

  • find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
  • merge the 1000 Genomes data with your own data
  • proceed to Step 10

Depending on its size, your own dataset may be divided by chromosome; so, you may have to do some pre-processing before aligning to 1000 Genomes. Either way, the population-specific markers will be defined by just the 1000 Genomes dataset (Step 7). If your dataset is from a microarray, you’ll have to pre-filter it for coding (plus / +) strand variants (Step 6).
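
One way to find the common variants is to intersect the variant IDs (column 2) of the two .bim files. The sketch below assumes that both datasets use the same ID scheme (here CHROM:POS:REF:ALT) and the same genome build. common_variants, MyData and the output names are placeholders invented for illustration.

```shell
# Hypothetical helper: print variant IDs (column 2 of a .bim file) found in both files
common_variants() {
    awk 'NR == FNR { seen[$2] = 1; next } ($2 in seen) { print $2 }' "$1" "$2"
}

# Possible usage (file names are placeholders):
# common_variants Merge.bim MyData.bim > common.snps
# plink --bfile Merge  --extract common.snps --make-bed --out Merge.common
# plink --bfile MyData --extract common.snps --make-bed --out MyData.common
# plink --bfile Merge.common --bmerge MyData.common --make-bed --out Merge.withMyData
```

Matching on ID alone does not detect allele or strand mismatches, so PLINK may still report problems at the merge stage.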

plink --bfile Merge --pca
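
By default, --pca writes the top 20 eigenvectors to plink.eigenvec and the matching eigenvalues, one per line, to plink.eigenval. For axis labels it can be handy to express each eigenvalue as a percentage. The sketch below does this relative to the sum of the reported eigenvalues only, so the figures are not shares of the total genetic variance. pct_of_reported is a made-up helper name.

```shell
# Hypothetical helper: print each eigenvalue as a percentage of the sum of all
# eigenvalues listed in the file (one eigenvalue per line)
pct_of_reported() {
    awk '{ v[NR] = $1; s += $1 }
         END { for (i = 1; i <= NR; i++) printf "PC%d\t%.2f\n", i, 100 * v[i] / s }' "$1"
}

# Possible usage:
# pct_of_reported plink.eigenval
```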

options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')

# read in the PED data
PED <- read.table('20130606_g1k.ped', header = TRUE, skip = 0, sep = '\t')
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
# check that the PED rows are in the same order as the eigenvectors (expect TRUE)
all(PED$Individual.ID == rownames(eigenvec))

# set colours

# from:
PED$Population <- factor(PED$Population, levels=c(

col <- colorRampPalette(c(

# generate PCA bi-plots
project.pca <- eigenvec

par(mar = c(5,5,5,5), cex = 2.0,
  cex.main = 7, cex.axis = 2.75, cex.lab = 2.75, mfrow = c(1,2))

plot(project.pca[,1], project.pca[,2],
  main = 'A',
  adj = 0.5,
  xlab = 'First component',
  ylab = 'Second component',
  font = 2,
  font.lab = 2)
points(project.pca[,1], project.pca[,2], col = col, pch = 20, cex = 2.25)
# the legend position is an assumption; adjust to suit the plot
legend('bottomright',
  bty = 'n',
  cex = 3.0,
  legend = c('Population 1', 'Population 2', 'Population 3',
    'Population 4', 'Population 5'),
  fill = c('yellow', 'forestgreen', 'grey', 'royalblue', 'black'))

plot(project.pca[,1], project.pca[,3],
  xlab = "First component",
  ylab = "Third component")
points(project.pca[,1], project.pca[,3], col = col, pch = 20, cex = 2.25)


