Phasing with SHAPEIT

Edit June 7, 2020:

The code below is for pre-phasing with SHAPEIT2. For phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs, see my answer here: A: ERROR: You must specify a valid interval for imputation using the -int argument,

So, the steps are usually:

  1. pre-phasing into pre-existing haplotypes available from HERE (
    C: Phasing with SHAPEIT )
  2. phased imputation and generation of phaed VCFs (
    A: ERROR: You must specify a valid interval for imputation using the -int argument, )

—————-

Thanks – good to know that there is now a GRCh38 version of that data! – I utilise GRCh37, here: Produce PCA bi-plot for 1000 Genomes Phase III – Version 2

If your Illumina microarray is a special design, then it may only target very specific regions and have minimal overlap.

Otherwise, as mentioned, the use of SHAPEIT involves a 3 step process:

  1. checking overlap of your data with the reference panel (via
    -check)
  2. removal of problematic variants (also via check)
  3. pre-phase using filtered input data

1, first run a QC check (will throw error)

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial 
      shapeit 
        -check 
        -B temp 
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt 
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample 
        --output-log Prephased/MyData_chr"${chr}"_alignments 
        -T 8 ;
  fi
done ;
rm temp.* ;

This should generate files with extensions _alignments.snp.strand.exclude. Use these in the next step via --exclude-snp:

2, exclude problematic variants that were found

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial 
      shapeit 
        -check 
        -B temp 
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt 
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample 
        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude 
        -T 8 ;
  fi
done ;
rm temp.* ;

This should now run to completion and not return any error.

NB – this does not actually remove the variants from your data. It just excludes them when SHAPEIT is trying to determine the alignment between your data and the reference. If this command runs to completion without error, then you can proceed to the next step, #3

3, now perform pre-phasing

Here, we again instruct SHAPEIT to not include the problematic variants. Ultimately, these will therefore be lost from the dataset from this point.

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=12 --cpus-per-task=8 --partition=serial 
      shapeit 
        -B temp 
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt 
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample 
        --exclude-snp Prephased/GSA_QCd_chr"${chr}"_alignments.snp.strand.exclude 
        -O Prephased/MyData_chr"${chr}"_1KGphased 
        -T 8 ;
  fi
done ;
rm temp.* ;

Read more here: Source link