command not found, in IMPUTE2

Edit June 7, 2020:

The code below is for phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs. For the initial pre-phasing process with SHAPEIT2, see my answer here: Phasing with SHAPEIT

So, the steps are usually:

  1. pre-phasing into pre-existing haplotypes available from HERE (Phasing with SHAPEIT)
  2. phased imputation and generation of phased VCFs (ERROR: You must specify a valid interval for imputation using the -int argument, -use_prephased_g: command not found, in IMPUTE2)


Hi,

I gave the answer in the other thread, regarding the pre-phasing of data using SHAPEIT2 ( Phasing with SHAPEIT ). I can see that you are now a different user (?) who is doing the next step, i.e., the imputation, using the pre-phased haplotypes?

Unless you have a stick of RAM that’s the size of the Sun, you will indeed have to do the imputation in chunks, which means that you also need to know the lengths of your chromosomes. This can be achieved via shell scripting. Here is how I did it for interval (‘chunk’) sizes of 5 megabases (5 million bases):

for chr in {1..22}; do
  case "${chr}" in
    1)
      max=249250621
    ;;
    2)
      max=243199373
    ;;
    3)
      max=198022430
    ;; 
    4)
      max=191154276
    ;;
    5)
      max=180915260
    ;;
    6)
      max=171115067
    ;;
    7)
      max=159138663
    ;;
    8)
      max=146364022
    ;;
    9)
      max=141213431
    ;;
    10)
      max=135534747
    ;;
    11)
      max=135006516
    ;;
    12)
      max=133851895
    ;;
    13)
      max=115169878
    ;;
    14)
      max=107349540
    ;;
    15)
      max=102531392
    ;;
    16)
      max=90354753
    ;;
    17)
      max=81195210
    ;;
    18)
      max=78077248
    ;;
    20)
      max=63025520
    ;;
    19)
      max=59128983
    ;;
    22)
      max=51304566
    ;;
    21)
      max=48129895
    ;;
  esac

  chunk=1 ;
  interval=5000000 ;
  start=0 ;
  end="${interval}" ;

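  # Loop while the chunk start is still inside the chromosome, so that the
  # final (partial) chunk is imputed, too.
  while [ "${start}" -lt "${max}" ] ;
  do
    # The trailing backslashes join the command into a single line. Without
    # them, the shell runs each flag as a separate command, which gives
    # "-use_prephased_g: command not found" and the missing -int error.
    # NB: Slurm reads a bare --mem value as megabytes; adjust for your cluster.
    srun --mem=32 --cpus-per-task=32 --partition=serial \
      impute \
        -phase \
        -use_prephased_g \
        -known_haps_g Prephased/GSA_QCd_chr"${chr}"_1KGphased.haps \
        -strand_g GSA/GSA_strandinfo_chr"${chr}".list \
        -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        -h library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz \
        -l library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz \
        -align_by_maf_g \
        -int $((start+1)) "${end}" \
        -Ne 20000 \
        -o Imputed_Phased/GSA_chr"${chr}"_chunk"${chunk}"_1KG ;

    start=$((start+interval)) ;
    end=$((end+interval)) ;
    chunk=$((chunk+1)) ;

    # Report the next chunk that will be processed
    echo "${chr}" "${start}" "${end}" "${chunk}" ;
  done ;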
done ;
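
Before submitting any jobs, it can help to print the chunk boundaries for a chromosome and check that they cover it from the first base to the last. The following is only a minimal sketch: print_chunks is a hypothetical helper (not part of IMPUTE2), and it clamps the final interval to the chromosome length, which is my assumption about what you would want here.

```shell
#!/usr/bin/env bash

# print_chunks MAX [INTERVAL]
# Hypothetical helper: prints "chunk start end" for each interval that
# covers positions 1..MAX. INTERVAL defaults to 5,000,000 bases.
print_chunks() {
  local max="$1" interval="${2:-5000000}"
  local chunk=1 start=0 end

  while [ "${start}" -lt "${max}" ]; do
    end=$((start + interval))
    # Clamp the last interval to the chromosome length.
    if [ "${end}" -gt "${max}" ]; then
      end="${max}"
    fi
    echo "${chunk} $((start + 1)) ${end}"
    start=$((start + interval))
    chunk=$((chunk + 1))
  done
}
```

For example, print_chunks 48129895 lists the intervals for chromosome 21; the second and third columns are the values that would be passed to -int.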

I got the chromosome lengths from the fai file that’s produced from samtools faidx for the GRCh37 1000 Genomes FASTA reference genome. You can see the link for this genome in step 3, here: Produce PCA bi-plot for 1000 Genomes Phase III – Version 2
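
If you would rather not hard-code the lengths, they can be read directly from the fai file, because its second column is the sequence length. This is just a sketch: chr_length is a hypothetical helper, the fai path is whatever your own reference uses, and I am assuming contig names without a 'chr' prefix (1, 2, ..., 22).

```shell
#!/usr/bin/env bash

# chr_length FAI CHR
# Hypothetical helper: prints the length of contig CHR from a samtools
# faidx index. The .fai columns are: name, length, offset, linebases,
# linewidth.
chr_length() {
  local fai="$1" chr="$2"
  awk -v c="${chr}" '$1 == c { print $2; exit }' "${fai}"
}
```

Inside the loop over chromosomes, max="$(chr_length reference.fa.fai "${chr}")" could then stand in for the case statement, where reference.fa.fai is a placeholder name.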

Also note that I add the -phase parameter, which will perform a phased imputation. With your code, an un-phased imputation will be performed. Some of your other parameters differ from mine, so, please check those.

Once your imputation is complete, you can convert the resulting haps files to vcf via:

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

After that, you’ll need BCFtools commands (e.g., bcftools concat) to piece the per-chunk VCFs back together for each chromosome, which will take more time and RAM.
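
One thing to watch when piecing the chunks together is their order: a plain glob sorts chunk10 before chunk2. Here is a rough sketch of building a correctly ordered file list. list_chunk_vcfs is a hypothetical helper, and the file naming (GSA_chr1_chunk1_1KG.vcf.gz, etc.) is an assumption about how you named the converted VCFs.

```shell
#!/usr/bin/env bash

# list_chunk_vcfs DIR CHR
# Hypothetical helper: prints the per-chunk VCFs for chromosome CHR in
# numeric chunk order. The file naming pattern is an assumption.
list_chunk_vcfs() {
  local dir="$1" chr="$2" f n
  for f in "${dir}"/GSA_chr"${chr}"_chunk*_1KG.vcf.gz; do
    [ -e "${f}" ] || continue   # nothing matched the glob
    n="${f##*_chunk}"           # strip everything up to "_chunk"
    n="${n%%_*}"                # keep only the chunk number
    printf '%s\t%s\n' "${n}" "${f}"
  done | sort -n -k1,1 | cut -f2-
}

# Possible usage (paths are placeholders):
#   list_chunk_vcfs VCF 1 > chr1.list
#   bcftools concat --file-list chr1.list -Oz -o GSA_chr1_1KG.vcf.gz
```

bcftools concat expects every input file to contain the same samples in the same order, which should hold here because all chunks come from the same pre-phased data.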

Trust that this assists you.

Kevin
