So many variants detected.

So many variants detected.

0

Dear All,

I have done variant calling in Germline data that has single sample of each individual and two genes.

I did following steps, but after checking results I found too many variants.

After Haplotypecaller (the step 6) I found 140900 known variants, and the steps 7 and 8 didn’t filter any variants.

What should be causing this result?

1. index the reference fasta (hg19.fa)

    bwa index hg19.fa



2. mapping to the reference genome (hg19.fa)
   I randomly gave read group information. 


    bwa mem -t 16 -R '@RGtID:21002tSM:Sample_21002' hg19.fa ../../Trim_final/21002.1.FASTP.fastq.gz  ../../Trim_final/21002.2.FASTP.fastq.gz > 21002.bam


3. picard sort bam file by queryname



     java -jar ~/applications/picard-tools-2.23.0/picard.jar SortSam I=21002.bam  O=21002.bam.PICARD.output.sorted.queryname.bam SORT_ORDER=queryname


4. mark duplicates by MarkDuplicatesSpark

    gatk --java-options "-Xmx5g" MarkDuplicatesSpark -I 21002.bam.PICARD.output.sorted.queryname.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam --tmp-dir ./Test/Tmps/

5. Base (Quality Score) Recalibration

    Tools involved: BaseRecalibrator, Apply Recalibration

    ##first run BaseRecalibrator and later run ApplyBQSR using a table produced by BaseRecalibrator and a bam file produced at the step4.

    ###make an index of dbsnp_138.hg19.vcf

    gatk IndexFeatureFile -F  dbsnp_138.hg19.vcf ## this will generate dbsnp_138.hg19.vcf.idx 


    ##### add read group and platform to bam files;

    java -jar ~/applications/picard-tools-2.23.0/picard.jar AddOrReplaceReadGroups I=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam  O=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam RGID=21002  RGLB=21002lib1 RGPL=ILLUMINA RGPU=21002 RGSM=20 



#####BaseRecalibrator


    gatk BaseRecalibrator -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam  --known-sites dbsnp_138.hg19.vcf  --known-sites ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.recal_data.table

  #####ApplyBQSR

    gatk ApplyBQSR -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam --bqsr-recal-file 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.recal_data.table -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam


6. HaplotypeCaller (single sample mode)

gatk HaplotypeCaller --bam-output 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.HaplotypeCaller.OUTPUT.bam --native-pair-hmm-threads 64 -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz -D dbsnp_138.hg19.vcf


7. CNNScoreVariants




      gatk CNNScoreVariants -R hg19.fa -V 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf

8. FilterVariantTranches




       gatk --java-options "-Xmx7g" FilterVariantTranches -V 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf --resource ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz --resource ../dbsnp_138.hg19.vcf --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf.FilterVariantTranches.OUTPUT.vcf


gatk4


variants

• 12 views


Login
before adding your answer.

Traffic: 2373 users visited in the last hour

Source link