Bioinformatics Stack Exchange is a question and answer site for researchers, developers, students, teachers, and end users interested in bioinformatics. It only takes a minute to sign up.
Anybody can ask a question
Anybody can answer
The best answers are voted up and rise to the top
Asked
Viewed
20 times
I have been struggling to get the code I wrote for running GATK g CNV on my Linux machine (Steps Here: gatk.broadinstitute.org/hc/en-us/articles/360035531152 ) to work. Specifically it seems to abort a run before running the main code (I even removed my DetermineGermlineContigPloidy step to no avail). Below is my code and then the error:
## Paths To Be Set
echo "My paths have been set"
### Set Bam Directory:
#### Make sure to set a seperate directory for the batch you are doing
BAM_DIR=/pathToThe/Batch_I_Am_Running
### Set the path to the reference genome:
reference=/pathToMy/primary_assembly.fa
#model_dir =/path/to/my/GATK-gCNV
### Set the path to the singularity image:
SINGULARITY_CACHEDIR=/path/to/my/GATK
singularity_image=$SINGULARITY_CACHEDIR/gatk_latest.sif
### Set Number Of Bins
#### Normally 100 to 200*
NUM_BINS=200
### Setting the number of threads ?? Like Cores ??
NUM_THREADS=20
#______________________________________________________________________________________________________________________________________________________________
## Main code
echo "I am about to start the main code"
for c in "${CHROMOSOME_NAMES[@]}"; do
# loop over the directories containing the BAM files
for dir in "${BAM_DIR}"/*/; do
# get the name of the directory without the path
dir_name="$(basename "${dir}")"
# get the name of the BAM file without the extension
bam_name="$(basename "${dir}"/*.bam .bam)"
echo "Processing file: $bam_name"
echo "I am starting Step 1: Preprocessing"
# Step 1: Preprocessing
singularity exec $singularity_image PreprocessIntervals \
-R ${reference} \
-I ${dir}/${bam_name}.bam \
-L targets.interval_list \
--bin-length 0 \
-imr OVERLAPPING_ONLY \
-O targets.preprocessed.interval_list
echo "I am starting Step 2: AnnotateIntervals with GC content"
# Step 2: AnnotateIntervals with GC content
singularity exec $singularity_image AnnotateIntervals \
-L chr${c}.interval_list \
-I ${dir}/${bam_name}.bam \
-R ${reference} \
-imr OVERLAPPING_ONLY \
-O chr${c}.annotated.tsv
echo "I am starting Step 3: CollectReadCounts"
# Step 3: Collect read counts for the samples
singularity exec $singularity_image CollectReadCounts \
-I ${dir}/${bam_name}.bam \
-L targets.preprocessed.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
--format TSV \
-O ${dir_name}.${bam_name}.counts.tsv
echo "I am starting Step 4: FilteringIntervals based on GC-content & read depth"
# Step 4: FilterIntervals based on GC-content and read depth
singularity exec $singularity_image FilterIntervals \
-L chr${c}.interval_list \
--annotated-intervals chr${c}.annotated.tsv \
-imr OVERLAPPING_ONLY \
"${dir}"/*.counts.tsv \
--max-panel-norm 1.5 \
--min-panel-norm 0.3 \
-O chr${c}.filtered.interval_list
echo "Not doing Step 5 DetermineGermlineContigPloidy for now"
# Step 5: DetermineGermlineContigPloidy (Mandatory Step)
# singularity exec $singularity_image DetermineGermlineContigPloidy \
# --model-parameters ${model_dir}/model_parameters.txt \
# --contig-ploidy-calls ${model_dir}/contig_ploidy_calls.tsv \
# --contig-ploidy-priors chr20XY_contig_ploidy_priors.tsv \
# --output-prefix ${model_dir}/ploidy \
# --standardized-output \
# $(find "${output_dir}" -name "*.calls.tsv" -printf '%p ') \
# for file in ${input_dir}/*.tsv
# do
# Get sample ID from filename
# sample=$(basename ${file} .tsv)
# # Run DetermineGermlineContigPloidy command
# singularity exec ${singularity_image} DetermineGermlineContigPloidy \
# --input ${file} \
# --contig-ploidy-priors ${contig_ploidy_priors} \
# --output-prefix ${output_dir}/${sample}_ploidy \
# --standardized-output
# done
echo "Creating shards via bins"
# Creating shards via bins
singularity exec $singularity_image IntervalListTools \
SplitIntervals \
-R ${reference} \
-L chr${c}.interval_list \
-scatter ${NUM_BINS} \
-o ${dir}/${c}_shard.interval_list
echo "Starting the Shard Loop that will run GermlineCNVCaller and PostprocessGermlineCNVCalls"
# Loop over the shards and run GermlineCNVCaller and PostprocessGermlineCNVCalls
for shard_file in "${dir}/${c}"*_shard.interval_list; do
echo "Starting Step 6: Run GermlineCNVCaller (Mandatory Step)"
# Step 6: Run GermlineCNVCaller (Mandatory Step)
singularity exec $singularity_image GermlineCNVCaller \
--run-mode COHORT \
-L "${shard_file}" \
-I $(find "${dir}" -name "*.tsv" -printf '%p ') \
--minimum-interval-median-percentile 5.0 \
--maximum-zeros-in-sample-percentage 40.0 \
--annotation-cutoffs "QD=2.0" "SOR=2.0" "FS=10.0" "MQ=50.0" \
--interval-merging-rule OVERLAPPING_ONLY \
--output-prefix $(basename "${shard_file}" .interval_list)
echo "Starting Step 7: Run PostprocessGermlineCNVCalls (Mandatory Step)"
# Step 7: Run PostprocessGermlineCNVCalls (Mandatory Step)
singularity exec $singularity_image PostprocessGermlineCNVCalls \
--model-shard-path "${shard_file}" \
--cnv-call-path $(basename "${shard_file}" .interval_list).calls.tsv \
--output-dir postprocess-sm \
--scatter-count ${NUM_THREADS} \
--num-scatter-plot-columns 8 \
--max-zeros-in-sample-percentage 40.0 \
--annotation-cutoffs "QD=2.0" "SOR=2.0" "FS=10.0" "MQ=50.0" \
--interval-merging-rule OVERLAPPING_ONLY \
--minimum-contig-length 46709983 \
--minimum-interval-median-percentile 5.0
done
done
done
#______________________________________________________________________________________________________________________________________________________________
The “.out” file I received after the run aborted is:
My paths have been set
I am about to start the main code
Why would it abort so soon (I changed the number of Gs used and that did not have any effect)?
$\endgroup$
4
Read more here: Source link