MarkDuplicatesSpark: how to speed it up?


Hello all,

I would like to know if there is any good option to speed up MarkDuplicatesSpark. I am working with a human genome sample of around 900 million reads (151 bp).

I work on a cluster (with Slurm).

The command that I used (with 60G of memory and 14 CPUs) is:

gatk --java-options "-Xmx${SLURM_MEM_PER_NODE}M" MarkDuplicatesSpark \
-M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
--tmp-dir ${tmp_dir} \
--create-output-bam-index false \
-- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.log
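
One detail in the command above that may be worth checking: `-Xmx` is set to the full Slurm allocation, which leaves no room for the memory the JVM uses outside the heap. The sketch below derives a smaller heap size from the allocation. The 80% ratio, the helper name `heap_mb`, and the variable `XMX_MB` are assumptions for illustration, not GATK recommendations, and this is untested as a speed-up.

```shell
# Hypothetical helper: return 80% of a memory allocation given in MB.
# The 80% ratio is an assumed safety margin, not a documented GATK value.
heap_mb() {
  echo $(( $1 * 80 / 100 ))
}

# SLURM_MEM_PER_NODE is set by Slurm in MB; the 61440 fallback (60G) is only
# there so the snippet also runs outside a Slurm job.
XMX_MB=$(heap_mb "${SLURM_MEM_PER_NODE:-61440}")

# Print the option that would replace "-Xmx${SLURM_MEM_PER_NODE}M" in --java-options.
echo "-Xmx${XMX_MB}M"
```

The resulting value could then be passed as `--java-options "-Xmx${XMX_MB}M"`, with the rest of the command unchanged.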

Before running MarkDuplicatesSpark I ran:
- fastp to trim the FASTQ files
- bwa mem 2
- samtools view

I assume that my BAM is sorted by query name, since I did not run any sort step, but how can I be sure?
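
One way to check the declared sort order, assuming samtools is available, is to read the `SO` field of the `@HD` line in the BAM header. This matters because, as far as the GATK documentation describes it, MarkDuplicatesSpark is optimized for queryname-grouped input and spends extra time sorting internally when given coordinate-sorted input. The helper name `sort_order_from_header` and the file name `input.bam` below are placeholders, and the header only reports what was declared, not what the file really contains.

```shell
# Hypothetical helper: read a SAM/BAM header on stdin and print the value of
# the SO (sort order) field from the @HD line, if present.
sort_order_from_header() {
  awk -F'\t' '$1 == "@HD" {
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SO:/) { print substr($i, 4); exit }
    }
  }'
}

# Usage (input.bam is a placeholder for the real file):
#   samtools view -H input.bam | sort_order_from_header
```

Typical values are `queryname`, `coordinate`, `unsorted`, or `unknown`. If nothing is printed, the header has no `@HD` line or no `SO` field, so the order is simply not declared.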

It took more than one day to finish: one file finished after 1 day and 5 hours, and the others are still running.

Please let me know if I can do anything to speed this up.

Thanks in advance





