MarkDuplicatesSpark: how to speed it up?
Hello all,
I would like to know whether there is a good way to speed up MarkDuplicatesSpark. I am working with human genome data with around 900 million reads (151 bp).
I work on a cluster (with Slurm).
The command I used (with 60 GB of memory and 14 CPUs) is:
gatk --java-options "-Xmx${SLURM_MEM_PER_NODE}M" MarkDuplicatesSpark \
    -M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
    --tmp-dir ${tmp_dir} \
    --create-output-bam-index false \
    -- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.log
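For reference, the resources described above could be requested in a job script roughly like the sketch below. The `heap_mb` helper and the 80% figure are hypothetical and are not part of the original command. They only illustrate leaving some of the Slurm allocation free for the JVM's non-heap memory rather than passing the full `SLURM_MEM_PER_NODE` to `-Xmx`.

```shell
#!/bin/bash
#SBATCH --mem=60G
#SBATCH --cpus-per-task=14

# Hypothetical helper: print a heap size in MB equal to a percentage
# (default 80, an arbitrary assumption) of the allocation given in MB.
heap_mb() {
    local total_mb=$1 percent=${2:-80}
    echo $(( total_mb * percent / 100 ))
}

# SLURM_MEM_PER_NODE is only set inside a Slurm job that requested --mem
if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
    echo "-Xmx$(heap_mb "$SLURM_MEM_PER_NODE")M"
fi
```

The printed value could then be used in `--java-options` in place of the full allocation.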
Before running MarkDuplicatesSpark I ran:
- fastp to trim the FASTQ files
- bwa-mem2
- samtools view
I assume that my BAM is sorted by query name, since I didn't run any sort step, but how can I be sure?
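One way to check is to look at the `SO:` tag on the `@HD` line of the BAM header. The sketch below is only illustrative, and `sort_order_from_header` is a hypothetical helper name. It prints the declared sort order, or `unknown` if the header has no `@HD` line or no `SO:` tag, which is often the case for unsorted aligner output.

```shell
#!/bin/bash

# Hypothetical helper: read a SAM/BAM header on stdin and print the
# value of the SO: tag from the @HD line, or "unknown" if it is absent.
sort_order_from_header() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        END { if (!found) print "unknown" }
    '
}

# If a BAM path is given as the first argument, inspect its header
if [ $# -ge 1 ]; then
    samtools view -H "$1" | sort_order_from_header
fi
```

The header only records what was declared. An aligner normally writes reads in input order, so mates stay together even when no sort order is declared. According to the GATK documentation, MarkDuplicatesSpark is optimized for queryname-grouped input and spends extra time sorting internally if given coordinate-sorted input.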
It took more than a day to finish (one file finished after 1 day and 5 hours; the others are still running).
Please let me know if there is anything I could do to speed this up.
Thanks in advance