GATK’s GenomicsDBImport takes forever…
I have 90 samples as VCF files, a few terabytes in total. I want to create a single multi-sample VCF file for downstream analysis. I am trying to use GenomicsDBImport for this, but it takes too long: the cluster on which we run our analyses allows a maximum runtime of 7 days, which is apparently not nearly enough.
Our reference genome has 349 contigs (not human), and when running GenomicsDBImport I specify intervals corresponding to all chromosomes, all bases in every chromosome.
I set both thread options to 20, which is the maximum on our cluster.
After seven days, the database is around 15 % finished.
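For scale, here is a rough back-of-the-envelope projection of the total runtime those numbers imply. It is only a sketch and assumes that import progress is linear in time, which may not hold because contigs differ in length and variant density. The function name `projected_total_days` is made up for illustration.

```python
# Rough runtime projection from the figures in the post.
# ASSUMPTION: progress is linear in time; real progress may be uneven.

def projected_total_days(days_elapsed: float, fraction_done: float) -> float:
    """Extrapolate total runtime from elapsed time and fraction completed."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return days_elapsed / fraction_done

days_elapsed = 7       # cluster runtime limit, from the post
fraction_done = 0.15   # "around 15 %" finished, from the post

total = projected_total_days(days_elapsed, fraction_done)
remaining = total - days_elapsed
print(f"projected total: {total:.1f} days, remaining: {remaining:.1f} days")
# prints "projected total: 46.7 days, remaining: 39.7 days"
```

Under that assumption the full import would need roughly seven back-to-back jobs at the 7-day limit.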
Are there options other than specifying a smaller set of intervals? We have no idea which intervals to keep and which to drop, so I’d rather not touch them unless it’s the only way…
Big thanks in advance!