I have 90 samples as VCF files, which together take up a few terabytes. I want to combine them into a single multi-sample VCF file for downstream analysis. I am trying to use GenomicsDBImport for this, but it takes far too long: the cluster where we run our analyses allows a maximum runtime of 7 days, which is apparently nowhere near enough.

Our reference genome (not human) has 349 contigs. When running GenomicsDBImport, I specify intervals that cover every chromosome in full, from the first base to the last.
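
For illustration, a full-coverage interval list of this kind can be generated from the reference index. This is only a minimal sketch, assuming a samtools-style `.fai` index (tab-separated, contig name in the first column and contig length in the second). The function name `intervals_from_fai` and the two-contig example data are hypothetical and not taken from the actual run.

```python
def intervals_from_fai(fai_lines):
    """Return one 1-based, inclusive 'contig:1-length' interval per .fai line."""
    intervals = []
    for line in fai_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Only the first two .fai columns (name, length) are needed here.
        name, length = line.split("\t")[:2]
        intervals.append(f"{name}:1-{int(length)}")
    return intervals


# Hypothetical two-contig index, standing in for the real 349-contig one.
example_fai = [
    "contig_1\t1000\t10\t60\t61\n",
    "contig_2\t250\t1040\t60\t61\n",
]
intervals = intervals_from_fai(example_fai)
print("\n".join(intervals))
```

With a real index, the lines would come from reading the `.fai` file instead of `example_fai`, giving 349 intervals, one per contig.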

I set both thread options to 20, which is the maximum on our cluster.

After seven days, the database is only around 15% finished.

Are there any options other than specifying a smaller set of intervals? We have no idea which intervals to keep and which to drop, so I’d rather not touch them unless it’s the only way…

