How to parallelize bcftools mpileup with GNU parallel?

I think that this would be a solution:

I started with a file that contains the scaffold name and lengths:

JH739887    30495534
JH739888    29527584
JH739889    22321128
[...]

I then added a column with “1” in it to make it the “start” position. Since the scaffolds are not placed on the genome, they all start at one (at least from what I understand)

awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes

Remove all the spaces to create a tab delimited file

tr ' ' 't' < chrom.start.stop.sizes > chrom.start.stop_tabs.sizes

The file should look like this:

JH739887    1   30495534
JH739888    1   29527584
JH739889    1   22321128
[...]

Then split (gsplit because I’m on a mac!) the files so that they each contain about 2724 scaffolds (the total number of scaffolds is 27239 so 27239/10 ~2724 so the last file will have 1 scaffold less than all the other files).

gsplit --numeric-suffixes=1 -n l/10 --additional-suffix='.txt' chrom.start.stop_tabs.sizes scaff

The first file should have about 2724 lines which correspond to the scaffolds and look like this:

JH739887    1   30495534
JH739888    1   29527584
JH739889    1   22321128
[...]

When I used wc -l on the 10 files, I had that:

2501+2618 +2618 +2749 +2792 +2792 +2792 +2793 +2792 +2792  = 27239 total

Then, after the generation of bam files AND the samtools indexing of the bam files, use this code to make it parallel.

parallel -j 15 'bcftools mpileup -Ou -f ref.fa -R scaff$(printf "%02d" {}).txt -b bam_list.txt | bcftools call -m -v -Oz -o myfile{}.vcf.gz' ::: {1..10}

Some explainations:

-j 15: use 15 cores
-R scaff$(printf "%02d" {}).txt: will be printing the file name with pads of zeros (so it's going to be scaff01.txt, etc.)
- {}: is a placeholder in parallel which is going to be replaced by what is after the ::: which is the `{1..10}`.

Improvement in the future:
I realized that the scaffolds were ordered by size. So the scaffolds at the end of the list will run VERY fast and the one at the top of the list VERY slow. So mixing the scaffolds in the files in gsplit could be nice. In the end, this might not be ideal, but at least, a solution for people who want to experiment with this code!

UPDATE: To shuffle do this;

Before the line:

awk '$1 = $1 FS "1"' chrom.sizes > chrom.start.stop.sizes

Add:

shuf chrom.sizes > chrom.start.stop.sizes.shuffled

And change the name of the file accordingly. You could set a seed by looking at this post which I paste the function here:

get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt 
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)

Concatenating the files: at the end of the parallel command, there will be 10 different files that need to be put together. Here is a small script that’ll do it:

for i in *.vcf.gz
do 
echo `basename $i`
done | sort -V >> vcf_list.txt

bcftools concat --file-list vcf_list.txt -Ob --output output.bcf

Read more here: Source link