Detailed differences between sambamba and samtools

3 month , My first post in the new student group , The false-positive mutation appears because duplicates mark Not enough ?, Tells the story of supplementary read It won’t be GATK MarkDuplicates Marked as duplicates The problem of .

after , In response to this question , I began to work on my opponent’s bam Reprocess , And write a general process for the laboratory to use .

technological process

Although last time I recommended samtools rmdup and MarkDuplicatesSpark, But considering that most students use it more often GATK, and MarkDuplicatesSpark The speed of is too slow , So in the end, I choose queryname Use… After sorting MarkDuplicates To deal with it .

Put an existing bam again mark duplicates The process is as follows :

$ sambamba sort -t 24 -n tumor.bam -o query_sorted.bam
$ gatk MarkDuplicates -I query_sorted.bam -O gatk_markduplicates.bam -M temp.metrics
$ sambamba sort -t 24 gatk_markduplicates.bam -o gatk_markduplicates_sorted.bam
$ sambamba index gatk_markduplicates_sorted.bam

sambamba After all, it’s not as good as samtools Commonly used , So I wrote a samtools edition .

$ samtools sort -@ 24 -n tumor.bam -o query_sorted.bam
$ gatk MarkDuplicates -I query_sorted.bam -O gatk_markduplicates.bam -M temp.metrics
$ samtools sort -@ 24 gatk_markduplicates.bam -o gatk_markduplicates_sorted.bam
$ samtools index gatk_markduplicates_sorted.bam

But I haven’t tested this process before , Test it .

There was a mistake .

$ tail samtools_test.error
[Mon May 02 12:37:48 CST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 25.35 minutes.
Runtime.totalMemory()=28222947328
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///hpc/data/home/slst/zhangly/yangjk/duplicates/samtools/gatk_markduplicates.bam. Sort order is queryname. Offending records are at [A00583:225:H2FV5DSXY:1:1101:1000:9111] and [A00583:225:H2FV5DSXY:1:1101:1000:10207]
at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:197)
at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:184)
at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:36)
at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:16)
at htsjdk.samtools.util.AbstractAsyncWriter$WriterRunnable.run(AbstractAsyncWriter.java:123)
at java.lang.Thread.run(Thread.java:745)

The reason for this is queryname The question of sequencing , Although the order is queryname, But there are conflicting record.

Problem finding

Since using sambamba I succeeded in running , explain sambamba The sorting method is GATK Compatible .

The two software are respectively queryname Sorting generates results .

$ sambamba sort -t 24 -n tumor.bam -o query_sorted_sambamba.bam
$ samtools sort -@ 24 -n tumor.bam -o query_sorted_samtools.bam

Make a preliminary comparison with the previous queryname.

$ samtools view --no-header query_grouped_sambamba.bam | awk '{print $1}' | uniq | head -5
A00583:225:H2FV5DSXY:1:1101:10004:10363
A00583:225:H2FV5DSXY:1:1101:10004:11397
A00583:225:H2FV5DSXY:1:1101:10004:11584
A00583:225:H2FV5DSXY:1:1101:10004:14998
A00583:225:H2FV5DSXY:1:1101:10004:15217
$ samtools view --no-header query_grouped_samtools.bam | awk '{print $1}' | uniq | head -5
A00583:225:H2FV5DSXY:1:1101:1000:2722
A00583:225:H2FV5DSXY:1:1101:1000:2942
A00583:225:H2FV5DSXY:1:1101:1000:3881
A00583:225:H2FV5DSXY:1:1101:1000:4664
A00583:225:H2FV5DSXY:1:1101:1000:4946

The first five items are different , indicate samtools and sambamba Two different sorting methods are used .

sambamba The first of them is A00583:225:H2FV5DSXY:1:1101:10004:10363,samtools The first of them is A00583:225:H2FV5DSXY:1:1101:1000:2722. therefore , Preliminary analysis ,sambamba Is that 4 Than : Of ASCII It’s small , So bring the same person 4 Put it in the front . and samtools, Obviously think 1000 Numerically better than 10004 Small , So will 1000 Put it in front . These two sorting methods are used sort When I ordered, I met .

$ echo -e '200\n1000' | sort
1000
200
$ echo -e '200\n1000' | sort -n
200
1000

added -n Just sort by value , By default, press ASCII Sort code .

Take a closer look at the error message queryname Before and after .

$ samtools view --no-header query_grouped_sambamba.bam | awk '{print $1}' | uniq | grep -A 2 -B 2 -n -E "A00583:225:H2FV5DSXY:1:1101:1000:9111|A00583:225:H2FV5DSXY:1:1101:1000:10207"
20-A00583:225:H2FV5DSXY:1:1101:10004:5290
21-A00583:225:H2FV5DSXY:1:1101:10004:9706
22:A00583:225:H2FV5DSXY:1:1101:1000:10207
23-A00583:225:H2FV5DSXY:1:1101:1000:12054
24-A00583:225:H2FV5DSXY:1:1101:1000:12148
--
44-A00583:225:H2FV5DSXY:1:1101:1000:6605
45-A00583:225:H2FV5DSXY:1:1101:1000:6668
46:A00583:225:H2FV5DSXY:1:1101:1000:9111
47-A00583:225:H2FV5DSXY:1:1101:10013:10473
48-A00583:225:H2FV5DSXY:1:1101:10013:12759

stay sambamba in , These two reads Far apart , And the sort order is really according to each position ASCII Code ordered .

$ samtools view --no-header query_grouped_samtools.bam | awk '{print $1}' | uniq | grep -A 2 -B 2 -n -E "A00583:225:H2FV5DSXY:1:1101:1000:9111|A00583:225:H2FV5DSXY:1:1101:1000:10207"
8-A00583:225:H2FV5DSXY:1:1101:1000:6605
9-A00583:225:H2FV5DSXY:1:1101:1000:6668
10:A00583:225:H2FV5DSXY:1:1101:1000:9111
11:A00583:225:H2FV5DSXY:1:1101:1000:10207
12-A00583:225:H2FV5DSXY:1:1101:1000:12054
13-A00583:225:H2FV5DSXY:1:1101:1000:12148

samtools here , Two article reads together , Obviously, it is sorted according to the size of the value . Although this method is more intuitive , But with the GATK Are not compatible , therefore GATK After seeing it, he reported an error .

Attempt to resolve

Find out samtools After my little problem , Have a look at samtools-sort file [1].

“When the -n option is present, records are sorted by name. Names are compared so as to give a “natural” ordering — i.e. sections consisting of digits are compared numerically while all other sections are compared based on their binary representation. This means “a1” will come before “b1” and “a9” will come before “a10”. Records with the same name will be ordered according to the values of the READ1 and READ2 flags (see flags).

The original official has written ,samtools sort -n The order of execution is natural order, Specifically, the part containing numbers will be compared according to the size of the value , Lead to a9 stay a10 In front of .

I was there too github Search the , See if anyone is as confused as I am about why they do this . I found that someone really asked this question , There’s a lot more .

such as smowton[2] Put forward ,samtools sort -n The result of sorting is not lexical sort.lexical sort It means dictionary order , That is to say ASCII code ,sambamba The sorting method used . The official answer is to take a look SAM/BAM sorting order clarification[3].

After my understanding of this issue A rough reading of , Find out :

  1. Officials have long known this problem .
  2. Officials have not determined which ranking method is really correct , So I’m not going to change .
  3. The officially chosen solution is in SAM This problem is clearly stated in the format of documents (some definition of fixed).

after , I went to have a look at the mentioned SAM Format document [4].

stay Tag Of SO part .

“For queryname sort, no explicit requirement is made regarding the ordering other than that it be applied consistently throughout the entire file. Footnote 6: It is known that widely used software libraries have differing definitions of the queryname sort order, meaning care should be taken when operating on multiple files of varying provenance.

The final suggestion is , There are differences between different software , Be careful when using .

Section 1.3.1 This part introduces the common sorting methods .

summary

The problem encountered today has not been solved . in general , In order to and GATK compatible , Should not be used samtools Conduct queryname Sort , Or use sambamba, Or use GATK SortSam(GATK I must be compatible with myself ).

Last , A suggestion is right fastq The process is as follows .sambamba In dealing with pipe You need to specify the /dev/stdin and /dev/stdout.

reference=~/project/0-reference/2-GATK/hg19/human_g1k_v37_decoy.fasta
name=SRR8619267
fq1=${name}_1.clean.fq.gz
fq2=${name}_2.clean.fq.gz
bwa mem -t 24 -R "@RG\tPU:"${name}"\tID:"${name}"\tSM:"${name}"\tLB:WXS\tPL:ILLUMINA" $reference $fq1 $fq2 | \
sambamba view -t 24 -f bam -S /dev/stdin -o /dev/stdout | sambamba sort -t 24 -n /dev/stdin -o ${name}.bam
gatk MarkDuplicates -I ${name}.bam -O ${name}_markdup.bam -M ${name}.metrics -ASO queryname
sambamba sort -t 24 ${name}_markdup.bam -o ${name}_sorted.bam
sambamba index ${name}_sorted.bam

Reference material

[1]samtools-sort file : www.htslib.org/doc/samtools-sort.html

[2]smowton: github.com/samtools/samtools/issues/398

[3]SAM/BAM sorting order clarification: github.com/samtools/hts-specs/issues/5

[4]SAM Format document : samtools.github.io/hts-specs/SAMv1.pdf

Read more here: Source link