I am trying to automate extracting data from an indexed BAM file and I am getting read errors:
[E::bgzf_read_block] Invalid BGZF header at offset 51686517456
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
This failure happens after about 12 hours of analysis, near the end of the 52GB BAM file:
$ ll HUDEP.control.DS182418.chr8.bam*
-rw-r--r-- 1 areynolds stamlab 51901713847 Jun 15 12:46 HUDEP.control.DS182418.chr8.bam
-rw-r--r-- 1 areynolds stamlab 7250712 Jun 15 12:59 HUDEP.control.DS182418.chr8.bam.bai
I ran samtools quickcheck
on it, which reports no issues:
$ samtools quickcheck -vvv HUDEP.control.DS182418.chr8.bam
verbosity set to 3
checking HUDEP.control.DS182418.chr8.bam
opened HUDEP.control.DS182418.chr8.bam
HUDEP.control.DS182418.chr8.bam is sequence data
HUDEP.control.DS182418.chr8.bam has 25 targets in header.
HUDEP.control.DS182418.chr8.bam has good EOF block.
I can otherwise read data from it with manual samtools
and pysam
commands to random locations within the BAM file.
$ python
Python 3.8.16 (default, Mar 2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>> print(pysam.__version__)
0.21.0
Remaking the BAM file might be an option but it is very time-consuming. I’d like to understand what went wrong, so I don’t have to remake files again.
Is that the only option or are there other tests and fixes I could apply, before this?
Read more here: Source link