Read block operation failed with BAM file

I am trying to automate extracting data from an indexed BAM file and I am getting read errors:

[E::bgzf_read_block] Invalid BGZF header at offset 51686517456
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes

This failure happens after about 12 hours of analysis, near the end of the 52GB BAM file:

$ ll HUDEP.control.DS182418.chr8.bam*
-rw-r--r-- 1 areynolds stamlab 51901713847 Jun 15 12:46 HUDEP.control.DS182418.chr8.bam
-rw-r--r-- 1 areynolds stamlab     7250712 Jun 15 12:59 HUDEP.control.DS182418.chr8.bam.bai

I ran samtools quickcheck on it, which reports no issues:

$ samtools quickcheck -vvv HUDEP.control.DS182418.chr8.bam
verbosity set to 3
checking HUDEP.control.DS182418.chr8.bam
opened HUDEP.control.DS182418.chr8.bam
HUDEP.control.DS182418.chr8.bam is sequence data
HUDEP.control.DS182418.chr8.bam has 25 targets in header.
HUDEP.control.DS182418.chr8.bam has good EOF block.

I can otherwise read data from it with manual samtools and pysam commands to random locations within the BAM file.

$ python
Python 3.8.16 (default, Mar  2 2023, 03:21:46) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>> print(pysam.__version__)
0.21.0

Remaking the BAM file might be an option but it is very time-consuming. I’d like to understand what went wrong, so I don’t have to remake files again.

Is that the only option or are there other tests and fixes I could apply, before this?

Read more here: Source link