find the mate of a read using pysam

I just came across this problem, and this is the solution I’m using right now. I’m sharing it here because this question is one of the first and most recent results that comes up when googling the problem.

My use case is as follows:

  • I have a list of positions I need to analyze
  • I need to get all the short reads at each position, so I need to use a (position) sorted and indexed .bam file
  • For each short read at each position, I need the mate information (location, sequence, mapping quality)
  • As OP said, the AlignmentFile.mate(r) method is too slow (a baseline sketch of this approach follows the list)
  • I also tried using a .bam file sorted by name, without success
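For reference, here is a minimal sketch of that baseline approach; the file name, contig and position are placeholders, not part of my actual setup:

import pysam

# Placeholder: a coordinate-sorted and indexed .bam file
bam = pysam.AlignmentFile("sample.bam", "rb")

# Reads overlapping one position of interest (placeholder coordinates)
for read in bam.fetch("chr1", 1_000_000, 1_000_001):
    try:
        mate = bam.mate(read)  # re-scans the file for every call -> slow
    except ValueError:
        continue  # unpaired read, or mate unmapped / not found
    print(mate.reference_name, mate.reference_start,
          mate.query_sequence, mate.mapping_quality)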

My current solution, inspired by this post

I’m using the IndexedReads class from the pysam library.

From the pysam documentation: “Index a Sam/BAM-file by query name while keeping the original sort order intact.”

After indexing, you can use it to access the mates much more quickly than with a position-sorted .bam file. On my system, getting the mates of 200 short reads (the number of reads I find at a particular position) takes:

  • Around 9 seconds to get the 200 mates using AlignmentFile.mate(r)
  • Around 0.26 seconds to get the 200 mates using the index (a rough timing sketch follows the list).
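For illustration, this is roughly how such a comparison could be timed; the file name and region are placeholders, and the actual numbers will depend on your data and hardware:

import time

import pysam

# Placeholders: sample.bam and the region are assumptions, not my real data.
bam = pysam.AlignmentFile("sample.bam", "rb")
reads = list(bam.fetch("chr1", 1_000_000, 1_000_001))

# Baseline: AlignmentFile.mate() re-scans the file for every read.
start = time.perf_counter()
for r in reads:
    try:
        bam.mate(r)
    except ValueError:
        pass  # unpaired read, or mate not found
t_mate = time.perf_counter() - start

# IndexedReads: pay the (large) build cost once, then look up mates by name.
name_idx = pysam.IndexedReads(bam, multiple_iterators=True)
name_idx.build()
start = time.perf_counter()
for r in reads:
    try:
        hits = name_idx.find(r.query_name)
    except KeyError:
        continue
    mate = next((h for h in hits if h.is_read1 != r.is_read1), None)
t_idx = time.perf_counter() - start

print(f"mate(): {t_mate:.2f} s, IndexedReads.find(): {t_idx:.2f} s")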

This is definitely an improvement, but using the index has some overhead. For a .bam file of around 50 GB (whole genome), on my system it requires:

  • 51 minutes to build the index
  • 130 GB of RAM

In order to skip building the index every time you run your script, the obvious solution is to save it to disk. Unfortunately, as far as I’m aware, neither pysam nor samtools offers this option. I was not able to find a way to save the whole IndexedReads object from the outside, so we have to modify the pysam library, as suggested in PR 1037.

The PR adds load() and store() methods to the IndexedReads class. Some numbers (using pickle, not json as in the PR):

  • Time to store the index: 13 minutes
  • Time to load the index: 16 minutes
  • RAM after loading: 150 GB (for some reason this is higher than after building; I guess some pickle object is not freed correctly).

Unfortunately the loading step still takes some time (keep in mind my system uses a hard disk, so with an SSD it should improve).

From some quick math (roughly the index load time divided by the time saved per breakpoint), for this method to be beneficial I would need to process more than 114 breakpoints of 200 short reads each (which is my case).

The solution is not perfect, but it may help under the following conditions:

  • You have enough RAM available.
  • You have to get the mates from a lot of short reads.
  • You plan on running the script multiple times.

Some snippets of code that may help:

  • Build, load and store index:
# Imports needed elsewhere in the module:
#   from os import path
#   from pysam import AlignmentFile, IndexedReads
self.shorts: AlignmentFile = AlignmentFile(shorts_file, "rb")
# Index by query name; the second argument (multiple_iterators=True) makes
# lookups reopen the file instead of reusing this handle.
self.shorts_name_idx: IndexedReads = IndexedReads(self.shorts, True)

shorts_name_idx_file: str = shorts_file.replace('.bam', '_name_idx.pickle')

# Load the pickled index if it exists, otherwise build it once and store it.
if path.exists(shorts_name_idx_file):
    self.log_inf.info(f"{shorts_name_idx_file} found, load")
    self.shorts_name_idx.load(shorts_name_idx_file)
    self.log_inf.info(f"loaded {shorts_name_idx_file}")
else:
    self.log_inf.info(f"{shorts_name_idx_file} not found, build")
    self.shorts_name_idx.build()
    self.shorts_name_idx.store(shorts_name_idx_file)
    self.log_inf.info(f"built {shorts_name_idx_file}")
  • Iterate over the reads at a position and get the mate using the index:
for read in self.bam.fetch(self.chrom, self.pos, self.pos + 1):  # reads overlapping the position
    current_pos: int = self.shorts.tell()
    try:
        found: IteratorRow = self.shorts_name_idx.find(read.query_name)

    except KeyError as e:
        print(f"Can not get reads with name {read.query_name}: {e}")

    else:
        mate: Optional[AlignedSegment] = None  # requires: from typing import Optional
        for found_read in found:
            if (found_read.is_read1 and read.is_read2) or \
                    (found_read.is_read2 and read.is_read1):
                mate = found_read

        if mate is not None:
            print(mate.query_sequence)
    finally:
        # For some reason we need to reset the iterator, even if the 
        # IndexedReads is supposed to use a different iterator.
        self.shorts.seek(current_pos)
  • The patch that I’m using for pysam:
diff --git a/pysam/libcalignmentfile.pyi b/pysam/libcalignmentfile.pyi
index 74637f8..b3694d8 100644
--- a/pysam/libcalignmentfile.pyi
+++ b/pysam/libcalignmentfile.pyi
@@ -235,3 +235,5 @@ class IndexedReads:
     ) -> None: ...
     def build(self) -> None: ...
     def find(self, query_name: str) -> IteratorRow: ...
+    def store(self, filename: str) -> None: ...
+    def load(self, filename: str) -> None: ...
diff --git a/pysam/libcalignmentfile.pyx b/pysam/libcalignmentfile.pyx
index 1ee799b..7c75655 100644
--- a/pysam/libcalignmentfile.pyx
+++ b/pysam/libcalignmentfile.pyx
@@ -64,6 +64,8 @@ except ImportError:
 import re
 import warnings
 import array
+import pickle
+
 from libc.errno  cimport errno, EPIPE
 from libc.string cimport strcmp, strpbrk, strerror
 from libc.stdint cimport INT32_MAX
@@ -2928,6 +2930,16 @@ cdef class IndexedReads:
             self.header = samfile.header
             self.owns_samfile = False

+    def store(self, filename):
+        '''Save the index to disk, in the specified filename, using pickle dump.'''
+        with open(filename, 'wb') as f:
+            pickle.dump(self.index, f)
+
+    def load(self, filename):
+        '''Load the index from disk, in the specified filename, using pickle load.'''
+        with open(filename, 'rb') as f:
+            self.index = pickle.load(f)
+
     def build(self):
         '''build the index.'''

In the patch I’m using pickle instead of json, as it gives a smaller index size. The patch is based on commit 8463ba474dfe.

To install the patched pysam in your env, you can:

  • Clone the pysam repo and move to the folder
  • Apply the patch
  • Install cython with pip install cython
  • Install pysam using the setup file with python setup.py install (a quick sanity check is shown below)
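Once installed, you can confirm the patched methods are available and use them outside the class shown above; the file names here are placeholders:

import pysam

# The patched build should expose the two new methods on IndexedReads.
assert hasattr(pysam.IndexedReads, "store")
assert hasattr(pysam.IndexedReads, "load")

# Build once, store to disk, then load the pickle in later runs.
bam = pysam.AlignmentFile("sample.bam", "rb")
idx = pysam.IndexedReads(bam, True)
idx.build()                            # slow: only needed the first time
idx.store("sample_name_idx.pickle")    # patched method (pickle.dump)

idx2 = pysam.IndexedReads(bam, True)
idx2.load("sample_name_idx.pickle")    # patched method (pickle.load)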
