Potential segfault bug in featureCounts using long read data

Hi, I think I might have found a bug in featureCounts from Rsubread (v2.12.3).
I am trying to find reads overlapping exon junctions from a personalised reference, using Nanopore long read BAMs.

I am afraid I cannot share fully reproducible code as I am using my own reference, but this is is what I run and what I get

#Get exon-specific read counts with featureCounts
library(Rsubread)
sample <- commandArgs(trailingOnly = TRUE)
gtf_file <- "personalised.gtf"

#Generate a simplified annotation format (SAF) object from the gtf to get only unique exonic regions
gtf_df <- read.table(gtf_file, sep = "\t", header = FALSE, stringsAsFactors = FALSE)

saf_df <- gtf_df[gtf_df$V3 == "exon", ]
#Keep columns of interest only
saf_df <- saf_df[, c(9, 1, 4, 5, 7)]
#Rename
colnames(saf_df) <- c("GeneID", "Chr",  "Start",    "End",  "Strand")
#Clean gene ID
saf_df$GeneID <- gsub(pattern = "(.+ gene_name )(.+)(;)",
                      replacement = "\\2",
                      x = saf_df$GeneID)

saf_df <- unique(saf_df)


featureCounts_out <- featureCounts(files = "full.bam",
                                   annot.ext = saf_df, useMetaFeatures = FALSE, isGTFAnnotationFile = FALSE,
                                   minOverlap = 3, isLongRead = TRUE,
                                   isPairedEnd = FALSE, juncCounts = TRUE,
                                   genome = "personalised_reference.fa",
                                   countMultiMappingReads = FALSE, allowMultiOverlap = TRUE)

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.12.3

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           full.bam             ||
||                                                                            ||
||              Paired-end : no                                               ||
||        Count read pairs : no                                               ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||       Junction Counting : <output_file>.jcounts                            ||
||                 Threads : 1                                                ||
||                   Level : feature level                                    ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : counted                                          ||
||   Min overlapping bases : 3                                                ||
||          Long read mode : yes                                              ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid29999 ...         ||
||    Features : 46                                                           ||
||    Meta-features : 3                                                       ||
||    Chromosomes/contigs : 6                                                 ||
||                                                                            ||
|| Load FASTA contigs from personalised_reference.fa...                 ||
||    5 contigs were loaded                                                   ||
||                                                                            ||
|| Process BAM file full.bam...                   ||
[1]    29999 segmentation fault  R

Now, if I run it with the option juncCounts = FALSE, it does run, but I get the following error:

ERROR: sequence length in the BAM record is out of the expected region: 11207, 7084

From this post, I understand this means that a read is longer than the contig it is meant to align to. So I tried generating a bam without reads that long using

samtools view -h full.bam | awk 'length($10) < 11000 || $1 ~ /^@/' > test.bam

Using the bam without the long read then works with no issues, but I think it would be worth for future releases to either ignore such cases for junction counts or to output a more informative error.

Many thanks!

Carlos

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /flask/apps/eb/software/FlexiBLAS/3.0.4-GCC-11.2.0/lib64/libflexiblas.so.3.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.12.3

loaded via a namespace (and not attached):
[1] compiler_4.2.0  Matrix_1.5-1    tools_4.2.0     grid_4.2.0     
[5] lattice_0.20-45

Read more here: Source link