Low transcript quantification with Salmon using GRCm39 annotations

Hi everyone, first time working with mouse samples and unfortunately, there are fewer resources available for the latest mouse Ensembl genome than I was expecting.

What I’ve done:
I performed rRNA depletion on total RNA extracted from mouse tissue and created Illumina libraries using a cDNA synthesis kit with random hexamer priming from Thermo. Libraries looked great by TapeStation so we sequenced on a NextSeq and the %Q30 was fantastic for both reads.

I used TrimGalore for adaptor trimming (lost ~20% of my reads), then did a bioinformatic rRNA depletion filter (lost another ~1%), then ~75% of the total reads mapped to the mouse genome. I used STAR and the GRCm39 v105 annotation from Ensembl (hereby called genome.fasta).

Before running Salmon, I downloaded the corresponding GTF file (GRCm39 v105 from Ensembl). I indexed genome.fasta using samtools faidx then used gffread -w to extract a transcripts.fasta file from the .GTF file. I then used both genome.fasta and transcripts.fasta to create a GRCm39 index with Salmon. After running Salmon, I’m seeing 40-45% of reads that mapped to the mouse genome are mapped to the mouse transcriptome.

I’m trying to see where the rest of the 55-60% of the reads are going- I first tried to use read_distribution.py in RSeQC but I haven’t figured out how to make a genome reference file that worked. I downloaded the GENCODE VM27.refseq annotation from the UCSC Table Browser and unsurprisingly, no tags were found (my guess is the incompatibility between GENCODE and Ensembl gene names). I did some searching and found cvbio UpdateContigNames but there wasn’t a GENCODE -> Ensembl file available. I’ve tried to manually change chr1 to 1 but that also did not work.

I next tried using Picard CollectRnaSeqMetrics but had some trouble making a refFlat object. I used gtfToGenePred -genePredExt on the GTF file then moved the columns around and was able to use the output to run CollectRnaSeqMetrics but it’s saying ~97% of bases are mapping to intergenic regions.

I looked at the .bam file with IGV and by eye, the majority of reads map to exons with fewer reads mapping to introns and even fewer reads mapping to unannotated regions.

All in all, I’m beginning to believe that this GRCm39 v105 GTF file is an incomplete annotation and I’m not really sure what else to try. It feels like I’m wasting the majority of my seq runs if only ~30% of total reads are actually mapping to the transcriptome and usable for differential gene expression analysis.

Any suggestions?

Read more here: Source link