When importing my quant.sf files into R using tximport, should I set ‘ignoreTxVersion’ to True or False?


I’m working through my first batch of RNA-Seq analysis and unfortunately I don’t have an experienced bioinformatician to work with. My question is regarding tximport of my quant.sf files into R. I have been working with the EquCab3.0 reference transcriptome from NCBI to generate these quant.sf files, but I have encountered a problem when importing into R.

Although the NCBI .fasta file has the NM_transcript identifiers, they don’t appear to have unique delimiters for the corresponding gene ID’s that I need to create the 2 column data.frame. Therefore, I ran the following from the EquCab3.0 gtf file:


gtf <- rtracklayer::import("GCF_002863925.1_EquCab3.0_genomic.gtf")

Then to generate the mapping table I ran

tx2gene = unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])


 transcript_id      gene_id
1  XM_005602135.3        SYCE1
14 XM_005602137.3        SYCE1
26 XM_023636208.1 LOC111772506
33 XM_023636216.1 LOC111772506
39 XM_023636230.1 LOC111772506
45 XM_023636202.1 LOC111772506

For reference, each of my 10 quant.sf files look like this:

head quant.sf

Name    Length  EffectiveLength TPM NumReads
NM_001081757.1  5648    5649.138    5.559883    1566.000
NM_001081758.1  5984    5841.368    0.028292    8.240
NM_001081759.1  5982    5654.000    0.000000    0.000
NM_001081760.1  5980    5652.000    0.000000    0.000
NM_001081761.1  5505    5177.000    0.000000    0.000
NM_001081762.1  4872    4553.211    0.010932    2.482
NM_001081763.1  4629    4471.577    13.722008   3059.301
NM_001081764.1  4553    4225.000    0.000000    0.000
NM_001081765.1  4125    3676.439    29.286268   5368.281

For the final code I ran the ‘ignoreTxVersion = T’ which gives an error message

count_data = tximport(files = sample_files,
         type = "salmon",
         tx2gene = tx2gene,
         ignoreTxVersion = T) 

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
Error in .local(object, ...) : 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [NM_001081757, NM_001081758, NM_001081759, ...]

Example IDs (tx2gene): [XM_005602135.3, XM_005602137.3, XM_023636208.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

However, when I set ‘ignoreTxVersion = F’, it seems to work:

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
summarizing abundance
summarizing counts
summarizing length

Is there something obvious I have missed here, or am I ok to move forward with the analysis? From what I’ve read, it’s better to set this option to ‘TRUE’ for downstream analysis with DeSeq. Any advice is greatly appreciated.

Thanks all!

Read more here: Source link