RSQLite slowness

Dear biostar readers,

why is it so much slower to query an sqlite database using RSQlite «from the outside» using param like

statement <-
 "SELECT * FROM gene2refseq R
  LEFT JOIN gene_info I ON
    I.GeneID = R.GeneID
  WHERE R.`RNA_nucleotide_accession.version`
    LIKE ?"
db <- "gene_info.sqlite"
conn <- DBI::dbConnect(RSQLite::SQLite(), db)
x1 <- DBI::dbGetQuery(conn=conn,
  statement=statement,
  param=list(Håkan20210914$RNANucleotideAccession))

compared to querying «from the inside» of sqlite, by writing your search terms as a table first, and then calling it

statement <-
 "SELECT * FROM H
  LEFT JOIN gene2refseq R ON
    R.`RNA_nucleotide_accession.version`
    LIKE '%' || H.RNANucleotideAccession || '%'
  LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
DBI::dbWriteTable(conn, "H", Håkan20210914)
x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
DBI::dbDisconnect(conn)

On my system (E5-2603 v4), the first query took more than an hour, while the second took only a few minutes …

Do you guys know of any faster (but also nice) way to dig around in very large tsv files like ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz and ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz ?

This is double-posted from stat.ethz.ch/pipermail/r-help/2021-October/472465.html, but I thought some people might know more here compared to r-help …

Best,
Rasmus

Read more here: Source link

Tagged