How to get FASTQ reads from the Sequence Read Archive (SRA)

Getting data out of the Sequence Read Archive is a tedious and error-prone process, thanks to clunky interfaces and changing methodologies.

  1. If you want a subset of the reads, say 1000 reads, use fastq-dump -X 1000 SRR14575325
  2. If you want the entire file, use fasterq-dump SRR14575325
  3. If you want to be in full control, find the URLs, then use wget or curl to get the data
  4. If you feel lucky, or want to get super confused and fight bugs and inconsistencies, use prefetch SRR14575325

We will be downloading the file SRR14575325

  • SRR14575325.sra is 577M
  • SRR14575325.fastq.gz is 3.3G

Note that some methods store (cache) files, hence subsequent runs may be faster.

Tools

Some examples require tools from sratools. To install them, use:

# Currently installs version 2.9
# (the bioconda package is named sra-tools)
conda install sra-tools

or visit the webpage and download binaries:

github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

where the latest version is 2.11
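Since conda and the binary download can leave you with different versions, it helps to check which one is actually on your PATH. The sketch below is one way to do that. The helper names sra_version and version_at_least are made up for this example, and it assumes that fastq-dump --version prints a line containing a dotted version number.

```shell
# Extract the first dotted version number (e.g. 2.11.2) from text on stdin.
sra_version() {
    grep -Eo '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed (exit 0) when version $1 is at least version $2.
# Major and minor are compared as numbers, so 2.9 sorts before 2.10.
version_at_least() {
    have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
    want_major=${2%%.*}; want_rest=${2#*.}; want_minor=${want_rest%%.*}
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Usage (requires the SRA tools on the PATH):
#   v=$(fastq-dump --version | sra_version)
#   version_at_least "$v" 2.10 && echo "version 2.10 or newer"
```

Comparing the parts as numbers matters here, because a plain string comparison would put 2.9 after 2.10.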

Use fastq-dump directly:

# Clean the cache
rm -f ~/ncbi/public/sra/SRR14575325*

# Convert ten reads
time fastq-dump SRR14575325 -X 10
# 1 second

# Convert all reads
time fastq-dump SRR14575325
# 5 minutes

Total time: 5 minutes. fastq-dump stores the SRA file in a cache folder. On my system it is located at:

~/ncbi/public/sra/SRR14575325.sra

Subsequent fastq-dump runs on the same accession will take 1 minute.
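Because the timings depend so much on whether the file is already cached, it can be useful to check before you run anything. Here is a minimal sketch; the function name cached_sra is made up, and the default directory is simply the location seen on my system above, so yours may differ.

```shell
# Print the path of the cached .sra file for an accession, if it exists.
# $1 = accession
# $2 = cache directory (optional; the default below is an assumption)
cached_sra() {
    cache_dir=${2:-"$HOME/ncbi/public/sra"}
    cached_file="$cache_dir/$1.sra"
    if [ -f "$cached_file" ]; then
        echo "$cached_file"
    else
        return 1
    fi
}

# Usage:
#   cached_sra SRR14575325 && echo "already cached, expect a faster run"
```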

Using fasterq-dump

fasterq-dump is intended to be the future replacement for fastq-dump, but according to its documentation, it requires up to 10x as much disk space as the final file while it runs. In addition, it does not yet support extracting a limited number of reads:
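If you only need a few reads but still want to use fasterq-dump, one workaround is to convert everything and trim the result afterwards. The sketch below does the trimming; first_reads is a made-up helper name, and it assumes the FASTQ file uses exactly four lines per read. It saves no download or conversion time, it only gives you a smaller file to test with.

```shell
# Keep only the first N reads of a FASTQ stream on stdin.
# $1 = number of reads to keep
# Assumes the common layout of exactly four lines per read.
first_reads() {
    head -n "$(( $1 * 4 ))"
}

# Usage, once fasterq-dump has produced SRR14575325.fastq:
#   first_reads 1000 < SRR14575325.fastq > subset.fastq
```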

# Clean your cache file
rm -f ~/ncbi/public/sra/SRR14575325*

# Convert all reads
time fasterq-dump -f SRR14575325
# 1.1 minutes

Total time 1 minute. fasterq-dump also stores the data in the cache as:

~/ncbi/public/sra/SRR14575325.sra.cache

but if you have already run prefetch, it can use that file as well.

Subsequent runs take 30 seconds.

Download the SRA file directly

The URL is in the 10th column of the output that you get with:

# Find the URL
efetch -db sra -id SRR14575325 -format runinfo | cut -f 10 -d ,

prints:

download_path
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/SRR14575325/SRR14575325.1
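Counting columns by hand is fragile if the runinfo layout ever changes, so it may be safer to pick the column by its header name. The sketch below shows one way; csv_column is a made-up helper, and it splits naively on commas, so it assumes no quoted field contains a comma.

```shell
# Print the values of a named column from comma-separated text on stdin.
# $1 = column name as it appears in the header row
# Naive split on commas: quoted fields containing commas are not handled.
csv_column() {
    awk -F , -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col && NF { print $col }
    '
}

# Usage:
#   efetch -db sra -id SRR14575325 -format runinfo | csv_column download_path
```

Unlike the cut command, this prints only the values and skips the header line and any blank lines.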

Download an SRA file locally and use that:

# Clean your cache file
rm -f ~/ncbi/public/sra/SRR14575325*

URL1=https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/SRR14575325/SRR14575325.1

time wget "$URL1"
# 52 seconds

fastq-dump SRR14575325.1
# 1 minute

Total time: 2 minutes. As before, we end up with both an SRA file and a FASTQ file.

Using prefetch

The sratools prefetch command downloads an SRA file and stores it for later use. The behavior of prefetch has changed: versions before 2.10 download files into the cache directory, while versions 2.10 and above download them into the current directory, in a folder named after the accession.

The new versions of prefetch do not operate seamlessly with fastq-dump anymore. In versions under 2.10 the two commands:

prefetch SRR14575325
fastq-dump SRR14575325

would both use the same files. With the new version, you would need to run

prefetch SRR14575325
fastq-dump SRR14575325/SRR14575325.sra
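If the same script has to work with both old and new versions, one option is to look at what prefetch left behind instead of checking version numbers. This is only a sketch: sra_input is a made-up helper, and it assumes the new layout is always ACCESSION/ACCESSION.sra in the current directory, as in the example above.

```shell
# Print the argument to hand to fastq-dump after running prefetch.
# $1 = accession
# If prefetch created ./ACC/ACC.sra (the 2.10+ layout), print that path.
# Otherwise print the bare accession, so that fastq-dump looks in its
# cache as the older versions did.
sra_input() {
    if [ -f "$1/$1.sra" ]; then
        echo "$1/$1.sra"
    else
        echo "$1"
    fi
}

# Usage:
#   prefetch SRR14575325
#   fastq-dump "$(sra_input SRR14575325)"
```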

¯\_(ツ)_/¯ … all in the name of progress, I guess. Just remember that commands and examples in training materials may not work correctly anymore. Some people claim that prefetch can download FASTQ files with:

prefetch --type fastq SRR14575325

but I get:

2021-10-14T18:01:00 prefetch.2.11.2 err: name not found while resolving query within virtual file system module - failed to resolve accession 'SRR1972739' - no data ( 404 )

Let me put you at ease here. It is normal behavior for a tool in the sratools package to raise absurd and weird errors. These types of errors have been happening for a long time, and it is completely unclear why they occur or what to do about them. If you get this error, just pick a different method from the list.

Let’s continue the journey. This was run with version 2.9:

# Clean the cache
rm -f ~/ncbi/public/sra/SRR14575325*

time prefetch SRR14575325
# 57 seconds

# Convert ten reads
time fastq-dump SRR14575325 -X 10
# 0 seconds

# Convert all reads
time fastq-dump SRR14575325
# 1 minute

Total time: 2 minutes. prefetch stores the SRA file in the cache.

Subsequent conversions with fastq-dump will take 1 minute.

Download from EBI

To find the EBI link to the FASTQ file for a run, use:

curl -X GET "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR14575325&fields=fastq_ftp&result=read_run"

prints:

run_accession   fastq_ftp
SRR14575325 ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
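The report is tab-separated, and for paired-end runs the fastq_ftp field holds several files separated by semicolons, so it can help to turn it into a clean list of URLs first. The sketch below does that. The name ena_fastq_urls is made up, and prefixing each entry with ftp:// is an assumption about how you want to fetch the files.

```shell
# Turn an ENA filereport (tab-separated, on stdin) into one URL per line.
# Assumes the header row names a fastq_ftp column and that multiple files
# in that field are separated by semicolons.
ena_fastq_urls() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i; next }
        col && $col != "" {
            n = split($col, parts, ";")
            for (j = 1; j <= n; j++) print "ftp://" parts[j]   # scheme is an assumption
        }
    '
}

# Usage:
#   curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR14575325&fields=fastq_ftp&result=read_run" \
#       | ena_fastq_urls \
#       | while read -r url; do wget "$url"; done
```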

Let’s use the EBI link:

URL2=ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
wget "$URL2"

The download is slow: the estimated time was 15 minutes, and I did not wait for it to finish.
