SRA/ENA library layout is inconsistent with the data source

Project number: PRJNA505380
Example Run accession: SRR8244780
Issue:
The library layout declared for the Run is inconsistent with the data source.

According to the library layout recorded in both ENA and SRA, the Runs in BioProject PRJNA505380 should be paired-end data. However, some of them have only a single FASTQ file, with no “_1” or “_2” suffix to indicate paired-end reads.
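One way to list the affected Runs would be to compare the declared layout with the number of FASTQ files per Run. This is only a minimal sketch: it assumes a tab-separated report with a header line containing the columns run_accession, library_layout and fastq_ftp (with fastq_ftp holding a semicolon-separated file list), and the function name flag_layout_mismatch is made up for illustration.

```shell
# Hypothetical helper: list runs declared PAIRED that expose fewer than two FASTQ files.
# Assumption: the input is a TSV whose header names the columns
# run_accession, library_layout and fastq_ftp (files separated by ";").
# Usage: flag_layout_mismatch report.tsv
flag_layout_mismatch() {
  awk -F '\t' '
    NR == 1 {                               # map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      run    = $(col["run_accession"])
      layout = $(col["library_layout"])
      ftp    = $(col["fastq_ftp"])
      n = (ftp == "") ? 0 : split(ftp, files, ";")
      if (layout == "PAIRED" && n < 2) print run "\t" layout "\t" n
    }
  ' "$1"
}
```

I assume such a report can be downloaded from the ENA Portal API "filereport" endpoint by requesting those three fields for PRJNA505380 in TSV format, but I have not verified the exact query here.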

I took a closer look at the example Run using the following command under Ubuntu 18:

grep @SRR8244780.1000510 SRR8244780.fastq

Some of the results are shown here:

@SRR8244780.10005100 10005100/2
@SRR8244780.10005101 10005101/2
@SRR8244780.10005102 10005102/2
@SRR8244780.10005103 10005103/2
@SRR8244780.10005104 10005104/2
@SRR8244780.10005105 10005105/2
@SRR8244780.10005106 10005106/2
@SRR8244780.10005107 10005107/2
@SRR8244780.10005108 10005108/2
@SRR8244780.10005109 10005109/2
@SRR8244780.100051087 100051087/1
@SRR8244780.100051088 100051088/1
@SRR8244780.100051089 100051089/1
@SRR8244780.100051090 100051090/1
@SRR8244780.100051091 100051091/1
@SRR8244780.100051092 100051092/1
@SRR8244780.100051093 100051093/1
@SRR8244780.100051094 100051094/1
@SRR8244780.100051095 100051095/1
@SRR8244780.100051096 100051096/1
@SRR8244780.100051097 100051097/1

Because I can’t see the original ID of each read, I can only assume that all the IDs I “grep”ed are unique read IDs. No read ID is duplicated like this:

@SRR123456789.123 123/1
@SRR123456789.123 123/2
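Instead of judging from a handful of grep hits, the whole file could be checked in one pass. The following is a rough sketch, assuming an uncompressed FASTQ with strict 4-line records and headers ending in /1 or /2 as shown above; summarize_headers is a hypothetical name, not an existing tool.

```shell
# Hypothetical helper: summarise mate tags and repeated read IDs in a FASTQ file.
# Assumption: 4-line records, headers like "@SRR8244780.10005100 10005100/2".
# Usage: summarize_headers SRR8244780.fastq
summarize_headers() {
  awk '
    NR % 4 == 1 {
      id = $1                               # first field, e.g. @SRR8244780.10005100
      if ($0 ~ /\/1$/) tag1++
      else if ($0 ~ /\/2$/) tag2++
      else untagged++
      if (seen[id]++ == 1) repeated++       # count each repeated ID only once
    }
    END {
      printf "tag1=%.0f tag2=%.0f untagged=%.0f repeated_ids=%.0f\n", tag1, tag2, untagged, repeated
    }
  ' "$1"
}
```

If repeated_ids comes out as 0 while both tag1 and tag2 are non-zero, that would confirm what the grep output suggests. Note that the seen array keeps every read ID in memory, so this may need a lot of RAM on a file with around a hundred million reads.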

Generally, if you have a single-end read with an Illumina identifier, it should look like this:

grep HWI-ST337R:419:C1NFJACXX:2:1101:13942:2686 a.fastq

Output:

@SRRxxxx.xxxx HWI-ST337R:419:C1NFJACXX:2:1101:13942:2686/1

For a single-end FASTQ, you should get only one read ID and no /2 tag (if I’m correct).

Clearly, the read IDs in my case carry both /1 and /2 tags. What confuses me is that the same FASTQ contains paired-end tags but no duplicated IDs. I’m not sure whether this file is a concatenated FASTQ or an interleaved FASTQ. Someone previously used the tree command to separate an interleaved FASTQ into two files. I tried that too, but it was very time-consuming, so I didn’t finish it.
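To tell a concatenated file from an interleaved one, it might help to look at how the tags are ordered, and a single awk pass could also do the splitting. Both functions below are sketches under the same assumptions as before (uncompressed FASTQ, 4-line records, headers ending in /1 or /2); tag_runs and split_by_tag are hypothetical names.

```shell
# Hypothetical helper: print runs of consecutive records that share a mate tag.
# Alternating runs of length 1 would suggest an interleaved file; one long /1
# run followed by one long /2 run would suggest a concatenated file.
# Usage: tag_runs SRR8244780.fastq | head
tag_runs() {
  awk '
    NR % 4 == 1 {
      if ($0 ~ /\/1$/) tag = "/1"
      else if ($0 ~ /\/2$/) tag = "/2"
      else tag = "none"
      if (tag != prev) {                    # tag changed: report the finished run
        if (n > 0) printf "%s x %.0f\n", prev, n
        prev = tag
        n = 0
      }
      n++
    }
    END { if (n > 0) printf "%s x %.0f\n", prev, n }
  ' "$1"
}

# Hypothetical helper: write /2 records to one file and all other records
# (including untagged ones) to another, in a single pass.
# Usage: split_by_tag in.fastq out_1.fastq out_2.fastq
split_by_tag() {
  awk -v out1="$2" -v out2="$3" '
    NR % 4 == 1 { dest = ($0 ~ /\/2$/) ? out2 : out1 }
    { print > dest }
  ' "$1"
}
```

Since no read ID in my file carries both tags, the two output files would only be separated by tag; they would not be properly matched mate pairs.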

My question is: how should I deal with this kind of data? Can I just treat it as single-end data? Or can these data not be used for downstream analysis?

I used fastp for quality assessment of this data, treating it as a single-end FASTQ. The number of reads and the number of bases are equal to those reported in ENA.
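The same comparison can be cross-checked without fastp. A minimal sketch, assuming an uncompressed FASTQ with 4-line records (count_reads_bases is a hypothetical name):

```shell
# Hypothetical helper: print "<reads> <bases>" for an uncompressed 4-line FASTQ.
# Usage: count_reads_bases SRR8244780.fastq
count_reads_bases() {
  awk '
    NR % 4 == 2 { reads++; bases += length($0) }   # sequence line of each record
    END { printf "%.0f %.0f\n", reads, bases }
  ' "$1"
}
```

The two numbers can then be compared by hand with the read and base counts shown on the ENA page for the Run.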

Thank you.

BTW, this is not the first time I have run into this problem. I don’t understand why ENA and SRA both allow submitters to mistakenly upload this kind of data as “paired-end” without simply checking how many files were uploaded. Not to mention that NCBI SRA does not allow concatenated raw FASTQ files to be uploaded.
