I am struggling to find a solution to a problem that seems easy but is not. I found many questions that seem to be related (and I believe they are), but they are confusing and it is hard to tell which one fits a given case. So here we go. I’ll try to keep it simple for future readers who are in the same situation.
What I have:
- file_R1 containing many reads from mixed samples
- file_R2 containing many reads from mixed samples
- table.csv containing 3 columns: sample_name, forward_string, reverse_string
table.csv has many rows, one for each combination of forward_string and reverse_string.
For example: I have 10 forward_string values and 10 reverse_string values. This means that I have 100 samples (10 forward_string x 10 reverse_string = 100 combinations).
Each sample is characterised by two reads, R1 and R2, that are identified by a combination of forward_string[1..10] and reverse_string[1..10]; call it (fwd, rev). Ideally, once I find R1 and R2 for sample_i (with 1 <= i <= 100), these two reads will have the same header: bingo! I then have all the information I need to determine which sample the two reads come from.
Each read (it does not matter if in file_R1 or in file_R2) may contain one of these:
- no_string <- this is a sample that I will NOT consider
file R1.fastq contains forward reads in this form:
    @read:111111 1:N:0:1
    AAAAAAAAACTGACGTTGAGGGACGAAGCCTTGGGT
    +
    HHGHHHHHHHHHGFGFGHABBCCFFFFFFFGGGGGG
    @read:9085 1:N:0:1
    CGCAGAGTGATGGCCAGCCGCCCGTGAAATTCCCGGGCTCAACC
    +
    AA@A??AA/BAF21FA/BEFGCCCGAFGADAFFF1BFG1F1EEC
    @read:7634 1:N:0:1
    CGATACGTGTGCCAGCAGCCGCGGTAATACGTAGGTGGCA
    +
    GAFGFG?EFHHAGGGGEGGGGFBEFGEB@-->--/@@@@B
file R2.fastq contains reverse reads in this form:
    @read:111111 2:N:0:1
    CCCTGTTATTAGGTTCGTTGAGGGACGAAGCCTTGGGTAACG
    +
    ABGGGGGGGGGBCCFFFFFFFGGGGGHHGGHHHGGHHHGHHG
    @read:6576 2:N:0:1
    CGCAGAGTGTGCCAGCCGCCGCGGTAATACGAAGGGGGCT
    +
    GAFGFG/>//?EAF/:FBF??B-FB-/BB@--B/BB9A--
    @read:3457 2:N:0:1
    CGATACGTGTGCGGTCGTCAAGTACTAAACTGTAACTGACGCT
    +
    BBBBBBBFFFB..099AC.:.CFF0BF/////;/;9;///.;.
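For reference, the header-matching step on its own can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the files are plain 4-line FASTQ with no blank lines, and the read name is the first whitespace-separated token of the header. `read_fastq` and `pair_by_header` are hypothetical helper names, and the embedded records are the first two from each file above.

```python
import io

# First two records of each file from the question, embedded for a runnable demo.
R1_TEXT = """\
@read:111111 1:N:0:1
AAAAAAAAACTGACGTTGAGGGACGAAGCCTTGGGT
+
HHGHHHHHHHHHGFGFGHABBCCFFFFFFFGGGGGG
@read:9085 1:N:0:1
CGCAGAGTGATGGCCAGCCGCCCGTGAAATTCCCGGGCTCAACC
+
AA@A??AA/BAF21FA/BEFGCCCGAFGADAFFF1BFG1F1EEC
"""
R2_TEXT = """\
@read:111111 2:N:0:1
CCCTGTTATTAGGTTCGTTGAGGGACGAAGCCTTGGGTAACG
+
ABGGGGGGGGGBCCFFFFFFFGGGGGHHGGHHHGGHHHGHHG
@read:6576 2:N:0:1
CGCAGAGTGTGCCAGCCGCCGCGGTAATACGAAGGGGGCT
+
GAFGFG/>//?EAF/:FBF??B-FB-/BB@--B/BB9A--
"""


def read_fastq(handle):
    """Yield (read_name, sequence, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of input (a blank line also stops this sketch)
        seq = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().strip()
        # "@read:111111 1:N:0:1" -> "read:111111"
        yield header[1:].split()[0], seq, qual


def pair_by_header(r1_handle, r2_handle):
    """Yield (read_name, r1_record, r2_record) for names present in both files."""
    # Assumption: R2 fits in memory. Fine for a sketch, not for very large files.
    r2_records = {name: (seq, qual) for name, seq, qual in read_fastq(r2_handle)}
    for name, seq, qual in read_fastq(r1_handle):
        if name in r2_records:
            yield name, (seq, qual), r2_records[name]


pairs = list(pair_by_header(io.StringIO(R1_TEXT), io.StringIO(R2_TEXT)))
for name, (seq1, _), (seq2, _) in pairs:
    print(name, seq1[:5], seq2[:5])
```

If both files listed the reads in the same order, reading them in lockstep would avoid holding R2 in a dictionary; the example files above do not guarantee that, so the sketch does not rely on it.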
the matching table is like this:
    sample  day  site  forward  reverse
    S1      1    1     AAAAA    CCCCC
    S2      1    1     AAAAA    TTTAA
    S3      1    2     AAAAA    TTTAG
    S4      1    2     AAAAA    TTTAC
    S1      2    1     TTTTT    TTTGA
    S2      2    1     TTTTT    TTTGC
    S3      2    2     TTTTT    TTTGT
    S4      2    2     TTTTT    TTTCT
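To make the matching concrete, the table can be turned into a dictionary keyed by the (fwd, rev) pair. The following is a minimal Python sketch, assuming whitespace-separated columns named exactly as in the example above (a real table.csv may use commas and other column names); `load_lookup` is a hypothetical helper name.

```python
import io

# The example table from the question; in practice this would be an open file.
TABLE_TEXT = """\
sample day site forward reverse
S1 1 1 AAAAA CCCCC
S2 1 1 AAAAA TTTAA
S3 1 2 AAAAA TTTAG
S4 1 2 AAAAA TTTAC
S1 2 1 TTTTT TTTGA
S2 2 1 TTTTT TTTGC
S3 2 2 TTTTT TTTGT
S4 2 2 TTTTT TTTCT
"""


def load_lookup(handle):
    """Return a dict mapping (forward, reverse) -> 'sample_day_site'."""
    column_names = handle.readline().split()
    lookup = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        row = dict(zip(column_names, fields))
        key = (row["forward"], row["reverse"])
        if key in lookup:
            # Two samples sharing a primer pair could not be told apart.
            raise ValueError(f"duplicate primer pair: {key}")
        lookup[key] = f"{row['sample']}_{row['day']}_{row['site']}"
    return lookup


lookup = load_lookup(io.StringIO(TABLE_TEXT))
print(lookup[("AAAAA", "CCCCC")])  # S1_1_1
```

Keying on the pair rather than on either string alone reflects the point that neither primer identifies a sample by itself.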
At this point I can look for the forward and reverse strings in the reads of the R1 and R2 files.
In this specific example I find that the read from R1
    @read:111111 1:N:0:1
    [CCCCC]AAAACTGACGTTGAGGGACGAAGCCTTGGGT
    +
    HHGHHHHHHHHHGFGFGHABBCCFFFFFFFGGGGGG
and the read from R2
    @read:111111 2:N:0:1
    [AAAAA]TTATTAGGTTCGTTGAGGGACGAAGCCTTGGGTAACG
    +
    ABGGGGGGGGGBCCFFFFFFFGGGGGHHGGHHHGGHHHGHHG
belong to the sample S1_1_1 (sample_day_site) since the read from R1 has CCCCC as primer, the read from R2 has AAAAA as primer, and the read name is the same, i.e. read:111111.
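The assignment described above could be sketched roughly as follows in Python. This is a sketch under several assumptions, not a tested pipeline: primers are matched exactly at the very start of each read (no mismatches, no offset), all primers have the same length, and either read of a pair may carry the forward primer, as in the example where CCCCC sits in R1. `find_prefix` and `assign_sample` are hypothetical names, and `LOOKUP` holds only three rows of the table for brevity.

```python
# A few (forward, reverse) -> sample_day_site entries taken from the table above.
LOOKUP = {
    ("AAAAA", "CCCCC"): "S1_1_1",
    ("AAAAA", "TTTAA"): "S2_1_1",
    ("TTTTT", "TTTGA"): "S1_2_1",
}


def find_prefix(seq, candidates):
    """Return the candidate that seq starts with, or None if there is none."""
    for candidate in sorted(candidates):
        if seq.startswith(candidate):
            return candidate
    return None


def assign_sample(seq_r1, seq_r2, lookup):
    """Return the sample name for a read pair, or None if no primer pair matches."""
    forwards = {fwd for fwd, _ in lookup}
    reverses = {rev for _, rev in lookup}
    # Try forward-in-R1 / reverse-in-R2 first, then the swapped orientation.
    for fwd_read, rev_read in ((seq_r1, seq_r2), (seq_r2, seq_r1)):
        fwd = find_prefix(fwd_read, forwards)
        rev = find_prefix(rev_read, reverses)
        if fwd is not None and rev is not None and (fwd, rev) in lookup:
            return lookup[(fwd, rev)]
    return None  # corresponds to the no_string case: the pair is not considered


# The read pair from the example, with the brackets around the primers removed.
r1_seq = "CCCCCAAAACTGACGTTGAGGGACGAAGCCTTGGGT"
r2_seq = "AAAAATTATTAGGTTCGTTGAGGGACGAAGCCTTGGGTAACG"
print(assign_sample(r1_seq, r2_seq, LOOKUP))  # S1_1_1
```

Real reads often have sequencing errors in the primer region, so exact prefix matching will drop some pairs; tolerating mismatches would need a different comparison than `str.startswith`.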
As a side note: no, matching on headers alone is not possible, because I need the (fwd, rev) pair to assign the reads to a sample. Without this information I can’t match the two reads (R1 and R2) to their specific sample.
I tried to go through QIIME2, cutadapt, search engines, stackexchange-whatever, biostars. I also have a very ugly pipeline but it’s slow and has issues so I was wondering if I missed some tool that can do this easily. Any suggestion is very welcome.