how to demultiplex paired end reads when R1 and R2 are identified by two different substrings?

I am struggling with finding a solution to a problem which seems easy but it’s not. I found many many questions that seems to be related (and I believe they are) but they are confusing and you never know which one fits your case. So there we go. I’ll try to keep it simple for future readers that are in this same situation.

What I have:

  • file_R1 containing lots of mixed samples reads
  • file_R2 containing lots of mixed samples reads
  • table.csv containing 3 columns: sample_name, forward_string, reverse_string

Table.csv has many rows, one for each combination of forward_string and reverse_string.

For example: I have 10 forward_string and 10 reverse_string. This means that I have 100 samples (10 forward_string x 10 reverse_string = 100 combinations).

Each sample is characterised by two reads, R1 and R2 that are found by a combination of forward_string[1..10] – reverse_string[1…10], call it (fwd, rev). Ideally, once I find R1 and R2 for sample_i (with 1 <= i <= 100) these reads will have the same header: bingo! I now have all the info I need to determine from which sample these two reads come from.

Each read (it does not matter if in file_R1 or in file_R2) may contain one of these:

  • forward_string
  • reverse_string
  • no_string <- this is a sample that I will NOT consider

file R1.fastq contains forward reads in this form:

@read:111111 1:N:0:1
AAAAAAAAACTGACGTTGAGGGACGAAGCCTTGGGT
+
HHGHHHHHHHHHGFGFGHABBCCFFFFFFFGGGGGG
@read:9085 1:N:0:1
CGCAGAGTGATGGCCAGCCGCCCGTGAAATTCCCGGGCTCAACC
+
AA@A??AA/BAF21FA/BEFGCCCGAFGADAFFF1BFG1F1EEC
@read:7634 1:N:0:1
CGATACGTGTGCCAGCAGCCGCGGTAATACGTAGGTGGCA
+
GAFGFG?EFHHAGGGGEGGGGFBEFGEB@-->--/@@@@B

file R2.fastq contains reverse reads in this form:

@read:111111 2:N:0:1
CCCTGTTATTAGGTTCGTTGAGGGACGAAGCCTTGGGTAACG
+
ABGGGGGGGGGBCCFFFFFFFGGGGGHHGGHHHGGHHHGHHG
@read:6576 2:N:0:1
CGCAGAGTGTGCCAGCCGCCGCGGTAATACGAAGGGGGCT
+
GAFGFG/>//?EAF/:FBF??B-FB-/BB@--B/BB9A--
@read:3457 2:N:0:1
CGATACGTGTGCGGTCGTCAAGTACTAAACTGTAACTGACGCT
+
BBBBBBBFFFB..099AC.:.CFF0BF/////;/;9;///.;.

the matching table is like this:

sample day   site   forward reverse
  S1    1       1   AAAAA   CCCCC
  S2    1       1   AAAAA   TTTAA
  S3    1       2   AAAAA   TTTAG
  S4    1       2   AAAAA   TTTAC
  S1    2       1   TTTTT   TTTGA
  S2    2       1   TTTTT   TTTGC
  S3    2       2   TTTTT   TTTGT
  S4    2       2   TTTTT   TTTCT

At this point I can map forward and reverse in the two R1 and R2 files.

In this specific example I find that the read from R1

@read:111111 1:N:0:1
[CCCCC]AAAACTGACGTTGAGGGACGAAGCCTTGGGT
+
HHGHHHHHHHHHGFGFGHABBCCFFFFFFFGGGGGG

and the read from R2

@read:111111 2:N:0:1
[AAAAA]TTATTAGGTTCGTTGAGGGACGAAGCCTTGGGTAACG
+
ABGGGGGGGGGBCCFFFFFFFGGGGGHHGGHHHGGHHHGHHG

belong to the sample S1_1_1 (sample_day_site) since the read from R1 has CCCCC as primer, the read from R2 has AAAAA as primer, and the read name is the same, i.e. read:111111.

As a side note, no it is not possible to find matching headers because I need the couple (fwd, rev) to be able to match it to a sample. Without this information I can’t match the reads (R1 and R2) with its specific sample.

I tried to go through QIIME2, cutadapt, search engines, stackexchange-whatever, biostars. I also have a very ugly pipeline but it’s slow and has issues so I was wondering if I missed some tool that can do this easily. Any suggestion is very welcome.

Source link