Merging compressed fastq files based on conditions defined in a csv file

Hello everybody,

I have a question that is somewhat different from, but on a similar topic to, the one addressed in: Post not found

I tried Paul’s bash script from the page linked above (fastq_lane_merging.sh), adapting it to my filename organization as follows:

    #!/bin/bash

    # Loop over the unique third "_"-separated field of every fastq.gz filename
    for i in $(find ./ -type f -name "*.fastq.gz" | while read -r F; do basename "$F" | cut -d "_" -f3; done | sort | uniq)
    do
        echo "Merging R1"
        cat *_*_"$i"_1.fastq.gz > "$i"_AL3936_R1.fastq.gz

        echo "Merging R2"
        cat *_*_"$i"_2.fastq.gz > "$i"_AL3936_R2.fastq.gz
    done

but it does not work.

Let me briefly explain my problem:

I’m a newbie in bioinformatics, so sorry for this basic question. In a shotgun analysis, each sample was sequenced with a paired-end strategy across a different number of lanes. The sequencing center did not use the --no-lane-splitting option of Illumina’s bcl2fastq program to get a single file per sample automatically. As a consequence, my raw data consists of the following files, stored in a folder called raw data:

H3CH3DRXX_1_109UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_1_109UDI-idt-UMI_2.fastq.gz
H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_1_97UDI-idt-UMI_2.fastq.gz
H3CH3DRXX_2_109UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_2_109UDI-idt-UMI_2.fastq.gz
H3CH3DRXX_2_97UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_2_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_1_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_1_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_1_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_1_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_2_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_2_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_2_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_2_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_3_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_3_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_3_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_3_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_4_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_4_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_4_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_4_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_1_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_1_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_2_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_2_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_3_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_3_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_4_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_4_97UDI-idt-UMI_2.fastq.gz
HGV52DSXX_1_97UDI-idt-UMI_1.fastq.gz
HGV52DSXX_1_97UDI-idt-UMI_2.fastq.gz
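
The filenames above seem to follow the pattern flowcell_lane_index_read.fastq.gz. That reading is an assumption based only on the listing, and the field names are mine. Under that assumption, here is a minimal sketch of how one filename could be split into its parts. It also shows why `cut -d "_" -f3` in the script above returns only the index part (e.g. 97UDI-idt-UMI), not a sample name. The function name parse_fastq_name is hypothetical.

```shell
#!/bin/bash
# Split a filename assumed to look like <flowcell>_<lane>_<index>_<read>.fastq.gz
# into four tab-separated fields. parse_fastq_name is a hypothetical helper.
parse_fastq_name() {
    local name flowcell lane index mate
    name=$(basename "$1" .fastq.gz)          # strip directory and extension
    IFS=_ read -r flowcell lane index mate <<< "$name"
    printf '%s\t%s\t%s\t%s\n' "$flowcell" "$lane" "$index" "$mate"
}

parse_fastq_name H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz
```

The third field is shared by every lane and flowcell of a sample, so it cannot by itself produce an AL-style name; that link has to come from a lookup table.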

Using a bash script, I would like to use the file shown below (match_ids.csv), which links each sample name code (starting with “AL”) to the different flowcell/multiplex codes (starting with the letter “H”) that appear in the fastq.gz filenames above:

match_ids.csv file:

Left column: sample ID. Right column: the string from the original filename
that belongs to the sample indicated in the left column.

    AL3936  H3CH3DRXX_1_97UDI-idt-UMI
    AL3936  H3CH3DRXX_2_97UDI-idt-UMI
    AL3936  HGTVGDSXX_1_97UDI-idt-UMI
    AL3936  HGTVGDSXX_2_97UDI-idt-UMI
    AL3936  HGTVGDSXX_3_97UDI-idt-UMI
    AL3936  HGTVGDSXX_4_97UDI-idt-UMI
    AL3936  HGV35DSXX_1_97UDI-idt-UMI
    AL3936  HGV35DSXX_2_97UDI-idt-UMI
    AL3936  HGV35DSXX_3_97UDI-idt-UMI
    AL3936  HGV35DSXX_4_97UDI-idt-UMI
    AL3936  HGV52DSXX_1_97UDI-idt-UMI
    AL3936  HGV52DSXX_2_97UDI-idt-UMI
    AL3936  HGV52DSXX_3_97UDI-idt-UMI
    AL3936  HGV52DSXX_4_97UDI-idt-UMI
    AL3937  H3CH3DRXX_1_109UDI-idt-UMI
    AL3937  H3CH3DRXX_2_109UDI-idt-UMI
    AL3937  HGTVGDSXX_1_109UDI-idt-UMI
    AL3937  HGTVGDSXX_2_109UDI-idt-UMI
    AL3937  HGTVGDSXX_3_109UDI-idt-UMI
    AL3937  HGTVGDSXX_4_109UDI-idt-UMI
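
A minimal sketch of how this table could be queried from bash. It assumes there is no header line and that the two columns are separated by spaces, tabs, or a comma (the listing above shows spaces despite the .csv extension). The function name list_prefixes is hypothetical.

```shell
#!/bin/bash
# list_prefixes SAMPLE TABLE
# Print every filename prefix (second column) mapped to SAMPLE (first column).
# Assumes: no header line; columns separated by spaces, tabs, or a comma.
list_prefixes() {
    local wanted=$1 table=$2 sample prefix rest
    # "|| [ -n ... ]" also processes a last line that lacks a trailing newline
    while IFS=$' \t,' read -r sample prefix rest || [ -n "$sample" ]; do
        prefix=${prefix%$'\r'}               # tolerate Windows line endings
        if [ "$sample" = "$wanted" ]; then
            printf '%s\n' "$prefix"
        fi
    done < "$table"
    return 0
}
```

For example, `list_prefixes AL3937 match_ids.csv` should print the six prefixes listed for AL3937, one per line.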

Therefore, I would like to create a bash script that merges all forward (and, separately, all reverse) fastq.gz files belonging to the same sample name (as defined in the csv file), with each merged fastq file named after the sample name code (starting with “AL”).

To illustrate the problem, here is an example of what I am trying to achieve.

From the following forward fastq files:

    H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz
    H3CH3DRXX_2_97UDI-idt-UMI_1.fastq.gz
    HGTVGDSXX_1_97UDI-idt-UMI_1.fastq.gz
    HGTVGDSXX_2_97UDI-idt-UMI_1.fastq.gz
    HGTVGDSXX_3_97UDI-idt-UMI_1.fastq.gz
    HGV35DSXX_1_97UDI-idt-UMI_1.fastq.gz
    HGV35DSXX_2_97UDI-idt-UMI_1.fastq.gz
    HGV35DSXX_3_97UDI-idt-UMI_1.fastq.gz
    HGV35DSXX_4_97UDI-idt-UMI_1.fastq.gz
    HGV52DSXX_1_97UDI-idt-UMI_1.fastq.gz

Desired output:

Only one fastq.gz file containing the 10 fastq.gz files above merged, with the following filename:

    AL3936_1.fastq.gz
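
To make the request concrete, here is a rough sketch of the kind of script I have in mind; it is not a tested solution. It assumes the table has no header and uses spaces, tabs, or a comma between columns, and that reverse reads end in _2.fastq.gz. It relies on the fact that concatenated gzip files form a valid multi-member gzip file. The function name merge_by_table and its arguments are hypothetical.

```shell
#!/bin/bash
# merge_by_table TABLE INDIR OUTDIR
# For each row "<sample> <prefix>" in TABLE, append INDIR/<prefix>_1.fastq.gz
# to OUTDIR/<sample>_1.fastq.gz and INDIR/<prefix>_2.fastq.gz to
# OUTDIR/<sample>_2.fastq.gz. Rows whose pair is incomplete are skipped.
# Assumes: no header line; columns separated by spaces, tabs, or a comma.
merge_by_table() {
    local table=$1 indir=$2 outdir=$3 sample prefix rest r1 r2
    mkdir -p "$outdir"
    # Turn commas and carriage returns into spaces so "read" can split columns.
    tr ',\r' '  ' < "$table" |
    while read -r sample prefix rest || [ -n "$sample" ]; do
        [ -n "$prefix" ] || continue         # skip blank or one-column rows
        r1="$indir/${prefix}_1.fastq.gz"
        r2="$indir/${prefix}_2.fastq.gz"
        if [ -f "$r1" ] && [ -f "$r2" ]; then
            # Both mates are appended in the same step, so R1 and R2 stay
            # in the same file order.
            cat "$r1" >> "$outdir/${sample}_1.fastq.gz"
            cat "$r2" >> "$outdir/${sample}_2.fastq.gz"
        else
            echo "warning: incomplete pair for $prefix, skipping" >&2
        fi
    done
}
```

It would be called as `merge_by_table match_ids.csv "raw data" merged`. Because the sketch appends with `>>`, the output folder should be empty before each run, otherwise reads would be duplicated. Prefixes listed in the table but absent from the folder (such as HGV52DSXX_2_97UDI-idt-UMI in my case) only produce a warning.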

Thanks in advance for your help and hints,
