Hello, newbie here.
I am reanalyzing the data from an article (GSE83931) for training purposes. I have two questions.
1- I ran FastQC on the sequences, followed by MultiQC. When I look at the individual FastQC reports, they don't show any adapter sequence (please see pic1). (The authors reported that they used Trimmomatic to remove adapters.) However, I can see adapter content in the MultiQC report (pic2). Both pictures come from the same run.
How can we explain the discrepancy here?
2- They reported that the TruSeq3-SE.fa adapter sequences were removed with Trimmomatic. I used cutadapt instead. The adapter sequence I found online (based on the FastQC report) is: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
I used the following command:
cutadapt -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -m 50 -j 4 -o SRR3734812_trim50.fastq.gz --length-tag 'length=' SRR3734812.fastq.gz
Output:
This is cutadapt 1.18 with Python 3.7.6
Command line parameters: -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -m 50 -j 4 -o SRR3734812_trim50.fastq.gz --length-tag length= SRR3734812.fastq.gz
Processing reads on 4 cores in single-end mode ...
Finished in 709.18 s (28 us/read; 2.16 M reads/minute).
=== Summary ===
Total reads processed:              25,562,072
Reads with adapters:                   783,598 (3.1%)
Reads that were too short:                   0 (0.0%)
Reads written (passing filters):    25,562,072 (100.0%)
Total basepairs processed: 2,556,207,200 bp
Total written (filtered):  2,553,044,075 bp (99.9%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA; Type: regular 3'; Length: 34; Trimmed: 783598 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3
Bases preceding removed adapters:
A: 24.0%
C: 31.0%
G: 29.6%
T: 15.5%
none/other: 0.0%
Overview of removed sequences
length  count   expect    max.err  error counts
3       529182  399407.4  0        529182
4       116588  99851.8   0        116588
5       39583   24963.0   0        39583
6       16724   6240.7    0        16724
7       14190   1560.2    0        14190
8       12594   390.0     0        12594
9       11809   97.5      0        11202 607
10      10917   24.4      1        10045 872
11      9490    6.1       1        9007 483
12      8432    1.5       1        8112 320
13      7396    0.4       1        7214 182
14      6684    0.1       1        2 6682
15      8       0.0       1        0 8
17      1       0.0       1        0 1
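A quick arithmetic check on the "Overview of removed sequences" table: as far as I understand it, the "expect" column is the number of reads in which a match of that length would turn up purely by chance, i.e. total reads × 0.25^length, assuming all four bases are equally likely. The sketch below recomputes it from the numbers in the report; the function name `expected_by_chance` is my own, not part of cutadapt.

```python
TOTAL_READS = 25_562_072  # "Total reads processed" from the report above

# (removed length -> observed count) for the shortest matches in the report
observed = {3: 529_182, 4: 116_588, 5: 39_583, 6: 16_724, 7: 14_190}


def expected_by_chance(total_reads, length):
    """Reads expected to match `length` adapter bases at random.

    Assumes a uniform base composition (each base has probability 0.25).
    """
    return total_reads * 0.25 ** length


for length, count in observed.items():
    exp = expected_by_chance(TOTAL_READS, length)
    print(f"{length} bp: observed {count}, expected {exp:.1f}, "
          f"observed/expected {count / exp:.2f}")
```

The recomputed values agree with the "expect" column (399407.4 for 3 bp, 99851.8 for 4 bp). The 3 bp and 4 bp counts are only about 1.2 to 1.3 times the chance expectation, so a large share of the 783,598 "reads with adapters" may just be random 3-4 base matches at the read end, while matches of 7 bp and longer are far above chance.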
After trimming, I ran FastQC again on the trimmed file. Apparently it did something, as the sequence length is now 83-100 (pic3). However, when I compare the first 3-4 reads before and after trimming, they look the same. How can I validate the trimming step?
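One way I could imagine checking this is to walk through both files in parallel and tally how many bases were removed from each read, rather than eyeballing the first few records (with only 3.1% of reads trimmed, the first 3-4 reads are likely to be untouched). This is only a minimal sketch under some assumptions: both files are gzipped 4-line FASTQ, no reads were discarded (the report says 0 were too short), and the read order is the same in both files. The helper names `read_fastq` and `compare_trimming` are my own.

```python
import gzip
from collections import Counter


def read_fastq(path):
    """Yield (read_name, sequence) for each record of a gzipped 4-line FASTQ."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            # Keep only the first token so a changed length= tag is ignored
            yield header[1:].split()[0], seq


def compare_trimming(raw_path, trimmed_path, limit=None):
    """Return a Counter mapping 'bases removed from the 3' end' -> read count."""
    removed = Counter()
    pairs = zip(read_fastq(raw_path), read_fastq(trimmed_path))
    for i, ((raw_name, raw_seq), (trim_name, trim_seq)) in enumerate(pairs):
        if limit is not None and i >= limit:
            break
        if raw_name != trim_name:
            raise ValueError(f"read order differs at record {i}: "
                             f"{raw_name} vs {trim_name}")
        # 3' adapter trimming should leave a prefix of the original read
        if not raw_seq.startswith(trim_seq):
            raise ValueError(f"{raw_name}: trimmed read is not a prefix of the raw read")
        removed[len(raw_seq) - len(trim_seq)] += 1
    return removed
```

Calling `compare_trimming("SRR3734812.fastq.gz", "SRR3734812_trim50.fastq.gz", limit=1_000_000)` should then give a tally of removed lengths for the first million reads. If trimming worked as reported, roughly 3% of reads should show a non-zero value, and the distribution should resemble the "Overview of removed sequences" table.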
A naïve question: should all reads contain an adapter, or only some of them? (The report says 3% of the reads have adapters.) Also, although it is not mentioned in the article, could the authors have uploaded already-trimmed sequences to GEO?
Thank you for your time!