Illumina Trimming Algorithm

Our group (the UC Davis Bioinformatics Core) has some internal tools we’ve developed. The first is Scythe, an adapter trimmer. It uses the quality information in a FASTQ entry and a prior to decide whether a 3′ substring is adapter. Very basically, it takes a naïve Bayesian approach to classifying 3′-end contaminants only. Because the 3′-end bases are the lowest quality and the most likely to be contaminated (especially as reads get longer and longer), Scythe is designed specifically to remove these contaminants. Removing other contaminants can be done with other tools. The prior can be set to different thresholds; I recommend using less to get a sense of the 3′-end adapter contaminant rate. If you’re doing an assembly, you may want very, very strict trimming: for example, if the adapter contamination seems high and the 3′-end adapter begins with GATC, removing all 3′-end GA, GAT, and GATC substrings (as well as all the longer, more likely matches). We find this works well in our projects, but any feedback is welcome.
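To make the idea of a quality-aware Bayesian call on a 3′ substring concrete, here is a minimal Python sketch. It is not Scythe's actual code or its exact likelihood model; the function names, the 0.5 posterior cutoff, the default prior, the Phred+33 offset, and the example adapter sequence are all assumptions made for illustration. Under the "adapter" hypothesis each read base matches the adapter with probability 1 - e (e being the error probability from its quality score) and mismatches with probability e/3; under the "random" hypothesis each base has probability 0.25.

```python
def phred_error(qual_char, offset=33):
    """Convert one FASTQ quality character to an error probability.

    Assumes Phred+33 encoding by default (an assumption, not from the post).
    """
    return 10 ** (-(ord(qual_char) - offset) / 10)


def adapter_posterior(tail, tail_quals, adapter_prefix, prior=0.05, offset=33):
    """Posterior probability that `tail` is adapter rather than random sequence.

    Simplified illustrative model, not Scythe's exact likelihood.
    """
    like_adapter = 1.0
    like_random = 1.0
    for base, q, expected in zip(tail, tail_quals, adapter_prefix):
        err = phred_error(q, offset)
        # A confident mismatch is strong evidence against adapter;
        # a low-quality mismatch is only weak evidence.
        like_adapter *= (1.0 - err) if base == expected else err / 3.0
        like_random *= 0.25
    numerator = prior * like_adapter
    return numerator / (numerator + (1.0 - prior) * like_random)


def trim_3prime_adapter(seq, quals, adapter, prior=0.05, min_posterior=0.5):
    """Trim the 3' overlap with the highest posterior, if it beats the cutoff."""
    best_k, best_post = 0, min_posterior
    for k in range(1, min(len(seq), len(adapter)) + 1):
        # Compare the last k read bases against the first k adapter bases.
        post = adapter_posterior(seq[-k:], quals[-k:], adapter[:k], prior)
        if post > best_post:
            best_k, best_post = k, post
    if best_k == 0:
        return seq, quals
    return seq[:-best_k], quals[:-best_k]


# Hypothetical adapter and read, for illustration only.
adapter = "GATCGGAAGAGC"
read = "ACGTACGTAC" + "GATCGGAAGA"
print(trim_3prime_adapter(read, "I" * len(read), adapter))
```

In this sketch, raising the prior is what makes short matches such as a trailing GA count as adapter, which corresponds to the stricter trimming described above; with a low prior, only longer matches are removed.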

Sickle is a sliding window quality trimmer, designed to be used after Scythe. Unlike cutadapt and other tools, our pipelines remove adapter contaminants before quality trimming, because removing poor-quality bases first throws away information that could help identify a 3′-end adapter contaminant. Thus, our quality control system works by first running preliminary quality checks with my qrqc package, then running Scythe to remove adapters, then quality trimming with Sickle, then running qrqc again to see how the sequences have changed.
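For readers unfamiliar with sliding window quality trimming, the general technique can be sketched in a few lines of Python. This is a simplified, 3′-only illustration and not Sickle's implementation; the function name, the window size, the quality threshold, and the Phred+33 offset are assumed values chosen for the example.

```python
def sliding_window_trim(seq, qual, threshold=20, window=5, offset=33):
    """Trim the 3' end of a read where windowed mean quality drops too low.

    Simplified sketch of the general technique; all defaults are assumptions.
    """
    scores = [ord(c) - offset for c in qual]
    n = len(scores)
    if n == 0:
        return seq, qual
    window = min(window, n)
    for start in range(n - window + 1):
        win = scores[start:start + window]
        if sum(win) / window < threshold:
            # Cut at the first base in this window that is itself below
            # the threshold, so good leading bases in the window are kept.
            cut = start
            while cut < start + window and scores[cut] >= threshold:
                cut += 1
            return seq[:cut], qual[:cut]
    # No window fell below the threshold: keep the read as-is.
    return seq, qual


# Six high-quality bases (Q40) followed by four poor ones (Q2).
print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

Running a trimmer like this only after adapter removal matches the ordering argued for above: the low-quality tail is still present when the adapter classifier looks at the read.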

Sometimes Scythe seems a bit greedy, but on closer inspection it’s almost always “too” greedy only with poor-quality sequences, which would be removed by Sickle anyway.

Nik (a member of our group) also wrote Sabre, a barcode demultiplexing and trimming tool (I think this is still alpha though).

Any feedback would be appreciated!
