Hi Everyone,
I just released a new version of BBTools (39.03). There are some exciting new features, including neural networks. Here is the list:
1) New program called “bbcrisprfinder.sh”. It finds CRISPRs… designed for short reads in metagenomes (but it’s also OK on full genomes). It works better if you supply a reference list against which to align all the suspected CRISPRs, or if you use an iterative approach.
2) “checkstrand.sh”. I think it is fantastic. The goal is to analyze RNA-seq data and determine how stranded it is. It’s probably more relevant to institutions like JGI that try various protocols and need to evaluate them, but… if you ever have a (supposedly) strand-specific library and want to know how successful it was, this could help! The important point is that it does not require a reference, and thus works on metagenomes or novel species.
Checkstrand works on anything – proks, euks, shotgun, RNA-seq, etc. The more information you can provide, the better, but it still works on raw, unannotated sequence. For example, if you have a bacterium that has been assembled, you should use something like:
checkstrand.sh in=reads.fq ref=assembly.fa passes=2
Checkstrand has multiple metrics for strandedness, and is designed to work without an assembly, on random metagenomic data. But you can get more detailed information with a reference and, for proks at least, even more detailed information without needing a gff, thanks to the internal prok-only gene-caller. You CAN still set a gff with “gff=”, and it will be used instead of the internal gene-caller.
Basically, the point is to glean all possible information on library strandedness (such as kmer frequency, stop codon frequency, etc) from individual reads or read pairs, and report it to the user.
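To make the stop-codon idea concrete, here is a minimal Python sketch of one reference-free strandedness signal. This is NOT checkstrand's actual algorithm, just an illustration under a simple assumption: a read from a coding region tends to have at least one stop-free reading frame on the sense strand, while its reverse complement usually does not. All function names here are hypothetical.

```python
# Toy reference-free strandedness estimate from stop-codon counts.
# Illustrative only; the names and the voting scheme are assumptions,
# not how checkstrand.sh is implemented.

STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_frame_stops(seq):
    """Fewest stop codons seen in any of the three forward reading frames."""
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in STOPS for codon in codons))
    return min(counts)


def strand_vote(seq):
    """+1 if the read looks more coding as given, -1 if its reverse
    complement looks more coding, 0 if the two are tied."""
    fwd = min_frame_stops(seq)
    rev = min_frame_stops(revcomp(seq))
    return (rev > fwd) - (fwd > rev)


def strandedness(reads):
    """Fraction of informative reads that agree with the majority strand.
    About 0.5 suggests an unstranded library; near 1.0 suggests a stranded one.
    Returns 0.5 when no read is informative."""
    votes = [strand_vote(read.upper()) for read in reads]
    plus, minus = votes.count(1), votes.count(-1)
    informative = plus + minus
    if informative == 0:
        return 0.5
    return max(plus, minus) / informative
```

For example, calling strandedness() on a list of read 1 sequences from an RNA-seq library would give one rough number per library. Real data would need many reads, since a single short read carries very little signal.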
3) SIMD vectorization. You can enable this in some programs via the “simd” flag. It only makes a difference in neural-network-heavy programs like bbmerge or train.sh, but it makes some of those programs 2.5x faster.
4) Neural Networks. This is still a work in progress, but it already makes BBMerge much more accurate. Fortunately, with BBMerge I can generate all the synthetic data I want; it’s much more difficult to train for things like CRISPR detection and gene-calling, where the correct answer is not known.
5) Oh… and there are some other new things: callgenes.sh now allows a “passes=2” flag. You can do however many passes you want, but it seems to converge after 3 passes, and the result is almost identical to 2 passes. The way it works is that the first pass calls genes based on a model made from every prokaryotic genome in RefSeq that has annotations (an accompanying gff file). I thought that worked really well, but 2-pass mode was dramatically better. After calling genes using the existing frequencies, the second pass recalculates the coding and noncoding hexamer probabilities, and all the hexamers around starts and stops, according to the organism being analyzed. The calls are now ~95% concordant with the NCBI annotations (which may or may not be correct), up from 92%.
Anyway, I’d make passes=2 the default except that it is not very useful for metagenomes, so just enable that if you have an isolate prokaryote.
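The two-pass idea can be sketched roughly like this in Python. It is a toy illustration, not the callgenes.sh implementation: it ignores reading frames and the start/stop models, and every name in it (including the threshold and pseudocount parameters) is hypothetical. The point is only to show a generic model making a first round of calls, which are then used to re-estimate hexamer log-odds for the specific organism.

```python
# Toy two-pass scoring with organism-specific hexamer re-estimation.
# Illustrative only; not how callgenes.sh is implemented.
from collections import Counter
from math import log

NUM_HEXAMERS = 4 ** 6


def hexamer_counts(seqs):
    """Count every overlapping 6-mer in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts


def log_odds_model(coding, noncoding, pseudo=1.0):
    """Per-hexamer log(P(hexamer | coding) / P(hexamer | noncoding)),
    with a pseudocount so unseen hexamers do not give division by zero."""
    c, n = hexamer_counts(coding), hexamer_counts(noncoding)
    c_total = sum(c.values()) + pseudo * NUM_HEXAMERS
    n_total = sum(n.values()) + pseudo * NUM_HEXAMERS
    return {
        k: log(((c[k] + pseudo) / c_total) / ((n[k] + pseudo) / n_total))
        for k in set(c) | set(n)
    }


def score(seq, model):
    """Sum of hexamer log-odds; hexamers missing from the model count as 0."""
    return sum(model.get(seq[i:i + 6], 0.0) for i in range(len(seq) - 5))


def two_pass(candidates, generic_model, threshold=0.0):
    """Pass 1: call with the generic model. Pass 2: rebuild the model from
    the pass-1 calls and rescore all candidates with it."""
    scores = [score(seq, generic_model) for seq in candidates]
    called = [s for s, sc in zip(candidates, scores) if sc > threshold]
    rejected = [s for s, sc in zip(candidates, scores) if sc <= threshold]
    if not called or not rejected:
        return called  # nothing to re-estimate from
    refined = log_odds_model(called, rejected)
    return [s for s in candidates if score(s, refined) > threshold]
```

This also hints at why a second pass would help less on a metagenome: the re-estimated model would be a blend of many organisms rather than a fit to one.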