MALT extract reads and BWA aln

So MALT is a weird beast… it’s not a taxonomic identifier in the broader sense… it’s a reference-based identifier.
So if your reference, for example, has two species of a bacterial genus but not the third one, you will most likely get hits going to one or the other, which gives you a false positive because the true positive wasn’t in there. Partial hits are still hits.

So with aDNA you of course want to allow for mismatches, but that does not mean you can set your ID too low. Otherwise you pass the threshold where it still works for C>T and G>A base changes etc. and end up in the area where you are just matching mouse DNA to a human (they are also about 85% identical in the broader sense). Additionally, you have the problem of short fragments, which are already more error prone: the fewer bases you have, the less unique a sequence becomes.
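To make the C>T / G>A point a bit more concrete, here is a minimal sketch (not how MALT itself scores alignments) that computes percent identity for an ungapped read/reference pair, once counting every mismatch and once treating deamination-like substitutions as matches. The function name, the `ignore_damage` flag and the two toy sequences are made up for illustration.

```python
def percent_identity(ref: str, read: str, ignore_damage: bool = False) -> float:
    """Percent identity of two equal-length, ungapped aligned sequences.

    With ignore_damage=True, ref C -> read T and ref G -> read A are not
    counted as mismatches (the typical aDNA deamination pattern).
    """
    if len(ref) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    damage_pairs = {("C", "T"), ("G", "A")}
    matches = 0
    for r, q in zip(ref.upper(), read.upper()):
        if r == q or (ignore_damage and (r, q) in damage_pairs):
            matches += 1
    return 100 * matches / len(ref)


# Toy example: the read differs from the reference by one C>T and one G>A.
ref = "ACGTCCGGAT"
read = "ATGTCCGAAT"
print(percent_identity(ref, read))                      # every mismatch counted
print(percent_identity(ref, read, ignore_damage=True))  # damage tolerated
```

The gap between the two numbers is roughly the "room" you are trying to leave for damage; anything you allow beyond that is room for genuinely different organisms.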

SO! Without any examples or input/output, I of course cannot do anything with the information you provided. I don’t even know what version of MALT you are working with, whether your reads map to multiple organisms, what your species of interest are, etc… so you need to provide much more detail for people to help you here…

But in general I would say 85% is way too low. If you want confident calls, stick with 95%. This allows for 1 mismatch in a 30bp read and up to 5 in a 100bp read; longer reads provide more confidence, so it makes sense to allow mismatches mainly in the longer reads and hardly any in the shorter ones. If your sample is UDG treated, this should be enough. If not, you can maybe go to 90-93%, but be warned that this will just increase the background noise…
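The mismatch numbers above are just arithmetic, and a small sketch makes it easy to check other thresholds. It assumes identity is simply matches divided by read length with no gaps, which is a simplification of what an aligner reports; `max_mismatches` is a hypothetical helper, not a MALT option.

```python
def max_mismatches(read_length: int, min_identity_pct: int) -> int:
    """Largest number of mismatches that still meets the identity threshold.

    Integer arithmetic is used on purpose to avoid floating-point rounding
    at exact boundaries (e.g. 30 bp at 90%).
    """
    return read_length * (100 - min_identity_pct) // 100


for pct in (95, 93, 90, 85):
    allowed = {length: max_mismatches(length, pct) for length in (30, 50, 100)}
    print(f"{pct}% identity -> allowed mismatches by read length: {allowed}")
```

At 85% even a 30bp read may carry 4 mismatches, which is a lot of freedom for a sequence that short.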

Secondly, if you took a tiny database as your reference, you are going to have false positives… the power of MALT lies in comparative taxonomic identification, i.e. if a read matches 20 organisms, the LCA algorithm can assign the read to a node higher up in the taxonomic tree, essentially cleaning up your output. So if you only have 5 genomes in there, you will get matches to repeats, repetitive motifs, transposons, integrases, insertion sequences, and common elements shared among all organisms like 16S and 18S (remember, 85% will allow for several mismatches, so in these ribosomal regions that can be the difference between a hamster and a whale maybe).
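For anyone unfamiliar with the LCA idea, here is a naive sketch of it on a made-up, heavily simplified toy taxonomy. This is not MALT's actual implementation (which also applies score and support filters); the `PARENT` table and function names are assumptions for illustration only.

```python
# Toy child -> parent table (hypothetical, heavily simplified taxonomy).
PARENT = {
    "Yersinia pestis": "Yersinia",
    "Yersinia pseudotuberculosis": "Yersinia",
    "Yersinia": "Enterobacterales",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacterales",
    "Enterobacterales": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon: str) -> list[str]:
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa: list[str]) -> str:
    """Deepest node shared by the lineages of all taxa a read matched."""
    lineages = [lineage(t) for t in taxa]
    shared = set(lineages[0]).intersection(*lineages[1:])
    # Walk the first lineage from leaf to root; the first shared node is the LCA.
    for node in lineages[0]:
        if node in shared:
            return node
    return "root"


# A read hitting only one species stays at species level ...
print(lowest_common_ancestor(["Yersinia pestis"]))
# ... but a read hitting several taxa is pushed up the tree.
print(lowest_common_ancestor(["Yersinia pestis", "Yersinia pseudotuberculosis"]))
print(lowest_common_ancestor(["Yersinia pestis", "Escherichia coli"]))
```

With only a handful of genomes in the database, the conserved read in the last case would have nothing else to hit and would sit confidently on a single species, which is exactly the false positive problem.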

Lastly, the reference could have issues… the famous case was, I think, the Cyprinus carpio genome… everyone gets carp in their data, so people assumed carp was just everywhere…
But it turns out the carp genome was sequenced without adapter removal, so some generic Illumina adapters were assembled along with the genome. If you didn’t do adapter trimming beforehand, these adapter sequences in your reads might match to that part of the genome… Other genomes have similar issues: Ovis canadensis, I think it was, has a partial Pseudomonas contig in its genome… so if you have Pseudomonas in your sample (a common bacterium) you might get hits to this bighorn sheep, but it’s just an informatics artefact 🙁
