Whole genome species clustering

I don’t think there is a single “best” approach in your situation, given that:

1) the taxonomy of the group is probably still unsettled.

2) being draft genomes, the assemblies are probably fragmented, with sizeable gaps and possibly even contaminants. Repeat regions are often left out of such assemblies.

Actually, a good workflow would combine several types of analysis, in order to build a more complete and robust picture of the genus. Some suggestions:

ReferenceSeeker will give you the closest genomes, in terms of k-mer (MinHash) distances and average nucleotide identity (ANI). With it you can easily find out whether there are any really close genomes.
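To make the k-mer (MinHash) distance less abstract, here is a minimal pure-Python sketch of the idea. It is not ReferenceSeeker's or Mash's actual code: the function names, the MD5 hash, the tiny k and sketch size, and the forward-strand-only k-mers are all simplifying assumptions of mine. The distance formula, D = -ln(2j / (1 + j)) / k, is the one described for Mash.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of k-mers in seq (forward strand only; real tools use canonical k-mers)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, size=50):
    """Bottom-s MinHash sketch: the `size` smallest hash values over all k-mers."""
    hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=50):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash-style distance in [0, 1]."""
    if j <= 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


def distance(seq_a, seq_b, k=5, size=50):
    """MinHash distance between two sequences (toy parameters, assumed here)."""
    j = jaccard_estimate(sketch(seq_a, k, size), sketch(seq_b, k, size), size)
    return mash_distance(j, k)
```

Calling `distance(genome_a, genome_b)` on two sequence strings returns a value between 0 (identical k-mer content) and 1 (no shared k-mers). For real genomes you would want a much larger k and sketch size than these toy defaults.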

Mashtree uses the same k-mer (MinHash) distances to group genomes in a dendrogram. The authors don’t consider this dendrogram a phylogeny, but I think it probably reflects the phylogenetic history of the genomes anyway. Thus, it may complement a more traditional core genome phylogeny. Whole genome alignments probably aren’t a good option unless all genomes are really close (and, I would argue, also of really high quality). In that sense, Mashtree can serve as a substitute for whole genome alignments.
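To illustrate how pairwise distances become a dendrogram, here is a small sketch. As far as I know, Mashtree itself builds its tree with neighbor-joining; the code below deliberately swaps in the simpler average-linkage (UPGMA) clustering, so treat it as an illustration of the principle, not of Mashtree's algorithm. The function name, the input layout, and the genome labels are hypothetical.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering of a distance matrix.

    names: list of genome labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of labels.
    Returns the tree topology as a Newick string (no branch lengths).
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of genomes
    d = dict(dist)                          # working copy, extended as we merge

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)

        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean of the two old distances.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)

        clusters[merged] = size_a + size_b

    return next(iter(clusters)) + ";"


# Made-up distances for four hypothetical genomes.
example_names = ["A", "B", "C", "D"]
example_dist = {
    frozenset(("A", "B")): 0.01,
    frozenset(("C", "D")): 0.02,
    frozenset(("A", "C")): 0.20,
    frozenset(("A", "D")): 0.20,
    frozenset(("B", "C")): 0.20,
    frozenset(("B", "D")): 0.20,
}
example_tree = upgma(example_names, example_dist)
```

With these made-up distances, A groups with B and C groups with D before the two pairs are joined. UPGMA assumes a roughly constant rate across lineages, which is one more reason to read such a dendrogram as a grouping and not as a phylogeny.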

A core genome phylogeny would complement the above analyses (but I don’t think Roary works for fungal genomes). As there aren’t many genomes available, you can build a SNP-based phylogeny, or even a maximum-likelihood or Bayesian phylogeny from all genes concatenated. Running the same version of BUSCO on all genomes would also give a good dataset for a phylogeny, in addition to informing you about the overall quality of the genomes.
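Concatenating the genes can be sketched as follows. This assumes you have already aligned each gene separately (for example, each single-copy BUSCO gene) and loaded every alignment as a dictionary of genome name to aligned sequence. The function name, the data layout, and the `gene1`, `gene2`, ... partition labels are my own illustrative choices, not the output format of any particular tool.

```python
def concatenate(alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts, each mapping genome name -> aligned sequence
                (all sequences within one gene must have the same length).
    taxa:       list of all genome names to include.
    Returns (supermatrix, partitions), where supermatrix maps genome name ->
    concatenated sequence and partitions lists (label, start, end) per gene,
    using 1-based inclusive coordinates.
    """
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1

    for i, aln in enumerate(alignments, 1):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {i}: sequences are not aligned to one length")
        length = lengths.pop()

        for t in taxa:
            # A genome missing this gene is padded with gaps.
            pieces[t].append(aln.get(t, "-" * length))

        partitions.append((f"gene{i}", start, start + length - 1))
        start += length

    supermatrix = {t: "".join(parts) for t, parts in pieces.items()}
    return supermatrix, partitions
```

The partition coordinates are worth keeping, because many maximum-likelihood and Bayesian programs can fit a separate substitution model per gene. Genomes that lack a gene get gap columns for it, so it is also worth deciding how much missing data you are willing to tolerate before building the tree.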
