k-mer counters – presence/absence matrix

Hi lizabe,

You’re right that this tutorial is out of date: --matrix is no longer a valid option to jellyfish count. However, I don’t think its original intent was to do what you want anyway. It doesn’t write out a binary presence/absence matrix. Rather, it specifies the binary matrix used to generate the universal hash function that hashes the k-mers. Jellyfish relies on a universal hash function, which can be generated from a random binary matrix. If you want to use the exact same hash function for other purposes, you need to know what that matrix is.
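To make the hashing idea concrete, here is a minimal Python sketch of a hash defined by a random binary matrix. It is an illustration of the general technique (multiplying the key by a 0/1 matrix over GF(2)), not Jellyfish's actual code; the 2-bit base encoding and the names encode_kmer, random_binary_matrix and matrix_hash are my own assumptions.

```python
import random

# Assumed 2-bit encoding of bases (illustrative, not necessarily Jellyfish's).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def random_binary_matrix(rows, cols, seed):
    """Return `rows` random rows, each stored as a `cols`-bit integer."""
    rng = random.Random(seed)
    return [rng.getrandbits(cols) for _ in range(rows)]


def matrix_hash(matrix, key):
    """Multiply the matrix by the key's bit vector over GF(2).

    Each output bit is the parity of (row AND key).
    """
    h = 0
    for row in matrix:
        bit = bin(row & key).count("1") & 1
        h = (h << 1) | bit
    return h


# Example: hash 4-mers (8 bits) down to 5 bits.
matrix = random_binary_matrix(rows=5, cols=8, seed=42)
slot = matrix_hash(matrix, encode_kmer("ACGT"))
```

The point is that the matrix fully determines the hash: anyone holding the same matrix (here, the same seed) computes the same value for the same k-mer, which is why you would need to know the matrix to reproduce the hash function elsewhere.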

Anyway, to achieve what you want, I’m afraid you’ll need to take a different approach. Essentially, you want to count k-mers in a collection of different FASTA files / genomes and then determine which k-mers are present in each. With jellyfish, you could do this by running jellyfish count separately on each input genome, using the dump command to get each genome’s k-mer list in plain text, and then merging across the files to build the matrix. Alternatively, you could use a tool like mantis (disclosure: I’m a senior author of this method) or metagraph, which are designed explicitly to answer k-mer presence/absence queries over a large collection of k-mers coming from different sources (among other things).
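The merge step of the jellyfish route could look roughly like the Python sketch below. It is not part of jellyfish. It assumes each genome was dumped as two-column text (one "KMER COUNT" pair per line, as produced by jellyfish dump -c), and the names read_dump and presence_matrix are hypothetical helpers. It keeps every k-mer in memory, so it is only realistic for small genomes or short k.

```python
from typing import Dict, Iterable, List, Set, Tuple


def read_dump(lines: Iterable[str]) -> Set[str]:
    """Collect the k-mers from 'KMER COUNT' lines; counts are ignored."""
    kmers = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            kmers.add(fields[0])
    return kmers


def presence_matrix(
    samples: Dict[str, Iterable[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a k-mer by sample 0/1 matrix.

    `samples` maps a sample name to an iterable of dump lines
    (an open file handle works). Returns (sample names, sorted
    k-mers, rows), with one row per k-mer and one column per sample.
    """
    names = list(samples)
    sets = [read_dump(samples[name]) for name in names]
    all_kmers = sorted(set().union(*sets))
    rows = [[int(kmer in s) for s in sets] for kmer in all_kmers]
    return names, all_kmers, rows


# Tiny in-memory example standing in for two dump files.
names, kmers, rows = presence_matrix({
    "genomeA": ["ACG 3", "CGT 1"],
    "genomeB": ["CGT 2", "TTT 5"],
})
```

With real data you would pass open file handles instead of lists, e.g. presence_matrix({"genomeA": open("genomeA.txt")}), where the file names are placeholders. If strand should not matter, count canonical k-mers (jellyfish count -C) for every genome so the k-mers are comparable across files.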

Perhaps kmer-counter or kmer-boolean would be of use for kmers shorter than 31 characters.

The kmer-counter repo contains a script to demonstrate Python integration for quick filtering/querying.

For longer kmers, a tool like Jellyfish would be appropriate.

