I’m trying to create a piRNA count table from samples enriched with small RNAs (~31 bases) using the piRBase reference. Despite all my readings I haven’t find a simple way to do that.
In the first place I tried to quantifiy piRNAs with a similar strategy as miRNAs by aligning trimmed reads with Bowtie to piRBase (instead of miRBase) but I underestimated the multimapping problem: more than 99% of the aligning reads multimap the reference even when requiring 0 mistmatch (bowtie -v 0). Indeed the drosophila piRNA reference contains >41 million sequences.
Among all the perfect matches I would be interested in prioritizing the reference sequences that have no additionnal bases.
For example the read TGCTTGGACTACATATGGTTGAGGGTTGTA perfectly matches many piRNAs in the piRBase:
The other references/databases have less piRNAs but they still have sequences that are subsequences of others, therefore it doesn’t solve the multi-mapping problem when using aligners like Bowtie. In most papers/pipelines I have seen they map to the genome which is worse in terms of multi-mapping and I’m not trying to discover new piRNAs. I ended up writting a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).
In case someone is interested, the way I did (requires at least 10GB of RAM):
Parse the piRBase FASTA to create a map:<sequence> -> <sequence name> (e.g. with a dict in Python or a list in R)
Parse the sample FASTQ:
if you have UMIs to account for PCR bias, create a map <sequence> -> seen UMIs: the count will be the number of unique UMIs