How to quantify piRNAs ?
Hi
I’m trying to create a piRNA count table from samples enriched with small RNAs (~31 bases) using the piRBase reference. Despite all my readings I haven’t find a simple way to do that.
In the first place I tried to quantifiy piRNAs with a similar strategy as miRNAs by aligning trimmed reads with Bowtie to piRBase (instead of miRBase) but I underestimated the multimapping problem: more than 99% of the aligning reads multimap the reference even when requiring 0 mistmatch (bowtie -v 0
). Indeed the drosophila piRNA reference contains >41 million sequences.
Among all the perfect matches I would be interested in prioritizing the reference sequences that have no additionnal bases.
For example the read TGCTTGGACTACATATGGTTGAGGGTTGTA
perfectly matches many piRNAs in the piRBase:
>piR-dme-179
TGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-183
CTGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-800
TGCTTGGACTACATATGGTTGAGGGTTGTAA
>piR-dme-46776
ATGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-48715
GCTGCTTGGACTACATATGGTTGAGGGTTGTA
...
For that case I would like to count only piR-dme-179
because it has no additionnal bases. Are there tools that could help efficiently do that ? Maybe using a reciprocal matching criterion.
I hope you can help, thank you
• 406 views
Thank you.
The other references/databases have less piRNAs but they still have sequences that are subsequences of others, therefore it doesn’t solve the multi-mapping problem when using aligners like Bowtie. In most papers/pipelines I have seen they map to the genome which is worse in terms of multi-mapping and I’m not trying to discover new piRNAs. I ended up writting a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).
In case someone is interested, the way I did (requires at least 10GB of RAM):
- Parse the piRBase FASTA to create a map:
<sequence> -> <sequence name>
(e.g. with a dict in Python or a list in R) - Parse the sample FASTQ:
- if you have UMIs to account for PCR bias, create a map
<sequence> -> seen UMIs
: the count will be the number of unique UMIs - otherwise map
<sequence> -> number of appearances
- if you have UMIs to account for PCR bias, create a map
- Write the count table from these 2 mappings
Traffic: 2253 users visited in the last hour
Read more here: Source link