Is it possible to isolate and merge overlapping hits from a BLAST search Aligned Fasta file?

Hello all,

Hope you’re well. I started a PhD in Jan this year and frankly I am struggling. I’m coming into the second week of trying to figure out how to attempt this problem and it’s honestly starting to get to me a bit – the current task I’m working on is not my main project and every day I don’t make progress is a delay on my actual project. So! A bit worried to say the least (!)

Here is the issue: I ran some sequences using BLASTn. I downloaded the Aligned Fasta file for each. I have sorted reads by if they are a single read (ie the accession number only appears once) or multiple reads of the same entry (example below, edited for brevity):

SINGLE HIT

>XM_040233432|1:208-654_PREDICTED:_Oryx_dammah_lysozyme_C_kidney_isozyme_(LOC120857726)_mRNA
ATGAAGGTTCTCCTTATTCTGGGGCTTCTCCTCCTTTCAGTCGCTGTCCAAGGCAAGGTCTTTGAGAGATGTGAGCT...

MULTIPLE HIT

>XM_042246932|1:88-371_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
TGAGGGCTGTCCTTATTCTGGGGCTTCTCCTCCTTTCTGTCACAGTCCAGGGCAAGAAATTTCAGAGGTGTGAGCTTGCC..
>XM_042246932|1:388-668_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
GGTTGTGTTTGACCAAATGGGAAAGCAGTTATAACACAAAAACGACAAACTACAATCCTAACAGTGAAAGCACTGATTAT...

This is so I can use the seqkit to combine the multiple hit entries. Once I’ve done that, I’ll be filtering based on query sequence length (using bioawk which I know how to do) and carrying on with the pipeline (alignment, analysis, ect.).

The problem is, some of these multiple hits are overlapping and I have no idea how to treat them! For example:

>XM_023558748|1:**49-362**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
ATGAAGGCTCTCCTTATTCTGGGCCTTCTCTTCCTTTCTGGCACTGCCCAGGGCAAGGTCTTTGAAAGGTGCGAGTTGGC..

>XM_023558748|1:**349-416**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
GGGTGGCATGGAGAAACCATTGTCAGAACCACGATGTCACTCAGTACATTCAAGGCTGTGGAGTCTAA

My two questions:

  1. Is there a way I could isolate these overlapping sequences from the nonoveralpping multiple hits? I just can’t fathom how I can achieve that right now with my current skill set.

  2. Aside from using the EMBOSS merger command, is there another way for me to merge these overlapping files together? I did give it a go and it had wonderfully merged the above Elephant example correctly. But I am going to be dealing with thousands of hits, is it possible to write a python script to run merger through the overlapping files? I realise this is a tall order. I’d try to configure this script to use seqkit to merge the non-overlapping multiple hits too.

Thank you kindly for reading – I have trawled these forums the last few months and frankly I only think I’ve managed to get where I am thanks to your answers. Any advice, suggestions or critiques are gratefully received.

Read more here: Source link