How to find pseudogenes in my genome scaffolds



I recently assembled the genome of a beetle that lives underground. I am interested in finding pseudogenes (genes with premature stop codons or frameshifts).

I ran tblastn using a few genes from a model organism (Tribolium) as queries, but found no premature stop codons in the homologous sequences.
So I am considering running tblastn with all Tribolium genes (about 18,000) and checking whether premature stop codons exist in the homologous sequences. But checking that many hits by hand is obviously unrealistic.

Do you know of any approach or software for finding pseudogenes based on homology to known genes (Tribolium, in my case)?

Thank you!



Your general thinking is not bad. It’s actually good 🙂

But you might be missing some of the fine details. The tblastn approach is an excellent start. However, BLAST is a local aligner, so it does not have to align the full query protein to report a hit. It may well be that the ends of the proteins differ (e.g. mutations accumulating after an in-frame stop codon), so that part is not aligned and not reported by BLAST, and you could miss it when inspecting the results. To overcome this, consider using a more global aligner, something closer to a mapper (perhaps GenomeThreader, GeneWise, …). These might be able to ‘align’ past the in-frame stop codon.
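One way to spot such partially aligned hits is to check how much of each query protein the tblastn HSPs actually cover. Here is a minimal sketch, assuming tblastn was run with a tabular output whose first five columns are qseqid, sseqid, qstart, qend and qlen (for example `-outfmt "6 qseqid sseqid qstart qend qlen"`). The function names and the 0.9 cutoff are just illustrative choices of mine, not part of BLAST.

```python
from collections import defaultdict


def query_coverage(blast_tab_lines):
    """Return {(qseqid, sseqid): fraction of the query covered by HSPs}.

    Assumes tab-separated columns: qseqid sseqid qstart qend qlen
    (any further columns are ignored).
    """
    intervals = defaultdict(list)
    qlens = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        key = (fields[0], fields[1])
        qstart, qend, qlen = int(fields[2]), int(fields[3]), int(fields[4])
        intervals[key].append((min(qstart, qend), max(qstart, qend)))
        qlens[key] = qlen

    coverage = {}
    for key, ivs in intervals.items():
        ivs.sort()
        covered = 0
        cur_start, cur_end = ivs[0]
        # Merge overlapping or adjacent HSP intervals on the query.
        for start, end in ivs[1:]:
            if start <= cur_end + 1:
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        coverage[key] = covered / qlens[key]
    return coverage


def truncated_hits(blast_tab_lines, min_coverage=0.9):
    """List (qseqid, sseqid) pairs whose query coverage is below the cutoff."""
    cov = query_coverage(blast_tab_lines)
    return sorted(key for key, frac in cov.items() if frac < min_coverage)
```

Pairs returned by `truncated_hits` are not pseudogenes by themselves (a fragmented scaffold gives the same picture), but they are the ones worth re-aligning with a more global tool.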

Back in the day (years ago 🙁) I occasionally used a tool called PseudoPipe to detect pseudogenes. In essence it applies something like your approach. I am not sure how well maintained (or even available) it still is, though.

I am not sure if this will serve you well, but you may need a reverse strategy to do this.

Instead of aligning all the pieces one by one, it may be better to use a shotgun-style method to do the work.

Here are the steps.

  • Use an NGS short-read simulator to generate pseudoreads from your contigs/scaffolds.
  • Map them onto the model organism.
  • Call variants and check especially the ones that cause frameshifts, stop gains etc.
  • Check the reads that support each stop gain/frameshift and trace each read back to its source contig.
  • Make a list of possible candidates, then rinse and repeat.
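
The first and fourth steps above can be sketched roughly as follows. This is only a stand-in for a real read simulator: it tiles error-free pseudoreads along each scaffold and encodes the source scaffold and offset in the read name, so a read flagged by the variant caller can be traced back. The function names, read length, step size and the constant quality character are all my own assumptions.

```python
def tile_pseudoreads(scaffolds, read_len=150, step=50):
    """Yield (read_name, sequence) tuples from a {scaffold_name: sequence} dict.

    Read names look like 'scaffold:start-end' (0-based, end exclusive).
    """
    for name, seq in scaffolds.items():
        if len(seq) <= read_len:
            # Scaffold shorter than one read: emit it whole.
            yield f"{name}:0-{len(seq)}", seq
            continue
        last_start = len(seq) - read_len
        starts = list(range(0, last_start + 1, step))
        if starts[-1] != last_start:
            starts.append(last_start)  # make sure the scaffold end is covered
        for s in starts:
            yield f"{name}:{s}-{s + read_len}", seq[s:s + read_len]


def to_fastq(reads, qual_char="I"):
    """Format (name, sequence) pairs as FASTQ text with a constant dummy quality."""
    return "".join(
        f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n" for name, seq in reads
    )


def read_origin(read_name):
    """Recover (scaffold, start, end) from a name made by tile_pseudoreads."""
    scaffold, coords = read_name.rsplit(":", 1)
    start, end = coords.split("-")
    return scaffold, int(start), int(end)
```

The FASTQ text can then go to whatever mapper and variant caller you prefer, and `read_origin` gives you the scaffold region to inspect for each suspicious read.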

I am totally making this up off the top of my head, but it would be one of the first things I would try.

Running BLAT may also help.

I’d agree with Lieven in general. I would try the following approach using TBlastN and exonerate.

  • Use the AA sequences of your gene of interest as templates
  • Use TBlastN to find contigs with hits to your templates.
  • Extract these contigs from the assembly and use only them for the next step (this is only to speed up exonerate)
  • Use exonerate with the protein2genome model (--model protein2genome), let it return the hypothetical cDNA sequence (via the --ryo option), and inspect the generated alignment.
    (Use a recent version of exonerate which has multithreading support; it will still be slow.)
  • Is the cDNA, and its translation into AA, shorter than the template? That might mean a) your scaffold is fragmented, b) your coding sequence is shorter, c) there are frameshifts, or d) you have a lot of N’s in your sequence.
  • Run the cDNA sequence, and also the genomic regions that flank it, through EMBOSS transeq in all six reading frames. That will give you ORFs but also show the stop codons.
  • If you find candidates that way, you might want to align them globally back to the template and compare.
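
If you want to script the stop-codon check rather than eyeball transeq output, the six-frame translation step can be approximated in a few lines. This is a rough sketch, assuming the standard genetic code and plain A/C/G/T/N sequences. The function names are mine, and the frame labels (1..3 forward, -1..-3 on the reverse complement) are my own convention and need not match transeq's numbering.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; codons containing N become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def internal_stops(protein):
    """0-based positions of stop codons, ignoring one terminal stop."""
    body = protein[:-1] if protein.endswith("*") else protein
    return [i for i, aa in enumerate(body) if aa == "*"]
```

For a cDNA returned by exonerate, `internal_stops(translate(cdna))` being non-empty is a hint of a premature stop, and a frame switch partway through `six_frames` of the genomic region is a hint of a frameshift. Either way, the candidate still needs the global comparison against the template.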
