Create de novo repeat library

Tutorial for de-novo repeat library construction

RepeatMasker ships with many repeat libraries. You can query them using:

queryTaxonomyDatabase.pl -h   
queryRepeatDatabase.pl -h

If there is no repeat library available for your species, you may want to create your own.

Prerequisites

  • Plenty of time (the RepeatModeler and TransposonPSI steps can run for days, depending on the resources provided)
  • RepeatModeler (installation can be difficult). Warning: do not use the build1 recipe from bioconda, it does not work properly. With conda, use repeatmodeler-1.0.11 build pl526_2 or later (the previous build is buggy).
  • transposonPSI
  • ProtExcluder
  • blastp, blastx
  • gaas_fasta_removeSeqFromIDlist.pl from GAAS.
  • The fasta genome for which you want to define the repeats
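
Before starting, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, assuming the executables are installed under the names used in this tutorial; `check_tools` is a hypothetical helper name, not part of any of these packages.

```shell
#!/bin/sh
# Report which of the required executables can be found on PATH.
# check_tools is a hypothetical helper; the names passed to it below are
# the ones used in this tutorial and may differ in your installation.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing tools (0 means all were found).
    return "$missing"
}

check_tools BuildDatabase RepeatModeler transposonPSI.pl ProtExcluder.pl \
    makeblastdb blastp blastx gaas_fasta_removeSeqFromIDlist.pl \
    || echo "Install the missing tools before continuing."
```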

1) De-novo – RepeatModeler:

Warning: RepeatModeler uses RepeatMasker for the classification step at the end. Without a complete RepeatMasker installation you will end up with the file consensi.fa instead of consensi.fa.classified. In particular, if you installed RepeatModeler with conda you will get the error "Missing ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!", because this nucleotide library is not included by default. People tend to use RepBase as the database, but it now requires a license. So, if you want the classification step to succeed, add a database to your RepeatMasker installation.

BuildDatabase -name genome -engine ncbi genome.fa
RepeatModeler -database genome -engine ncbi

You can use the -pa option to run in parallel and speed things up a bit. This is the longest step of the workflow.
At the end of this step you should have a file called consensi.fa.classified.
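
To get a quick overview of what was classified, you can count the families per repeat class. This is only a sketch: it assumes the headers follow the `>name#Class/Family` pattern, and it runs on a small mock file with invented sequences rather than on real RepeatModeler output.

```shell
# Mock library standing in for consensi.fa.classified
# (family names, classes and sequences are invented for illustration).
lib=$(mktemp)
cat > "$lib" <<'EOF'
>rnd-1_family-0#LTR/Gypsy
ACGTACGT
>rnd-1_family-1#DNA/hAT
ACGTACGT
>rnd-1_family-2#Unknown
ACGTACGT
>rnd-1_family-3#LTR/Copia
ACGTACGT
EOF

# Take the text after '#', cut it at the first '/' or space to get the
# top-level class, and count how many families fall into each class.
class_counts=$(awk -F'#' '/^>/ { split($2, part, "[ /]"); n[part[1]]++ }
                          END { for (c in n) print c "\t" n[c] }' "$lib" | sort)
printf '%s\n' "$class_counts"
```

On your own data, replace `"$lib"` with the path to consensi.fa.classified. A large share of Unknown families is common for species without close relatives in the reference database.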

2) Filtering repeats:

De-novo identification has a major drawback: repeats are not always derived from ‘junk’ in the genome, but can also be part of actual protein-coding genes. It is therefore recommended to check the repeats against a comprehensive set of ‘real’ proteins from related organisms. If you are unsure which protein data set to use, simply use the one you were going to use for annotation. We call it <proteins.fa> here.

2.1) Mine (retro-)transposon protein homologies

transposonPSI.pl <proteins.fa> prot

You should get a file named <proteins.fa>.TPSI.topHits as output. From this list, you can generate the collection of accession numbers that show similarity to transposons:

awk '{if($0 ~ /^[^\/\/.*]/) print $5}' <proteins.fa>.TPSI.topHits | sort -u > accessions.list
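
To see what this filter does, here is a small sketch on a mock file. The column contents are invented for illustration. The only properties taken from the command above are that comment lines start with `//` and that the accession sits in column 5. The filter is the same one, written with the slashes escaped so that it is portable across awk implementations.

```shell
# Mock topHits file (all field values are placeholders, not real TransposonPSI output).
tpsi=$(mktemp -d)
cat > "$tpsi/proteins.fa.TPSI.topHits" <<'EOF'
// mock comment line
col1 col2 col3 col4 prot_A col6
col1 col2 col3 col4 prot_B col6
// another mock comment line
col1 col2 col3 col4 prot_A col6
EOF

# Skip lines starting with '/', '.' or '*', print column 5,
# then collapse duplicates so that each accession appears once.
awk '{if($0 ~ /^[^\/\/.*]/) print $5}' "$tpsi/proteins.fa.TPSI.topHits" \
    | sort -u > "$tpsi/accessions.list"

cat "$tpsi/accessions.list"
```

Here prot_A has two hits but ends up in accessions.list only once, thanks to `sort -u`.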

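2.2) Remove TEs from the proteome
The fasta_removeSeqFromIDlist.pl script comes from the GAAS repository (it is listed as gaas_fasta_removeSeqFromIDlist.pl in the prerequisites; use whichever name your installation provides).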

fasta_removeSeqFromIDlist.pl -f <proteins.fa> -l accessions.list -o proteins.filtered.fa
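
If you want to check the result, or do not have GAAS at hand, the same kind of filtering can be sketched with plain awk. This is a stand-in and not the GAAS script: it assumes that each ID in the list matches the first whitespace-delimited word of a FASTA header exactly, and it runs on mock data with invented sequences.

```shell
# Mock proteome and accession list (sequences are invented for illustration).
work=$(mktemp -d)
cat > "$work/proteins.fa" <<'EOF'
>prot_A some description
MKV
LLA
>prot_B
MST
>prot_C
MQQ
EOF
printf 'prot_A\n' > "$work/accessions.list"

# Load the IDs to drop, then print only the records whose header ID
# (first word after '>') is not in that list.
awk -v list="$work/accessions.list" '
    BEGIN { while ((getline line < list) > 0) drop[line] = 1 }
    /^>/  { id = substr($1, 2); keep = !(id in drop) }
    keep' "$work/proteins.fa" > "$work/proteins.filtered.fa"

# Sequence counts before and after filtering.
grep -c '^>' "$work/proteins.fa" "$work/proteins.filtered.fa"
```

Whichever tool you use, comparing the number of headers before and after is a cheap sanity check: the difference should not exceed the number of lines in accessions.list.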

2.3) Blast proteome against RepeatModeler library

makeblastdb -in proteins.filtered.fa -dbtype prot
blastx -db proteins.filtered.fa -query consensi.fa.classified -out blastx.out

You can use the -num_threads parameter to speed up the BLAST step.

2.4) Remove hits from RepeatModeler library
Remark: the ProtExcluder manual says "The package was developed using blastx output from ncbi-blast-2.2.28+ but is compatible with ncbi-blast-2.4.0+". I tried BLAST 2.9.0+ and got an "Illegal division by zero" error. I then tried BLAST 2.7.1+ and it works.

ProtExcluder.pl blastx.out consensi.fa.classified

The result should be a filtered repeat library called consensi.fa.classifiednoProtFinal. You can rename it or symlink it to a name of your choice, e.g. myrepeatlib.fa.
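
As a sketch of this last step, the snippet below creates the symlink and compares the number of repeat families before and after filtering. It runs in a temporary directory on mock files with invented content. On real data you would run only the `ln -s` and `grep -c` lines in your working directory.

```shell
# Mock input files standing in for the real libraries (content invented).
demo=$(mktemp -d)
cd "$demo" || exit 1
printf '>fam1#LTR/Gypsy\nACGT\n>fam2#Unknown\nACGT\n' > consensi.fa.classified
printf '>fam1#LTR/Gypsy\nACGT\n' > consensi.fa.classifiednoProtFinal

# Give the filtered library a friendlier name without copying it.
ln -s consensi.fa.classifiednoProtFinal myrepeatlib.fa

# Count families before and after the protein-based filtering.
before=$(grep -c '^>' consensi.fa.classified)
after=$(grep -c '^>' myrepeatlib.fa)
echo "families before filtering: $before, after: $after"

# The library can then be passed to RepeatMasker with its -lib option, e.g.:
# RepeatMasker -lib myrepeatlib.fa genome.fa
```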
