Since you’re blasting with ncbi-NR you’re almost there. By changing the blastx output options to outfmt 6 you’re receiving tabular output in
avenae.out, you can customise the output columns to also receive the species names.
For example, you could run:
blastx -query /home/nkarim/avenae/trinity_even_out_dir/Trinity.300.longest.fasta -db /home/nkarim/blast/db/nr -outfmt "6 qseqid sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore" -evalue 1e-3 -out /home/nkarim/blast/output/avenae.out
Now you have all these extra columns,
scomnames are useful for your end as these contain the scientific name and the common name of your hits.
The laziest way to get the ‘top’ hit of your queries is to keep only the first hit seen, that normally has the lowest e-value or the highest score :
awk '! a[$1]++' avenae.out > avenae_first_hit_only.out
This falls apart when you have several species with equally good hits, you should still check whether the first hits reported are actually the best hits (manually?). From this out file you also get the percentage of genes with at least one BLAST hit:
wc -l avenae_first_hit_only.out
That will report the number of genes with at least one hit. For GO terms, I’m not sure – you could run GO annotation for your own genes, there are a few online tools for that like PANNZER2 or eggnog-mapper where you can upload your proteins.
For Table S1 (Assembly stats), Trinity has scripts for that, see this older thread Where to find the Trinity assembly statistics?
Read more here: Source link