Hi,
I would like to use STAR for bacterial genomes. Is this strongly discouraged, or is there a way to manage the main challenges of mapping reads to prokaryotic genomes? (I know there are tools designed for this purpose, e.g. EdgePro, but I’m a beginner in bioinformatics, and STAR is the only tool I’ve used and it has a very complete manual, so sticking with it would significantly simplify my workflow.)
I obtained the input data from antiSMASH (GBK) and converted it to GFF3 and FASTA formats. Naturally, the third column of the GFF contains no exon features. However, the CDS features correspond to the genes annotated for each sm-BGC (biosynthetic gene clusters that code for specialized metabolites), and these are the only features of interest in my research. I tried to specify this when generating the index with:
STAR --runThreadN 30 --runMode genomeGenerate --genomeDir path_to_genome_directory --genomeFastaFiles path_to_fasta.fa --sjdbGTFfile path_to_GFF.gff --sjdbOverhang 149 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS
But I still got the following error message: Fatal INPUT FILE error, no exon lines in the GTF file (…)
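In case it helps with diagnosing this, here is a minimal Python sketch for checking what the annotation file actually contains. It counts the feature types in column 3 and how many lines of the chosen feature carry the attribute tag named in the command ("Parent"). The function name `summarize_gff` is made up for illustration, and the idea that the error comes from STAR finding no line matching both the feature and the tag is my assumption, not something I have confirmed.

```python
from collections import Counter


def summarize_gff(lines, feature="CDS", tag="Parent"):
    """Count feature types (column 3) and feature lines carrying `tag=`.

    Hypothetical helper: `feature` and `tag` mirror the values passed to
    --sjdbGTFfeatureExon and --sjdbGTFtagExonParentTranscript above.
    """
    feature_counts = Counter()
    with_tag = 0
    for line in lines:
        # Skip comments/directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete 9-column GFF record
        feature_counts[cols[2]] += 1
        if cols[2] == feature:
            # Column 9 holds ';'-separated key=value pairs.
            keys = [f.strip().partition("=")[0] for f in cols[8].split(";")]
            if tag in keys:
                with_tag += 1
    return feature_counts, with_tag
```

It could be run on the file with `summarize_gff(open("path_to_GFF.gff"))`. If the second value comes back as 0, none of the CDS lines have the tag that the command tells STAR to look for.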
I also realised that the ninth column of the GFF does not contain “transcript_id” or “gene_id”, which is mentioned as a requirement in the thread of Issue #84 on the tool’s GitHub. How can I overcome this? Is it enough to replace “ID” with “transcript_id”? Here is an example line:
ISL001_ctg1 GenBank CDS 8645 8890 . - 1 ID=ctg1_124; nRPS_PKS=Domain: PP-binding (3-72). E-value: 5.9e-16. Score: 50.6. Matches aSDomain: nrpspksdomains_ctg1_124_PP-binding.1, type: other; Name=ctg1_124; gene_functions=biosynthetic-additional (rule-based-clusters) PP-binding; gene_kind=biosynthetic-additional; sec_met_domain=PP-binding (E-value: 3.3e-15%2C bitscore: 50.6%2C seeds: 164%2C tool: rule-based-clusters); transl_table=11; translation=length.81
More troubling, the ID gives the name of the gene but not the genome it came from: for example, the genes ISL001_ctg1_124 and ISL002_ctg1_124 would both have the ID ctg1_124. Does the first column account for this, or does the ninth column need more detail for reads to be mapped properly?
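One workaround I am considering is to rewrite the CDS lines myself, sketched below in Python. It relabels each CDS as "exon" and replaces column 9 with GTF-style "gene_id" and "transcript_id" attributes. Each ID is built from column 1 plus the original ID so that it stays unique across genomes. The function names and the ":" separator are my own choices for illustration, and I am assuming the file is tab-separated with an ID attribute on every CDS line.

```python
def cds_to_gtf_exon(gff3_line):
    """Turn one GFF3 CDS line into a GTF-style 'exon' line, else None."""
    if gff3_line.startswith("#") or not gff3_line.strip():
        return None  # comment, directive, or blank line
    cols = gff3_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "CDS":
        return None  # keep only complete CDS records
    # Parse key=value pairs from column 9; split on the first '=' only,
    # because values may themselves contain ':' or spaces.
    attrs = {}
    for field in cols[8].split(";"):
        key, sep, value = field.strip().partition("=")
        if sep:
            attrs[key] = value
    if "ID" not in attrs:
        return None
    # Assumption: prefixing with column 1 (sequence name) makes IDs unique.
    unique_id = f"{cols[0]}:{attrs['ID']}"
    cols[2] = "exon"
    cols[8] = f'gene_id "{unique_id}"; transcript_id "{unique_id}";'
    return "\t".join(cols)


def convert(gff3_path, gtf_path):
    """Write the converted CDS lines of gff3_path to gtf_path."""
    with open(gff3_path) as src, open(gtf_path, "w") as dst:
        for line in src:
            out = cds_to_gtf_exon(line)
            if out is not None:
                dst.write(out + "\n")
```

If this approach is sound, I would pass the converted file to --sjdbGTFfile and drop the --sjdbGTFfeatureExon and --sjdbGTFtagExonParentTranscript options, since "exon" and "transcript_id" are the defaults listed in the manual. I have not verified that STAR accepts the result.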
Cheers,
Constanza