Tag: locus_tag

Renaming fasta files with their headers

Hi, I have around 85 gene sequences in individual fasta files. I’d like to rename each file after the gene name given in the [gene=] field of its header. For each header, I only want what is between the brackets. I’m trying to do this…
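The excerpt is cut off before the poster's own attempt, so the following is only a minimal Python sketch of one way to approach it. It assumes each file holds a single record whose first line carries a `[gene=...]` field; the function names and the filename clean-up rule are illustrative, not taken from the thread.

```python
import re
from pathlib import Path

# Captures whatever sits between "[gene=" and the closing bracket.
GENE_RE = re.compile(r"\[gene=([^\]]+)\]")


def gene_name_from_header(header):
    """Return the value of the [gene=...] field, or None if it is absent."""
    match = GENE_RE.search(header)
    return match.group(1) if match else None


def safe_filename(name):
    """Replace characters that are awkward in file names (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def rename_by_gene(fasta_path):
    """Rename one FASTA file after the gene in its first header line.

    Returns the new path, or the old path if nothing was renamed.
    """
    path = Path(fasta_path)
    with path.open() as handle:
        header = handle.readline()
    gene = gene_name_from_header(header)
    if gene is None:
        return path  # no [gene=] field, leave the file alone
    target = path.with_name(safe_filename(gene) + path.suffix)
    if target.exists():
        return path  # do not overwrite an existing file
    path.rename(target)
    return target
```

Calling `rename_by_gene` on each path from `Path("some_dir").glob("*.fasta")` would cover a whole directory; the directory name here is a placeholder.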


Help with htseq-count read counts

Hello, I am doing a transcriptome analysis on Pseudomonas putida and I have been trying to do a read count using htseq-count. The program always gives an error. I have tried different genome references (fna) and annotation files (gtf and gff), but it does not work. The mapping works…


Phage genome submission in NCBI GenBank

I annotated a bacteriophage genome using Prokka against the PHROGS database, and used table2asn to create a .sqn file. But it gave error messages like “SEQ_FEAT.GeneXrefWithoutGene, SEQ_FEAT.BadEcNumberFormat, SEQ_FEAT.BadProteinName” Error: valid [SEQ_FEAT.GeneXrefWithoutGene] Feature has gene locus_tag cross-reference but no equivalent gene feature exists FEATURE: tRNA:…


How do you add an ORF that overlaps the two regions where a circular genome is cut in Genbank?

Dear Biostars community, I had a question regarding the BankIt (GenBank) submission for circularized genomes. Let’s say I have a circularized genome from 1 to 100000 bp. And I also…


Find data-based Gene_IDs for unknown gene_IDs in gtf.file

Hi all, Following the RNA-seq analysis workflow, I am trying to find the GO (gene ontology) terms for a list of DEGs output by featureCounts > edgeR. I conducted the RNA-seq analysis using either RAST-annotated gtf or NCBI-PGAP gtf files. 1…


GFF/GTF file error / featureCounts

Hi all, I am trying to generate a count matrix for sorted bam files using featureCounts on Linux. I have a non-model organism (bacteria), so I generated the annotation file using both PROKKA and RAST. I used all of the following files in featureCounts: PROKKA.gff, RAST.gff, RAST.gtf, and a gffread-converted PROKKA.gtf file. But still facing…


Extract genes within a genome using eUtils

Hi, I have a bunch of EC numbers based on which I would like to download the corresponding genes from several species. An example of a search would be “Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]” and if I use: esearch -db nuccore -query "Bacillus[ORGN]…


bash command to process a line

Hi, I have a weird .txt file with this line. lcl|CU459141.1_prot_CAM87240.1_2248 - TniQ PF06527.14 0.018 13.6 0.0 0.024 13.2 0.0 1.1 1 0 0 1 1 1 0 [locus_tag=ABAYE2390] [db_xref=EnsemblGenomes-Gn:ABAYE2390 I need to process the line into 2 columns like the following: CU459141.1 CAM87240.1…
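The question asks for bash, and the desired output is truncated in the excerpt, so this is only a hedged Python sketch. It assumes the two wanted columns are the nucleotide accession and the protein accession embedded in the first whitespace-separated field (`lcl|<nucleotide>_prot_<protein>_<index>`); the function name `split_ids` is illustrative.

```python
import re

# First field looks like: lcl|<nucleotide acc>_prot_<protein acc>_<index>
# The nucleotide part is matched lazily up to the first "_prot_"; the
# protein part is matched greedily so only the trailing "_<index>" is cut.
FIRST_FIELD_RE = re.compile(r"^lcl\|(.+?)_prot_(.+)_\d+$")


def split_ids(line):
    """Return (nucleotide_accession, protein_accession) or None."""
    fields = line.split()
    if not fields:
        return None
    match = FIRST_FIELD_RE.match(fields[0])
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Joining the returned pair with a tab gives the two-column output; lines that do not follow the assumed pattern return `None` and can be skipped or logged.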


Error parsing strand (?) from GFF line

I am trying to assemble RNA transcripts using stringtie and am facing the following error. Error parsing strand (?) from GFF line: NC_037304.1 RefSeq gene 58315 59481 . ? . ID=gene-DA397_mgp34;Dbxref=GeneID:36335702;Name=nad1;exception=trans-splicing;gbkey=Gene;gene=nad1;gene_biotype=protein_coding;locus_tag=DA397_mgp34;part=2 My command is: stringtie -p 8 -G Genome/arab_thaliana.gtf -o Assemble/NR1.gtf –l…
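The error message points at a feature whose strand column is `?`. As a hedged diagnostic sketch (not a fix proposed in the thread), the Python below separates such lines from the rest of a tab-delimited GFF so they can be inspected. The function name is illustrative. Note that child features whose `Parent` is a flagged gene are not followed, so the result would need manual review before being used as an annotation.

```python
def split_by_strand(gff_lines):
    """Split GFF lines into (kept, flagged).

    A feature line is flagged when column 7 (strand) is anything other
    than "+", "-" or ".". Comment lines and malformed lines are kept.
    """
    kept, flagged = [], []
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        is_feature = not line.startswith("#") and len(columns) >= 9
        if is_feature and columns[6] not in ("+", "-", "."):
            flagged.append(line)
        else:
            kept.append(line)
    return kept, flagged
```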


How to extract/find the actual names of the gene_IDs if they are not fully presented in gtf.file, and link them to the Count.matrix

Hi all, I checked the gtf file for my reference genome (bacteria, downloaded from NCBI), and it looks like some gene names are missing but gene_IDs,…


Are there any tools that can create a very basic GTF file from contig sequences (no annotations really needed) ?

If anyone still needs help with this, you can use a SAF file as an option with featureCounts. Here’s a script from my VEBA suite github.com/jolespin/veba/blob/main/src/scripts/fasta_to_saf.py Can easily adapt to not require soothsayer_utils below. #!/usr/bin/env python from __future__ import print_function, division import sys, os, argparse import pandas as pd from…
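The linked VEBA script is truncated in the excerpt, so the following is not that script. It is a minimal sketch of the same idea, assuming the five-column SAF layout that featureCounts reads (GeneID, Chr, Start, End, Strand) and treating each contig as one feature spanning its full length on the `+` strand. The function names are illustrative.

```python
def contig_lengths(fasta_lines):
    """Return {contig_id: length} in file order from FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header as the contig identifier.
            name = parts[0] if parts else None
            if name is not None:
                lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def saf_lines(fasta_lines):
    """Build tab-delimited SAF lines, one whole-contig feature each."""
    rows = ["GeneID\tChr\tStart\tEnd\tStrand"]
    for name, length in contig_lengths(fasta_lines).items():
        rows.append(f"{name}\t{name}\t1\t{length}\t+")
    return rows
```

Writing the returned lines to a file and passing it to featureCounts with the SAF annotation format selected would then count reads per contig rather than per gene.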


How to parse a gene’s location using biopython

Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez Entrez.email="my@email.com" # example is E. coli K-12 reference sequence handle =…


Parsing gene location using biopython

Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez Entrez.email="my@email.com" # example is E. coli K-12 reference sequence handle = Entrez.efetch(db="nuccore", id='NC_000913', rettype="fasta_cds_na",…


PROKKA.gff file is not compatible with featureCounts

Hi all, I am trying to count the number of reads that map to each gene using featureCounts (RNA-Seq PE, Linux). My input: a GFF file generated using Prokka, a GTF file generated by NCBI annotation, and sorted bam files generated by bowtie2 and samtools. When I used the gtf file generated by NCBI, featureCounts ran without…


genbank parsing using perl

I have a GenBank file and want to retrieve relevant information. The problem is that when the code fetches data related to product, it picks up only one line of information, since the while loop reads data line by line. #! /usr/local/bin/perl -w open (GB,"$ARGV[0]"); open (AC, ">$ARGV[1]"); print…


gff file from NCBI RefSeq GCF dataset has an invalid format

Thank you for noticing this. It is indeed an issue in the GFF3 file. The root of the problem is it’s a gene that is impossible to correctly represent in GFF3 because it incorporates sequence from both strands via trans_splicing. The complexity of this gene can be seen on the…


“Error parsing strand (?) from GFF line” happening in gffread, stringtie and cufflinks

Hi! I’m working with various genomic data and while trying to use gffread, stringtie and cufflinks I ran into the same error: Error parsing strand (?) from GFF line: NC_037304.1 RefSeq gene 58315 59481 . ?…


How to identify locus_tag by using RefSeq protein info (WP_*)

Hi, I would like to know the locus tag of a protein annotated with RefSeq (WP_*). For example, I would like to identify the genomic location of a protein (WP_073031595.1) and also know its adjacent proteins. The GenBank file…


Can a gff2 reference be used in htseq-count?

Dear all, We have recently been working with an E. coli plasmid and tried to summarize the gene counts from our RNA-Seq samples. The short reads were mapped to the E. coli plasmid using tophat, which generated bam files accordingly. However, we were unable to obtain a gff3 version of our target plasmid genome, the…


Parsing GenBank file: get locus tag vs product

As your sample GenBank file was incomplete, I went online to find a sample file that could be used in an example, and I found this file. Using this code and the Bio::GenBankParser module, it was parsed guessing what parts of the structure you were after. In this case, “features”…


The meaning of greater than character (>) in gene position in GenBank files

Hello. This character caused some issues when I used the contents of GenBank files. Here is an example of ‘>’ usage in a GenBank file: gene 957467..>957886 /locus_tag="BME_RS04610" /old_locus_tag="BMEI0926" I couldn’t find what ‘>’ signifies. Does anyone know?
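In the INSDC feature table format used by GenBank, `<` before a start coordinate and `>` before an end coordinate mark a feature as partial: its true boundary lies beyond the stated position. The Python sketch below shows one way such a simple `start..end` location could be parsed while keeping that information. The function name and returned keys are illustrative, and `complement(...)` and `join(...)` locations are deliberately out of scope.

```python
import re

# Simple "start..end" location with optional "<" / ">" partial markers.
SIMPLE_LOCATION_RE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_simple_location(text):
    """Parse e.g. '957467..>957886' into coordinates and partial flags."""
    match = SIMPLE_LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    start_marker, start, end_marker, end = match.groups()
    return {
        "start": int(start),
        "end": int(end),
        "start_partial": start_marker == "<",  # extends before start
        "end_partial": end_marker == ">",      # extends past end
    }
```

Stripping the marker before converting to an integer, as done here, avoids the kind of parsing failure the poster describes while still recording that the gene is incomplete at that end.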


How to extract two genomic location numbers within the following fasta header?

I am wondering how to extract the two numbers within the location tag of the following fasta header. >lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]
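A minimal Python sketch of one way to pull the two coordinates out of such a header follows. It assumes the `[location=...]` field is either a plain `start..end` range or a single `complement(start..end)`, optionally with `<`/`>` partial markers; multi-part `join(...)` locations are not handled and return `None`. The function name is illustrative.

```python
import re

# Matches [location=START..END] and [location=complement(START..END)],
# tolerating "<" / ">" partial markers in front of the coordinates.
HEADER_LOCATION_RE = re.compile(
    r"\[location=(?:complement\()?<?(\d+)\.\.>?(\d+)\)?\]"
)


def location_from_header(header):
    """Return (start, end) as integers, or None if no simple location."""
    match = HEADER_LOCATION_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```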


How to extract genomic upstream region of a protein identified by its NCBI accession number?

I have a list of NCBI protein accession numbers. I would like to extract the upstream genomic region of the corresponding gene’s nucleotide sequence. I will be thankful to you if you can…


does not contain a ‘gene’ attribute

Dear BIOSTAR community, I’m trying to make a count matrix with htseq-count: htseq-count -s yes -t gene -i gene 01.sorted.sam annotation_cattle.gff > 01.txt Even with --idattr=gene, it returns an error: Error processing GFF file (line 1864255 of file annotation_cattle.gff): Feature gene-D1Y31_gp1…
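The error indicates that at least one feature of the selected type lacks the attribute chosen as the ID. As a hedged diagnostic sketch, the Python below lists the line numbers of such features in a GFF3-style file (tab-delimited, `key=value;` attributes in column 9). The function names and defaults are illustrative and not part of htseq-count.

```python
def parse_gff_attributes(column9):
    """Parse a GFF3 attribute column ('k1=v1;k2=v2') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def features_missing_attribute(gff_lines, feature_type="gene", attribute="gene"):
    """Return 1-based line numbers of features lacking the attribute."""
    missing = []
    for number, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != feature_type:
            continue
        if attribute not in parse_gff_attributes(columns[8]):
            missing.append(number)
    return missing
```

If the reported lines all carry some other attribute, such as `locus_tag` or `ID`, choosing that attribute as the ID is one option to consider.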


Download nucleotide sequence with locus_tag

I have a list of locus_tags. My idea was to download them using esearch, but the downloaded file is not the desired gene; instead, the nucleotide sequence of the entire contig is downloaded. In this example my gene of interest to download has 830…
