Tag: locus_tag
Renaming fasta files with their headers
Hi, I have around 85 gene sequences in individual FASTA files. I’d like to rename each file after the gene name given in the [gene=] field of its header. For each header, I only want the text between the brackets. I’m trying to do this…
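One way to approach the renaming described above is sketched below in Python. It assumes that each file holds a single record, that the first line is the header, and that the gene name sits in a `[gene=...]` tag. The function names, the `*.fasta` glob pattern, and the `dry_run` switch are illustrative choices, not something taken from the original post.

```python
import re
from pathlib import Path

# Captures whatever sits between "[gene=" and the closing bracket.
GENE_RE = re.compile(r"\[gene=([^\]]+)\]")


def gene_from_header(header):
    """Return the text inside [gene=...], or None if the tag is absent."""
    match = GENE_RE.search(header)
    return match.group(1) if match else None


def rename_by_gene(directory, pattern="*.fasta", dry_run=True):
    """Rename each FASTA file after the gene named in its first header line.

    Returns the list of (old_name, new_name) pairs that were (or would be) renamed.
    """
    planned = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            header = handle.readline().strip()
        gene = gene_from_header(header)
        if gene is None:
            continue  # leave files without a [gene=...] tag untouched
        target = path.with_name(gene + path.suffix)
        if target.exists():
            continue  # never overwrite an existing file
        planned.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return planned
```

Calling `rename_by_gene("some_dir")` only reports the planned renames; passing `dry_run=False` performs them. Gene names containing characters that are not valid in file names (such as `/`) would need extra handling.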
Help with htseq-count read counts
Hello, I am doing a transcriptome analysis on Pseudomonas putida and I have been trying to do a read count using htseq-count. The program always gives an error. I have tried different genome references (fna) and annotation files (gtf and gff) but it does not work. The mapping works…
Phage genome submission to NCBI GenBank
I annotated a bacteriophage genome using Prokka against the PHROGS database and used table2asn to create a .sqn file, but it gave error messages like “SEQ_FEAT.GeneXrefWithoutGene, SEQ_FEAT.BadEcNumberFormat, SEQ_FEAT.BadProteinName”. Error: valid [SEQ_FEAT.GeneXrefWithoutGene] Feature has gene locus_tag cross-reference but no equivalent gene feature exists FEATURE: tRNA:…
How do you add an ORF that overlaps the two regions where a circular genome is cut in GenBank?
Dear Biostars community, I had a question regarding the BankIt (GenBank) submission for circularized genomes. Let’s say I have a circularized genome from 1 to 100000 bp. And I also…
Find data-based Gene_IDs for unknown gene_IDs in gtf.file
Hi all, Following the RNA-seq analysis workflow, I am trying to find the GO (gene ontology) terms for a list of DGEs output by featureCounts > edgeR. I conducted the RNA-seq analysis using either RAST-annotated GTF or NCBI-PGAP GTF files. 1…
GFF/GTF file error / featureCounts
Hi all, I am trying to generate a count matrix for sorted BAM files, using featureCounts on Linux. I have a non-model organism (bacteria), so I generated the annotation file using both Prokka and RAST. I tried all of the following files in featureCounts: PROKKA.gff, RAST.gff, RAST.gtf, and a gffread-converted PROKKA.gtf file. But still facing…
Extract genes within a genome using eUtils
Hi, I have a bunch of EC numbers, based on which I would like to download the corresponding genes from several species. An example of a search would be “Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]”, and if I use: esearch -db nuccore -query "Bacillus[ORGN]…
Bash command to process a line
Hi, I have a weird .txt file with this line: lcl|CU459141.1_prot_CAM87240.1_2248 - TniQ PF06527.14 0.018 13.6 0.0 0.024 13.2 0.0 1.1 1 0 0 1 1 1 0 [locus_tag=ABAYE2390] [db_xref=EnsemblGenomes-Gn:ABAYE2390 I need to process the line into 2 columns like the following: CU459141.1 CAM87240.1…
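The desired output is cut off in the excerpt, so the following Python sketch only recovers the two identifiers that are visible in it: the nucleotide accession and the protein ID from the leading `lcl|..._prot_..._N` token. The function name and the regular expression are illustrative assumptions about the line layout.

```python
import re

# Leading token looks like: lcl|<nucleotide accession>_prot_<protein id>_<index>
# Lazy groups keep accessions that contain underscores (e.g. NC_000913.3) intact.
ID_RE = re.compile(r"^lcl\|(\S+?)_prot_(\S+?)_\d+(?:\s|$)")


def split_ids(line):
    """Return (nucleotide_accession, protein_id), or None if the line does not match."""
    match = ID_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = "lcl|CU459141.1_prot_CAM87240.1_2248 - TniQ PF06527.14 0.018 [locus_tag=ABAYE2390]"
    ids = split_ids(example)
    if ids is not None:
        print("\t".join(ids))  # prints the two identifiers separated by a tab
```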
Error parsing strand (?) from GFF line
I am trying to assemble RNA transcripts using stringtie and am facing the following error. Error parsing strand (?) from GFF line: NC_037304.1 RefSeq gene 58315 59481 . ? . ID=gene-DA397_mgp34;Dbxref=GeneID:36335702;Name=nad1;exception=trans-splicing;gbkey=Gene;gene=nad1;gene_biotype=protein_coding;locus_tag=DA397_mgp34;part=2 My command is: stringtie -p 8 -G Genome/arab_thaliana.gtf -o Assemble/NR1.gtf -l…
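In the quoted line, the seventh column (strand) holds `?`, a value that GFF3 permits for features whose strand is unknown but that the tool here rejects. As a diagnostic only, a minimal Python sketch for listing every such feature in an annotation might look like this. It assumes tab-separated GFF lines, and the function name is hypothetical. It does not decide what to do with those features.

```python
def features_with_unknown_strand(gff_lines):
    """Return (line_number, seqid, feature_type, start, end) for rows whose strand is '?'."""
    hits = []
    for number, line in enumerate(gff_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) is the strand in GFF/GTF.
        if len(fields) >= 7 and fields[6] == "?":
            hits.append((number, fields[0], fields[2], int(fields[3]), int(fields[4])))
    return hits
```

It can be run over an open file handle, for example `features_with_unknown_strand(open("annotation.gff"))`, where the file name is a placeholder.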
How to extract/find the actual names of the gene_IDs if they are not fully presented in gtf.file, and link them to the Count.matrix
Hi all, I checked the GTF file for my reference genome (bacteria, downloaded from NCBI), and it looks like some gene names are missing but gene_IDs,…
Are there any tools that can create a very basic GTF file from contig sequences (no annotations really needed) ?
If anyone still needs help with this, you can use a SAF file as an option with featureCounts. Here’s a script from my VEBA suite: github.com/jolespin/veba/blob/main/src/scripts/fasta_to_saf.py It can easily be adapted to not require soothsayer_utils below. #!/usr/bin/env python from __future__ import print_function, division import sys, os, argparse import pandas as pd from…
How to parse a gene’s location using biopython
Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez Entrez.email="my@email.com" # example is E. coli K-12 reference sequence handle =…
Parsing gene location using biopython
Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez Entrez.email="my@email.com" # example is E. coli K-12 reference sequence handle = Entrez.efetch(db="nuccore", id='NC_000913', rettype="fasta_cds_na",…
PROKKA.gff file is not compatible with featureCounts
Hi all, I am trying to count the number of reads that map to each gene using featureCounts (RNA-Seq PE, Linux). My input: a GFF file generated using Prokka, a GTF file generated by NCBI annotation, and sorted BAM files generated by bowtie2 and samtools. When I used the GTF file generated by NCBI, featureCounts ran without…
GenBank parsing using Perl
I have a GenBank file and want to retrieve relevant information. The problem is that when the code fetches data related to the product, it only considers one line of information, since the while loop reads data line by line. #! /usr/local/bin/perl -w open (GB,"$ARGV[0]"); open (AC, ">$ARGV[1]"); print…
gff file from NCBI RefSeq GCF dataset has an invalid format
Thank you for noticing this. It is indeed an issue in the GFF3 file. The root of the problem is it’s a gene that is impossible to correctly represent in GFF3 because it incorporates sequence from both strands via trans_splicing. The complexity of this gene can be seen on the…
“Error parsing strand (?) from GFF line” happening in gffread, stringtie and cufflinks
Hi! I’m working with various genomic data and, while trying to use gffread, stringtie and cufflinks, I ran into the same error: Error parsing strand (?) from GFF line: NC_037304.1 RefSeq gene 58315 59481 . ?…
How to identify locus_tag by using RefSeq protein info (WP_*)
Hi, I would like to know the locus tag of a protein annotated with RefSeq (WP_*). For example, I would like to identify the genomic location of a protein (WP_073031595.1) and also know its adjacent proteins. The GenBank file…
Can a GFF2 reference be used in htseq-count?
Dear all, We have recently been working with an E. coli plasmid and tried to summarize the gene counts from our RNA-Seq samples. The short reads were mapped to the E. coli plasmid using TopHat, which generated BAM files accordingly. However, we were unable to obtain a GFF3 version of our target plasmid genome, the…
Parsing GenBank file: get locus tag vs product
As your sample GenBank file was incomplete, I went online to find a sample file that could be used in an example, and I found this file. Using this code and the Bio::GenBankParser module, it was parsed guessing what parts of the structure you were after. In this case, “features”…
The meaning of the greater-than character (>) in gene positions in GenBank files
Hello. This character caused some issues when I used the contents of GenBank files. Here is an example of ‘>’ usage in a GenBank file: gene 957467..>957886 /locus_tag=”BME_RS04610″ /old_locus_tag=”BMEI0926″ I couldn’t find what ‘>’ signifies. Does anyone know?
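For context, in the INSDC feature table format used by GenBank, a `<` before the start or a `>` before the end marks a partial feature: it extends beyond the stated position. A minimal Python sketch for reading such a location while keeping that information could look like the following. It handles only simple `start..end` spans, not `complement(...)` or `join(...)`, and the function name and returned keys are illustrative.

```python
import re

# Simple span with optional partial markers, e.g. "957467..>957886" or "<1..500".
SIMPLE_LOCATION_RE = re.compile(r"^(?P<lt><?)(?P<start>\d+)\.\.(?P<gt>>?)(?P<end>\d+)$")


def parse_simple_location(text):
    """Parse 'start..end', recording whether either end is marked partial."""
    match = SIMPLE_LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError("not a simple start..end location: %r" % text)
    return {
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "start_partial": match.group("lt") == "<",
        "end_partial": match.group("gt") == ">",
    }
```

For the example in the post, `parse_simple_location("957467..>957886")` gives start 957467, end 957886 and `end_partial` set to `True`.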
How to extract two genomic location numbers within the following fasta header?
I am wondering how to extract the two numbers within the location tag of the following FASTA header. >lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]
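A regular expression is probably enough for a header like the one shown. The Python sketch below assumes a single `[location=start..end]` tag, optionally wrapped in `complement(...)` and optionally carrying `<` or `>` partial markers. Joined locations are not handled, and the function name is hypothetical.

```python
import re

# Matches [location=1885267..1887939] and [location=complement(<10..>20)].
HEADER_LOCATION_RE = re.compile(r"\[location=(?:complement\()?<?(\d+)\.\.>?(\d+)\)?\]")


def location_from_header(header):
    """Return (start, end) as integers, or None if no simple location tag is found."""
    match = HEADER_LOCATION_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    header = (">lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] "
              "[protein=copper oxidase] [protein_id=AYW77996.1] "
              "[location=1885267..1887939] [gbkey=CDS]")
    print(location_from_header(header))  # (1885267, 1887939)
```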
How to extract genomic upstream region of a protein identified by its NCBI accession number?
I have a list of NCBI protein accession numbers. I would like to extract the upstream genomic region of the corresponding gene’s nucleotide sequence. I will be thankful to you if you can…
htseq-count returns: does not contain a ‘gene’ attribute
Dear Biostars community, I’m trying to make a count matrix with htseq-count: htseq-count -s yes -t gene -i gene 01.sorted.sam annotation_cattle.gff > 01.txt Even with --idattr=gene, it returns an error: Error processing GFF file (line 1864255 of file annotation_cattle.gff): Feature gene-D1Y31_gp1…
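The error says that a feature of the counted type lacks the attribute passed with `-i`. To see how many features are affected before choosing a different attribute, a small diagnostic along these lines may help. This is a sketch that assumes GFF3-style `key=value;` attributes in column 9. The function name and defaults are illustrative and are not part of htseq-count.

```python
def features_missing_attribute(gff_lines, feature_type="gene", attribute="gene"):
    """Return (line_number, attribute_column) for features lacking the given attribute."""
    missing = []
    for number, line in enumerate(gff_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        # Collect the attribute keys present in column 9 (key=value pairs split on ';').
        keys = {pair.split("=", 1)[0].strip() for pair in fields[8].split(";") if pair.strip()}
        if attribute not in keys:
            missing.append((number, fields[8]))
    return missing
```

If many features lack `gene` but all carry another key such as `ID` or `locus_tag`, that key may be a more reliable choice for the ID attribute. Whether that applies to a particular annotation has to be checked against the file itself.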
Download nucleotide sequence with locus_tag
I have a list of locus_tags. My idea was to download them using esearch, but the downloaded file is not the desired gene; instead, the nucleotide sequence of the entire contig is downloaded. In this example my gene of interest to download has 830…