How can I obtain the DNA sequences of each CDS for several genbank files?
Hello, I want to obtain the DNA sequences of all the CDS from multiple GenBank files in one FASTA file. I tried several solutions with Biopython, but nothing is working for me. I tried, for example…
Calculate GC content for entire chromosome
If you’re comfortable using Python, I’ve created a script that calculates the GC content and GC-skew for each contig, scaffold, or chromosome in a fasta file. This is specifically designed for generating data for a circos plot. To use the script, make sure you have Biopython installed in your conda…
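For readers who want to see what the two quantities look like in code, here is a minimal sketch in plain Python. It is not the script described above: the function name `gc_content_and_skew` is made up for illustration, GC skew is assumed to be (G - C) / (G + C), and ambiguous bases such as N are assumed to be left out of the GC fraction.

```python
def gc_content_and_skew(seq):
    """Return (GC fraction, GC skew) for one sequence string.

    Assumptions for this sketch: only A, C, G and T count towards the
    GC fraction, and GC skew is defined as (G - C) / (G + C).
    """
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    acgt = g + c + seq.count("A") + seq.count("T")
    gc_content = (g + c) / acgt if acgt else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return gc_content, gc_skew
```

With Biopython installed, the same function could be called on `str(record.seq)` for each record returned by `SeqIO.parse`.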
Biopython: Empowering Biologists with Computational Tools | by Everton Gomede, PhD | Nov, 2023
Introduction In the realm of modern biology, data analysis and computational techniques have become indispensable for researchers to extract valuable insights from the vast amount of biological data generated today. Biopython is a powerful and versatile open-source software library designed to meet the computational needs of biologists and bioinformaticians. This…
How to count fasta sequences efficiently using (or not ) biopython
This is not a very memory-friendly way of counting sequences from a multi-FASTA file. Any ideas to improve this? generator = SeqIO.parse("test_fasta.fasta", "fasta") sizes = [len(rec) for rec in SeqIO.parse("test_fasta.fasta", "fasta")] I’m avoiding using tools like grep since…
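One library-free way to count records without holding any sequences in memory is to count header lines while the file is streamed. A minimal sketch, assuming a well-formed FASTA file in which every record starts with a ">" line; `count_fasta_records` is a hypothetical helper name:

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines.

    `lines` can be any iterable of text lines, for example an open
    file handle, so only one line is held in memory at a time.
    """
    return sum(1 for line in lines if line.startswith(">"))
```

Passing an open file handle (`with open("test_fasta.fasta") as handle: count_fasta_records(handle)`) reads one line at a time. Staying within Biopython, `sum(1 for _ in SeqIO.parse("test_fasta.fasta", "fasta"))` also avoids building the full list.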
Convert amino acid sequences into nucleotide sequences
Any packages to validate FASTA file?
I am trying to create a function that can take in a file and check whether it’s a valid FASTA file or not (such as making sure there are no leading tabs or spaces, the first character is ‘>’, no…
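The two checks that are spelled out in the question can be sketched in a few lines of plain Python. This is only an illustration covering those two rules (no leading tabs or spaces, and the first non-empty line starting with ">"); the function name `fasta_problems` and the message wording are made up here, and the checks cut off in the excerpt are not guessed at.

```python
def fasta_problems(lines):
    """Return a list of human-readable problems found in FASTA lines.

    Only two rules are checked in this sketch:
    1. no line may start with a space or a tab;
    2. the first non-empty line must start with '>'.
    """
    problems = []
    first_content_line = True
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # blank lines are ignored in this sketch
        if text[0] in " \t":
            problems.append(f"line {number}: leading whitespace")
        elif first_content_line and text[0] != ">":
            problems.append(f"line {number}: first record does not start with '>'")
        first_content_line = False
    return problems
```

An empty list means neither rule was violated; it does not prove the file is valid in every other respect.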
How to extract protein sequences from a .gff file
Hello everyone! I am a beginner with bioinformatics but at the company I work at we have a genome assembly of one of our crops. I wanted to annotate the genome and to do so I used a piece of python code in ubuntu. I used the Augustus Arabidopsis database…
Introduction to Biopython
Biopython, a powerful bioinformatics library, has become a standard resource for experts in the field. This article gives an introduction to Biopython, covers its installation, and provides examples that demonstrate its use. Even though we’re going into Biopython, remember that it’s only a small part…
Remove sequences with (50% gaps) from MSA
How do I remove sequences from my MSA that contain 50% gaps? I know there are various posts about removing columns with gaps, but I’m looking for a simple script to identify sequences within my MSA that have >50% gaps (“-”) and…
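A minimal sketch of the filtering step, in plain Python. It assumes the alignment has already been loaded into a dictionary mapping sequence names to aligned strings and that "-" is the only gap character; `drop_gappy_sequences` is a hypothetical name.

```python
def drop_gappy_sequences(alignment, max_gap_fraction=0.5):
    """Keep only sequences whose gap fraction is at most max_gap_fraction.

    `alignment` maps sequence names to aligned sequence strings.
    Sequences with more than 50% gaps are dropped by default.
    """
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # an empty row carries no information
        gap_fraction = seq.count("-") / len(seq)
        if gap_fraction <= max_gap_fraction:
            kept[name] = seq
    return kept
```

A sequence with exactly 50% gaps is kept here, since the question asks about sequences with more than 50% gaps; change the comparison if the cutoff should be inclusive.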
FASTQ Phred33 average base quality score
I have a FASTQ dataset where I’m trying to find the average base quality score. I found this old link that helped somewhat. Here is my script (I’m trying to stick to awk, bioawk or python): bioawk -c fastx '{print ">"$name; print…
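For the Python route, a minimal sketch is shown here. It assumes Phred+33 encoding (score = ASCII code minus 33) and unwrapped FASTQ records of exactly four lines each; both function names are hypothetical.

```python
def mean_phred33(quality_string):
    """Mean Phred score of one quality string, assuming Phred+33 encoding."""
    if not quality_string:
        return 0.0
    return sum(ord(ch) - 33 for ch in quality_string) / len(quality_string)


def mean_base_quality(fastq_lines):
    """Mean Phred score over every base in a 4-line-per-record FASTQ stream."""
    total = 0
    bases = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:  # the quality string is the 4th line of each record
            quality = line.rstrip("\r\n")
            total += sum(ord(ch) - 33 for ch in quality)
            bases += len(quality)
    return total / bases if bases else 0.0
```

`mean_base_quality` averages over bases, so long reads weigh more than short ones; averaging `mean_phred33` per read would give a per-read mean instead.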
Alignment error using Biopython
Hello, I am trying to write a program using Biopython that will align some sequences from a FASTA file (five of them in the test I will present) against a FASTA file containing a genome. For each gene-genome alignment, I want the score of…
Finding all genomic coordinates of a short sequence (TCTTCTC) in a reference genome
Hello everyone, here’s my question: I would like to get all the genomic coordinates of a very small sequence (TCTTCTC) in a reference genome. I am aware that this would result in around 200,000 coordinates or so. I tried blastn and an alignment with bowtie/bwa, however the…
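For an exact match this short, a plain string scan may be enough. The sketch below is one possible approach, not something taken from the thread: it reports 0-based start positions of every occurrence, including overlapping ones, on the given strand only. `find_motif_positions` is a hypothetical name, and the reverse strand would need a second call with the reverse complement of the motif.

```python
import re


def find_motif_positions(sequence, motif):
    """Return 0-based start positions of all exact matches of `motif`.

    A zero-width lookahead is used so that overlapping occurrences
    are reported as well.
    """
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    return [match.start() for match in pattern.finditer(sequence)]
```

Run per chromosome, each position `p` corresponds to the half-open interval `[p, p + len(motif))`, which is the convention BED files use.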
how to sort a fasta file
Using Python: This code reads your FASTA file, stores the entries in a dictionary and writes them back to a new FASTA file in a sorted order. It assumes that your FASTA file is formatted properly with each sequence header preceded by a “>”. from Bio import SeqIO # read…
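Since the code in the answer is cut off, here is a library-free sketch of the same idea (read records into a dictionary, then emit them in sorted header order). It is not the answer's Biopython code; `parse_fasta` and `sorted_fasta_text` are hypothetical names, and sorting by header text is an assumption about what "sorted" means.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sorted_fasta_text(lines):
    """Return FASTA text with the records ordered by header."""
    records = dict(parse_fasta(lines))  # header -> sequence
    return "".join(f">{header}\n{records[header]}\n" for header in sorted(records))
```

Because a dictionary is used, records that share an identical header collapse into one entry (the last one wins), so this only suits files with unique headers.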
Solved Need help with ErrorValue in coding (python)
from Bio import SeqIO from Bio.SeqUtils import molecular_weight # Read the FASTA file records = list(SeqIO.parse("your_file.fasta", "fasta")) # Filter sequences based on length filtered_records = [record for record in records if 2000 <= len(record.seq) <= 4500] # Translate and calculate molecular weight for each…
Solved You can use the script from
You can use the script from Practice working with FASTQ files as a starting point for your function. #!/usr/bin/env python3 # """Interleave mate-pair sequences into a single file. Convert from FASTQ *.R1.fastq and *.R2.fastq files to one FASTA *.interleaved.fasta file. """ from Bio import SeqIO leftReads = SeqIO.parse("data/top24_Aip02.R1.fastq",…
How to parse a gene’s location using biopython
Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez # example is E. coli K-12 reference sequence handle =…
Parsing gene location using biopython
Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez # example is E. coli K-12 reference sequence handle = Entrez.efetch(db="nuccore", id='NC_000913', rettype="fasta_cds_na",…
Solved below code is giving error . please assist. To
The code below is giving an error. Please assist. To obtain the human protein sequences in multiple FASTA format, you can use the following script: I have written the code in Python: # Load necessary modules from Bio import SeqIO import gzip # Read in human genome file genome_file="hg38.fa.gz" with,…
(1): download the human genome below is the open
(1): Download the human genome. The site below is an open-source resource for human genome data intended for learning purposes, so there is no issue accessing and downloading the data. Sample data is attached below; anyone can access it, and it is open source with no copyright, hence the sharing. (2): Use the…
Solved Problem Statement: This question asks to write a
Problem Statement: This question asks to write a script to obtain all protein sequences coded in the human genome in the multiple FASTA format, using the RefSeq table obtained from the UCSC Table Browser and the human genome obtained from the given URL. The ID of each sequence should be…
Difference between python and biopython
Biopython vs Python. Hi, please help: if I already have Python 3 on my laptop, do I still need to download Biopython? I already downloaded Python 3; when I checked on the, there is also “download”, are they the same or different? Sorry if it is a naive question. Thank you…
‘SeqRecord’ object has no attribute ‘transcribe’
I am learning how to use Python and I need to get the RNA sequence from the DNA sequences of a multi-FASTA file, but when I try to do it I get the same error. Here is my code: from Bio import SeqIO…
biopython – Parsing a gene bank file and outputting specific feature information to a csv using Bio Python
So I am trying to parse a GenBank file, extract particular feature information, and output that information to a CSV file. The example GenBank file looks like this: SBxxxxxx.LargeContigs.gbk LOCUS scaffold_31 38809 bp DNA UNK 01-JAN-1980 DEFINITION scaffold_31. ACCESSION scaffold_31 VERSION scaffold_31 KEYWORDS . SOURCE . ORGANISM…
How can I print and write the strain/isolate/voucher number of a SeqRecord object in biopython?
The isolate is a qualifier of the source feature that you can access like so: from Bio import SeqIO from pprint import pprint # Read genbank file for rec in SeqIO.parse("", "genbank"): source = rec.features[0] pprint(source.qualifiers) will print: OrderedDict([('organism', ['Amauroderma calcitum']), ('mol_type', ['genomic DNA']), ('isolate', ['FLOR 50931']), ('db_xref', ['taxon:1774182']), ('country',…
Parsing GenBank file: get locus tag vs product
As your sample GenBank file was incomplete, I went online to find a sample file that could be used in an example, and I found this file. Using this code and the Bio::GenBankParser module, it was parsed guessing what parts of the structure you were after. In this case, “features”…
Append assembly accession to nucleotide accession number in RefSeq Genbank file
Hi everyone, When I want to append the filename to the contig header in a multi-FASTA file, I usually use for F in *.fasta; do N=$(basename $F .fasta) ; in=$F out=${N}_mod.fasta prefix=$F addprefix=t ; done However, this…
bioinformatics – how to replace seqIDs in a fasta file with new seqIDs using biopython
Optimize a script that extract features from Fasta file using biopython
Hey, I have a script that extracts features from a large FASTA file (1767 MB) using Biopython. I am submitting it as a bash job on a remote server via SSH. The job has been running for two days now. Is there a way to optimize my script? I think maybe the problem…
Analyzing and slicing FASTQ file entries using Python
I have the code pasted below for running on FASTQ file entries in order to compare specific parts and remove the redundancy of the same sequences (based on the miRNA + umi_seq combination). I save the entry IDs and then make…
Fasta File Python
How do I go about extracting elements from a FASTA file? For example, if I want a list of all the IDs and the length of each sequence in another list, how do I do that in base Python without using any libraries? for line in…
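A minimal base-Python sketch of one way to build the two lists. The function name `ids_and_lengths` is hypothetical, and the ID is assumed to be the first whitespace-delimited word after ">".

```python
def ids_and_lengths(lines):
    """Return (ids, lengths) as two parallel lists for FASTA lines."""
    ids, lengths = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
            lengths.append(0)  # start a new running total for this record
        elif ids and line:
            lengths[-1] += len(line)  # sequences may span several lines
    return ids, lengths
```

Called as `with open(path) as handle: ids, lengths = ids_and_lengths(handle)`, it keeps only the IDs and integer lengths in memory, never the sequences themselves.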
Bioinformatics script using Python/Biopython/Clustalw using stdout to iterate over a directory of proteins
What exactly is the error you are seeing? You shouldn’t set sys.stderr and sys.stdout to string values (the clustalw_cline() function returns the clustal stderr and stdout as strings), as you won’t be able to write anything to stdout from python. I tried to clean up and correct your code below…
Replace sequences between files using Biopython
As you have written it, every time you write a new sequence, you’re overwriting the previous one. Try storing your records in a list and then writing out the list when the loop is completed. to_write = [] for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta"): for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta"): if…
How to print the first few records using SeqIO from Biopython
There are numerous ways to do this. The most similar to your current structure would be to add a break when the index hits 19 (that is the 20th record, since counting starts at 0): from Bio import SeqIO for index, record in enumerate(SeqIO.parse("e_coli_k12_dh10b.faa", "fasta")): print(record.description, len(record.seq)) if index ==…
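Another of those ways, sketched here as an alternative to the `break` approach rather than as the answer's own code, is `itertools.islice`, which stops after a fixed number of items from any iterator. `first_n` is a hypothetical helper name.

```python
from itertools import islice


def first_n(records, n=20):
    """Return the first `n` items of any iterable as a list.

    Nothing beyond the first `n` items is read from the source, so
    this also works on very large files parsed lazily.
    """
    return list(islice(records, n))
```

With Biopython this would be called as `first_n(SeqIO.parse("e_coli_k12_dh10b.faa", "fasta"), 20)`, since `SeqIO.parse` returns an iterator of records.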
SeqIO object get cleared away after being accessed
I’m using Biopython to parse a FASTQ file, and I found that the SeqIO object gets cleared away once I have accessed it. from Bio import SeqIO record_fastqIO = SeqIO.parse('SRR835775_1.first1000.fastq', 'fastq') for record in record_fastqIO: print( This script works perfectly. But if I add one line to the script: from Bio import…
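The excerpt is cut off before the failing line, so the cause cannot be confirmed, but behaviour like this is typical of any Python iterator: `SeqIO.parse` returns one, and an iterator can only be walked once. The sketch below reproduces the effect with a stand-in generator; `fake_parse` and its read names are invented for illustration and are not part of Biopython.

```python
def fake_parse():
    """Stand-in for an iterator-returning parser (hypothetical data)."""
    yield from ["read1", "read2", "read3"]


records = fake_parse()
first_pass = list(records)   # walks the iterator to its end
second_pass = list(records)  # nothing is left, so this is empty

# Materialising the records once gives an object that can be reused.
reusable = list(fake_parse())
reused_twice = (list(reusable), list(reusable))
```

If that is what is happening, either call the parser again for each pass or store `list(SeqIO.parse(...))` once, at the cost of holding every record in memory.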
MultiProcessing on SeqIO biopython
Hello, I would like to parse a wheat genome (13 Gb) quickly, in order to cut each sequence, count the fragment lengths, and store them in a pandas DataFrame. Is it advisable to use multiprocessing with the SeqIO.parse command? Does it save time? Any experiences/recommendations…
Extracting organism and seq from fasta
Hi, I am trying to extract sequences from a FASTA file from a database with a specific organism species keyword from a .txt file containing the relevant headers. Do you know how I can do this in Python, as the Biopython guide I’ve…
Fasta file reading python
Answer by Aidan Golden: I think you can just use Biopython. It is indeed wrong today. I edited the answer since it has been possible to use str(sequence) for a long time now. Very useful answer from 7 years ago! FYI, in the current version of Biopython (1.69), fasta.seq.tostring() is obsolete; use str(fasta.seq) instead. Nicely…
Question : Improve genbank feature addition
I am trying to add more than 70000 new features to a GenBank file using Biopython. I have this code: from Bio import SeqIO from Bio.SeqFeature import SeqFeature, FeatureLocation fi = "myoriginal.gbk" fo = "mynewfile.gbk" for result in…
Biopython: Bio.SeqUtils.molecular_weight for a fasta file
I must write a function that, given a file_name, calculates the molecular weight of only the unambiguous sequences and returns each sequence ID with the corresponding molecular weight. I tried to use Bio.SeqUtils.molecular_weight to calculate the molecular weight, but I couldn’t do it since SeqUtils.molecular_weight works with…
FastTree error while constructing tree
Hey all, I am trying to infer a phylogeny from a multiple sequence alignment using the FastTree program; however, the program gives me an error when I run it on the alignment, and I cannot figure out what the error is saying (it is not really that informative). My…
What is the correct syntax for BioPythons SeqIO.parse()
When reading in an assembly with Biopython’s SeqIO, the tutorial indicates that when reading in multiple records one should do the following: records = list(SeqIO.parse("somefile.fasta", "fasta")) This produces the expected behaviour of a subscriptable list of records. However this syntax also functions…
Replace fasta header using bash : bioinformatics
Hello people, I got stuck with my new script and perhaps you can help me. Its goal is to take an input table with queries and subjects (produced by a local BLAST) and replace query names with subject names in the corresponding FASTA file. In detail, the table input file…
Remote blast query limit
Hello! How many BLAST queries can be processed by remote BLAST calls with Biopython’s Bio.Blast.NCBIWWW.qblast or BLAST+ with the -remote flag? When I go above 1 sequence I get the following message near the top of my XML results file (and no results): internal_error: (Severe Error)…
parsing gbk files (antismash result)
Hello, I used antiSMASH from the CLI and got 700 gbk files (one gbk file per analyzed genome). I used the following script to retrieve the predicted products from the gbk files: from Bio import SeqIO import glob for files in glob.glob("*.gbk"):…
Seqio.Parse Some Error
I am a beginner in the bioinformatics world. I am following an exercise on Biopython but I am stuck here. I am not sure why the print command is not working. Please let me know how to correct this step. > from Bio import SeqIO > for seq_record in SeqIO.parse("…
Linearize fasta files
Program versions used: BBMap v. 38.32; Seqtk v. 1.3-r106; Seqkit v. 0.8.1; Perl v. 5.16.3; Python v. 3.6.6; sed v. 2.2.2 $ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null) real 0m1.050s user 0m0.002s sys 0m1.045s With BBMap: $ time -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0) java -ea -Xmx40g -cp bbmap/current/ jgi.ReformatReads…
Get chromosome sizes from fasta file
Hello, I’m wondering whether there is a program that could calculate chromosome sizes from any FASTA file? The idea is to generate a tab file like the one expected by bedtools genomecov, for example. I know there’s the fetchChromSize program from UCSC, but…
Fastest way to perform BLAST search using a multi-FASTA file against a remote database
I have a multi-FASTA file containing ~125 protein sequences. I need to perform a BLASTP search against the remote nr database. I tried using NcbiblastpCommandline, but the issue is that it only accepts files as input…