Tag: Seqio.Parse

How can I obtain the DNA sequences of each CDS for several genbank files?

Hello, I want to obtain the DNA sequences of all the CDS features from multiple GenBank files in one FASTA file. I tried several solutions with Biopython, but none of them worked for me. I tried, for example…

Calculate GC content for entire chromosome

If you’re comfortable using Python, I’ve created a script that calculates the GC content and GC-skew for each contig, scaffold, or chromosome in a fasta file. This is specifically designed for generating data for a circos plot. To use the script, make sure you have Biopython installed in your conda…
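
As a rough illustration of the calculation this post describes (not the author's script, which is not shown here), the sketch below computes GC content and GC skew per sequence in plain Python. The names read_fasta and gc_stats are hypothetical. GC skew is taken as (G - C) / (G + C), and ignoring bases other than A, C, G and T is an assumption.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_stats(seq):
    """Return (GC fraction, GC skew) for one sequence, ignoring case."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = g + c + at  # assumption: ambiguous bases such as N are excluded
    gc = (g + c) / total if total else 0.0
    skew = (g - c) / (g + c) if (g + c) else 0.0
    return gc, skew
```

An open file handle can be passed to read_fasta, and gc_stats can then be called on each sequence (or on fixed-size windows of it, if per-window values are needed for plotting).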

Biopython: Empowering Biologists with Computational Tools | by Everton Gomede, PhD | Nov, 2023

Introduction In the realm of modern biology, data analysis and computational techniques have become indispensable for researchers to extract valuable insights from the vast amount of biological data generated today. Biopython is a powerful and versatile open-source software library designed to meet the computational needs of biologists and bioinformaticians. This…

How to count fasta sequences efficiently using (or not ) biopython

This is not a very memory-friendly way of counting sequences from a multi-FASTA file. Any ideas to improve this? generator = SeqIO.parse("test_fasta.fasta", "fasta"); sizes = [len(rec) for rec in SeqIO.parse("test_fasta.fasta", "fasta")] I’m avoiding using tools like grep since…
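
One memory-friendly option, sketched here without Biopython, is to stream the file line by line instead of building a list of records. The function names below are hypothetical, and the sketch assumes every record starts with a ">" header line.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines, one line at a time."""
    return sum(1 for line in lines if line.startswith(">"))


def iter_lengths(lines):
    """Yield the length of each sequence without storing the sequences."""
    length = None  # None until the first header has been seen
    for line in lines:
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line.strip())
    if length is not None:
        yield length
```

Both functions accept any iterable of lines, so an open file handle can be passed directly and only one line is held in memory at a time. With Biopython, iterating over SeqIO.parse lazily (for example inside sum(1 for _ in ...)) avoids the intermediate list in the same way.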

Convert amino acid sequences into nucleotide sequences

How to convert amino acid sequences (big fasta files) into nucleotide sequences, any software tool? I’m using a mac. In the fasta files I have the frames too. Here’s a little example:
>abc_frame=-1
SEETQLVPLGWPR*W*PWCLSPSRKTSLDLWHSNTQQCLQAAHSVHLESQFCWKCLSRY*
TCSLMNLCRMYIQ*ISFQSTPVLFLQAV*SNLCSSHQENKRPDR*WSDVDLAAQSQRSAV
STVHPSHMIQLPTAAELQETWFVLNLTCCE
>def_frame=-2
QRKHSWSLWGGRGDGDHGAFPPVVKTPIDSQYWHSNTQQCLQAAHSVHLESQFCWKCLSR
Y*TCSLMNLCSMKLQ*ISFQSTPVLFLQAV*SKL*SSHQENKRPDR*WSDVDLAAQSQRS
AVSTDHPSHMIQLPTAAELQETWFVLNLTCC
>ghi_frame=3
SQHVRFSTNHVSCSSAAVGSWIXCEG*TVDTADLCDCAARSTSDHHLSGLLFSW*LLXXX
DQTACRKRTGVDWNEIYWSFILQRFIKEQVQYRLRHFQQNCDSKWTECAA*RHCCVLLCQ
PTGRGIXGFRLLGKRHTGNSVISHPKGTNCVSS
Additional info: I have a fasta…

Any packages to validate FASTA file?

I am trying to create a function that can take in a file and check whether it’s a valid FASTA file or not (such as making sure there are no leading tabs or spaces, the first character is ‘>’, no…
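
A minimal sketch of such a check in plain Python follows. It covers only the two rules named in the question (no leading tabs or spaces, and a ">" as the first character), plus one extra rule that is purely an assumption: every header must be followed by at least one sequence line. The function name validate_fasta is hypothetical.

```python
def validate_fasta(lines):
    """Return a list of problems found; an empty list means no check failed."""
    problems = []
    pending_header = None  # line number of a header still awaiting sequence
    first = True           # True until the first non-blank line is seen
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line == "":
            continue
        if line[0] in " \t":
            problems.append(f"line {number}: leading whitespace")
            line = line.lstrip(" \t")
            if line == "":
                continue
        if first and not line.startswith(">"):
            problems.append(f"line {number}: file does not start with '>'")
        first = False
        if line.startswith(">"):
            # Assumed rule: two headers in a row means an empty record.
            if pending_header is not None:
                problems.append(f"line {pending_header}: header without sequence")
            pending_header = number
        else:
            pending_header = None
    if first:
        problems.append("no records found")
    elif pending_header is not None:
        problems.append(f"line {pending_header}: header without sequence")
    return problems
```

Returning a list of messages rather than a single boolean makes it easier to report every problem in one pass; bool(validate_fasta(handle)) still gives a simple yes/no answer.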

How to extract protein sequences from a .gff file

Hello everyone! I am a beginner in bioinformatics, but at the company I work for we have a genome assembly of one of our crops. I wanted to annotate the genome, and to do so I used a piece of Python code on Ubuntu. I used the Augustus Arabidopsis database…

Introduction to Biopython

Biopython, a powerful bioinformatics library, has become a standard resource for experts in the field. This article introduces Biopython, covers its installation, and provides examples that demonstrate its use. Even though we’re going into Biopython, remember that it’s only a small part…

Remove sequences with (50% gaps) from MSA

How do I remove sequences from my MSA that contain 50% gaps? I know there are various posts about removing columns with gaps, but I’m looking for a simple script to identify aligned sequences within my MSA that have >50% gaps ("-") and…
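
A minimal sketch of this kind of filter, assuming the alignment has already been read into (name, aligned_sequence) pairs and that "-" is the only gap character. The names gap_fraction and drop_gappy are hypothetical; sequences with more than the given fraction of gaps are dropped, so a sequence with exactly 50% gaps is kept.

```python
def gap_fraction(aligned_seq):
    """Fraction of alignment columns in which this sequence has a gap."""
    return aligned_seq.count("-") / len(aligned_seq) if aligned_seq else 0.0


def drop_gappy(records, max_gap_fraction=0.5):
    """Keep only the records whose gap fraction does not exceed the limit."""
    return [
        (name, seq)
        for name, seq in records
        if gap_fraction(seq) <= max_gap_fraction
    ]
```

Changing the comparison to a strict "<" would also remove sequences that sit exactly on the threshold.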

FASTQ Phred33 average base quality score

I have a FASTQ dataset where I’m trying to find the average base quality score. I found this old link that helped somewhat (www.biostars.org/p/47751/). Here is my script (I’m trying to stick to awk, bioawk or python): bioawk -c fastx '{print ">"$name; print…
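
For the Python route, a minimal sketch of the calculation might look like the following. It assumes unwrapped four-line FASTQ records and Phred+33 encoding, where each quality character maps to ord(character) - 33. The function name mean_base_quality is hypothetical.

```python
def mean_base_quality(lines):
    """Average Phred score over every base in a 4-line-per-record FASTQ."""
    total = 0
    bases = 0
    for index, line in enumerate(lines):
        if index % 4 == 3:  # fourth line of each record holds the qualities
            quals = line.rstrip("\r\n")
            total += sum(ord(ch) - 33 for ch in quals)
            bases += len(quals)
    return total / bases if bases else 0.0
```

This gives the arithmetic mean of the Phred scores across the whole file; a per-read average would instead divide within each record.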

Alignment error using Biopython

Hello, I am trying to write a program using Biopython that will align some sequences from a FASTA file (for the test that I will present, 5 of them) against a FASTA file containing a genome. For each gene-genome alignment I want the score of…

TCTTCTC) in a reference genome

Hello everyone, here’s my question: I would like to get all the genomic coordinates relative to a very small sequence (TCTTCTC) in a reference genome. I am aware that this would result in around 200,000 coordinates or so. I tried with blastn and with an alignment with bowtie/bwa, however the…
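
For an exact motif this short, plain string searching is one possible alternative to an aligner. The sketch below is an illustration only: find_motif is a hypothetical name, positions are 0-based, overlapping matches are reported, and only the forward strand of the sequence passed in is searched.

```python
def find_motif(sequence, motif):
    """Yield 0-based start positions of every (possibly overlapping) match."""
    sequence, motif = sequence.upper(), motif.upper()
    start = sequence.find(motif)
    while start != -1:
        yield start
        # Restart one base further on so overlapping hits are not skipped.
        start = sequence.find(motif, start + 1)
```

Calling it once per chromosome sequence and adding 1 to each start gives 1-based coordinates; searching for the reverse complement of the motif would cover the other strand.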

how to sort a fasta file

Using Python: This code reads your FASTA file, stores the entries in a dictionary and writes them back to a new FASTA file in sorted order. It assumes that your FASTA file is formatted properly with each sequence header preceded by a ">". from Bio import SeqIO # read…

Solved Need help with ErrorValue in coding (python)

Need help with ErrorValue in coding (python) from Bio import SeqIO from Bio.SeqUtils import molecular_weight #Read the fasta file records = list(SeqIO.parse("your_file.fasta", "fasta")) # Filter sequences based on length filtered_records = [record for record in records if 2000 <= len(record.seq) <= 4500] # Translate and calculate molecular weight for each…

Solved You can use the writeInterleaved.py script from

You can use the writeInterleaved.py script from Practice working with FASTQ files as a starting point for your function. #!/usr/bin/env python3 # writeInterleaved.py """Interleave mate-pair sequences into a single file. Convert from FASTQ *.R1.fastq and *.R2.fastq files to one FASTA *.interleaved.fasta file. """ from Bio import SeqIO leftReads = SeqIO.parse("data/top24_Aip02.R1.fastq",…

How to parse a gene’s location using biopython

Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez Entrez.email="my@email.com" # example is E. coli K-12 reference sequence handle =…

Parsing gene location using biopython

Hi, I’m trying to extract gene location information for certain genes across multiple bacteria. Currently I’m using this setup to retrieve the gene information: from Bio import SeqIO, Entrez Entrez.email="my@email.com" # example is E. coli K-12 reference sequence handle = Entrez.efetch(db="nuccore", id='NC_000913', rettype="fasta_cds_na",…

Solved below code is giving error . please assist. To

The code below is giving an error. Please assist. To obtain the human protein sequences in multiple FASTA format, you can use the following script: I have written the code in Python: # Load necessary modules from Bio import SeqIO import gzip # Read in human genome file genome_file="hg38.fa.gz" with gzip.open(genome_file,…

(1): download the human genome below is the open

(1): Download the human genome. Below is the open source site for human genome data; it is intended for learning purposes, so there is no issue accessing and downloading the data. The sample data is linked below; anyone can access it, and it is open source with no copyright, hence sharing: hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz (2): Use the…

Solved Problem Statement: This question asks to write a

Problem Statement: This question asks to write a script to obtain all protein sequences coded in the human genome in the multiple FASTA format, using the RefSeq table obtained from the UCSC Table Browser and the human genome obtained from the given URL. The ID of each sequence should be…

Difference between python and biopython

Biopython vs Python Hi, please help: if I already have Python 3 on my laptop, do I still need to download Biopython? I already downloaded Python 3; when I checked www.biopython.org, there is also a “download”. Are they the same or different? Sorry if it is a naive question. Thank you…

‘SeqRecord’ object has no attribute ‘transcribe’

I am learning how to use Python and I need to get the RNA sequence from the DNA sequences of a multi-FASTA file, but when I try to do it I keep getting the same error. Here is my code: from Bio import SeqIO…

biopython – Parsing a gene bank file and outputting specific feature information to a csv using Bio Python

So I am trying to parse a GenBank file, extract particular feature information and output that information to a CSV file. The example GenBank file looks like this: SBxxxxxx.LargeContigs.gbk LOCUS scaffold_31 38809 bp DNA UNK 01-JAN-1980 DEFINITION scaffold_31. ACCESSION scaffold_31 VERSION scaffold_31 KEYWORDS . SOURCE . ORGANISM…

How can I print and write the strain/isolate/voucher number of a SeqRecord object in Biopython?

The isolate is a qualifier of the source feature that you can access like so: from Bio import SeqIO from pprint import pprint # Read genbank file for rec in SeqIO.parse("genome.gb", "genbank"): source = rec.features[0] pprint(source.qualifiers) will print: OrderedDict([('organism', ['Amauroderma calcitum']), ('mol_type', ['genomic DNA']), ('isolate', ['FLOR 50931']), ('db_xref', ['taxon:1774182']), ('country',…

Parsing GenBank file: get locus tag vs product

As your sample GenBank file was incomplete, I went online to find a sample file that could be used in an example, and I found this file. Using this code and the Bio::GenBankParser module, it was parsed guessing what parts of the structure you were after. In this case, “features”…

Append assembly accession to nucleotide accession number in RefSeq Genbank file

Hi everyone, When I want to append the filename to the contig header in a multi-fasta file, I usually use: for F in *.fasta; do N=$(basename $F .fasta) ; bbrename.sh in=$F out=${N}_mod.fasta prefix=$F addprefix=t ; done However, this…

bioinformatics – how to replace seqIDs in a fasta file with new seqIDs using biopython

I have a fasta file that reads like so: >00009c1cc42953fb4702f6331325c7cc TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGTTGTTAAGTCAGTGGTGAAATCGTGTGGCTCAACCATACGGAGCCATTGAAACTGGCGACCTTGAGTGTAAACGAGGTAGGCGGAATGTGACGTGTAGCGGTGAAATGCTTAGATATGTCACAGAACCCCGATTGCGAAGGCAGCTTACCAGCATACAACTGAC >000118a5e731455e942c61a82a40367a623088d0 AGAGTTTTATCCTGGCTCAGGATGAACGCTAGCGGCAGGCCTAATACATGCAAGTCGGACGGGATCTAAATTTAAGCTTGCTTAAGTTTAGTGAGAGTGGCGCACGGGTGCGTAACGCGTGAGCAACCTACCCATATCAGGGGGATAGCCCGAAGAAATTCGGATTAACACCGCATAACACAGCAATCTCGCATGAGATCACTGTTAAATATTTATAGGATATGGATGGGCTCGCGTGACATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGTCTAGGGGCTCTGAGAGGAGAATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTAAGGATTATTGGTCAATGGAGGGAACTCTGAACCAGCCATGCCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTGTCTGGGAATAAACCTTGATTCGTGAATCAAGCTGAATGTACCAGAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTATAAGTCAGAGGTGAAAGACGGCAGCTTAACTGTCGCAGTGCCTTTGATACTGTATAGCTTGAATATCGTTGAAGATGGCGGAATGAGACAAGTAGCGGTGAAATGCATAGATATGTCTCAGAACTCCGATTGCGAAGGCAGCTGTCTAAGCGGCAATTGACGCTGATGCACGAAAGCGTGGGGATCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGATAACTGGATGTTGGCGATACACAGTCAGCGTCTTAGCGAAAGCGTTAAGTTATCCACCTGGGGAGTACGCCCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAAGTTAGTGAATGCGACAGAGACGTCTCAGTCCTTCGGGACACGAAACTAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATGTTTAGTTGCCAGCATGTAATGATGGGGACTCTAAACAGACTGCCTGCGTAAGCAGCGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGTCCGGGGCTACACACGTGCTACAATGGATGGTACAGCGGGCAGCTACACAGCAATGTGATGCTAATCTCTAAAAGCCATTCACAGTTCGGATAGGGGTCTGCAACTCGACCCCATGAAGTTGGATTCGCTAGTAATCGCGTATCAGCAATGACGCGGT And I want to basically add microbial taxonomy to the seq IDs like so: d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; s__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc d__Bacteria; p__Bacteroidota; c__Bacteroidia; 
o__Sphingobacteriales; f__Sphingobacteriaceae; g__Sphingobacterium; s__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0 Where the original seqID is appended to the taxonomy…

Optimize a script that extract features from Fasta file using biopython

Hey, I have a script that extracts features from a large FASTA file (1767 MB) using Biopython. I am submitting it as a bash job to a remote server via SSH. The job has been running for two days now. Is there a way to optimize my script? I think maybe the problem…

Analyzing and slicing FASTQ file entries using Python

I have the code pasted below for running on FASTQ file entries in order to compare specific parts and remove the redundancy of the same sequences (based on the miRNA + umi_seq combination). I save the entry IDs and then make…

Fasta File Python

How do I go about extracting elements from a FASTA file? For example, if I want a list of all the IDs and the length of each sequence in another list, how do I do that in base Python without using any libraries? for line in…
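
One way this could be done in base Python is sketched below. The function name ids_and_lengths is hypothetical, and taking the ID to be the first word after ">" is an assumption about what counts as an ID.

```python
def ids_and_lengths(lines):
    """Return two parallel lists: record IDs and their sequence lengths."""
    ids, lengths = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Assumption: the ID is the first word of the header line.
            ids.append((line[1:].split() or [""])[0])
            lengths.append(0)
        elif line and lengths:
            # Sequence lines may be wrapped, so add to the running total.
            lengths[-1] += len(line)
    return ids, lengths
```

Because the function only needs an iterable of lines, it can be given an open file handle or a list of strings.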

Bioinformatics script using Python/Biopython/Clustalw using stdout to iterate over a directory of proteins

What exactly is the error you are seeing? You shouldn’t set sys.stderr and sys.stdout to string values (the clustalw_cline() function returns the clustal stderr and stdout as strings), as you won’t be able to write anything to stdout from python. I tried to clean up and correct your code below….

Replace sequences between files using Biopython

As you have written it, every time you write a new sequence, you’re overwriting the previous one. Try storing your records in a list and then writing out the list when the loop is completed. to_write = [] for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta"): for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta"): if seq2.id…

How to print the first few records using SeqIO from Biopython

There are numerous ways to do this. The most similar to your current structure would be to add a break when the index hits 19 (that is the 20th number since counting starts at 0): from Bio import SeqIO for index, record in enumerate(SeqIO.parse("e_coli_k12_dh10b.faa", "fasta")): print(record.description, len(record.seq)) if index ==…

SeqIO object get cleared away after being accessed

I’m using Biopython to parse a FASTQ file, and I found that the SeqIO object gets cleared once I have accessed it. from Bio import SeqIO record_fastqIO = SeqIO.parse('SRR835775_1.first1000.fastq','fastq') for record in record_fastqIO: print(record.id) This script works perfectly. But if I add one line to the script: from Bio import…
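
The behaviour described here is consistent with how single-pass iterators work in Python: SeqIO.parse returns an iterator, so once a loop or a call to list() has consumed it, a second pass yields nothing. The sketch below reproduces the effect with a hypothetical stand-in generator (parse_ids is not a Biopython function) so that it runs without an input file.

```python
def parse_ids(lines):
    """Stand-in for a lazy record parser: yield the ID of each 4-line record."""
    for index, line in enumerate(lines):
        if index % 4 == 0:  # first line of each FASTQ record
            yield line.strip().lstrip("@")


fastq_lines = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "IIII\n"]

records = parse_ids(fastq_lines)
first_pass = list(records)   # consumes the generator
second_pass = list(records)  # nothing left to yield

reusable = list(parse_ids(fastq_lines))  # materialise once, loop many times
```

Converting the parser output to a list, or simply calling the parser again, are the usual ways to iterate more than once; the list costs memory in proportion to the file size.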

MultiProcessing on SeqIO biopython

Hello, I would like to parse a wheat genome (13Gb) quickly, in order to cut each sequence, count the fragment lengths and store them in a pandas dataframe. Is it advisable to use multiprocessing with the SeqIO.parse command? Does it save time? Any experiences/recommendations…

Extracting organism and seq from fasta

Hi, I am trying to extract sequences from a fasta file from a database with a specific organism species keyword from a .txt file containing the relevant headers. Do you know how I can do this in Python, as the Biopython guide I’ve…

Fasta file reading python

Answer by Aidan Golden: I think you can just use Biopython, It is indeed wrong today. I edited the answer since it has been possible to use str(sequence) for a long time now., Very useful answer from 7 years ago! FYI, in current version of biopython (1.69), fasta.seq.tostring() is obsolete, use str(fasta.seq) instead., Nicely…

Question : Improve genbank feature addition

I am trying to add more than 70000 new features to a genbank file using biopython. I have this code: from Bio import SeqIO from Bio.SeqFeature import SeqFeature, FeatureLocation fi = "myoriginal.gbk" fo = "mynewfile.gbk" for result in…

Biopython: Bio.SeqUtils.molecular_weight for a fasta file

I must write a function that, given a file_name, calculates the molecular weight of only the unambiguous sequences and returns each sequence ID with the corresponding molecular weight. I tried to use Bio.SeqUtils.molecular_weight to calculate the molecular weight, but I couldn’t do it since SeqUtils.molecular_weight works with…

FastTree error while constructing tree

Hey All, I am trying to infer a phylogeny from a multiple sequence alignment using the FastTree program; however, the program gives me an error when I run it on the multiple sequence alignment and I cannot figure out what the error is saying (it is not really that informative). My…

What is the correct syntax for BioPythons SeqIO.parse()

When reading in an assembly with BioPython’s SeqIO the tutorial indicated when reading in multiple records one should do the following: records = list(SeqIO.parse("somefile.fasta", "fasta")) This produces the expected behaviour of a subscriptable list of records. However this syntax also functions…

Replace fasta header using bash : bioinformatics

Hello people, I got stuck with my new script and perhaps you can help me. Its goal is to take an input table with queries and subjects (produced by a local BLAST) and replace query names with subject names in the corresponding FASTA file. In detail, the table input file…

Remote blast query limit

Hello! How many blast queries can be processed by remote blast calls with biopython’s Bio.Blast.NCBIWWW.qblast or BLAST+ with the -remote flag? When I go above 1 sequence I get the following message near the top of my XML results file (and no results): internal_error: (Severe Error)…

parsing gbk files (antismash result)

Hello, I used antismash from the CLI and I got 700 gbk files (1 gbk file per analyzed genome). I used the following script to retrieve the predicted products from the gbk files: from Bio import SeqIO import glob for files in glob.glob("*.gbk"):…

Seqio.Parse Some Error

I am a beginner in the bioinformatics world. I am following an exercise on Biopython but I am stuck here. I am not sure why the print command is not working. Please let me know how to correct this step. > from Bio import SeqIO > for seq_record in SeqIO.parse("…

Linearize fasta files

Program versions used: BBMap – v. 38.32, Seqtk – v. 1.3-r106, Seqkit – v. 0.8.1, Perl – v. 5.16.3, Python – v. 3.6.6, sed – v. 2.2.2. $ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null) real 0m1.050s user 0m0.002s sys 0m1.045s With BBMap – reformat.sh: $ time (reformat.sh -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0) java -ea -Xmx40g -cp bbmap/current/ jgi.ReformatReads…
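
For readers unfamiliar with the operation being benchmarked, linearizing means joining the wrapped sequence lines of each record onto a single line. The sketch below shows the idea in plain Python; it is not one of the benchmarked commands, and the function name linearize is hypothetical.

```python
def linearize(lines):
    """Yield FASTA lines with each sequence joined onto a single line."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                # Flush the sequence collected for the previous record.
                yield "".join(chunks)
                chunks = []
            yield line
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)
```

Writing each yielded string followed by a newline produces a two-line-per-record FASTA file. Only one record is held in memory at a time, although for a whole chromosome that record can still be large.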

Get chromosome sizes from fasta file

Hello, I’m wondering whether there is a program that could calculate chromosome sizes from any FASTA file? The idea is to generate a tab file like the one expected in bedtools genomecov for example. I know there’s the fetchChromSize program from UCSC, but…
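
A minimal sketch of one way to build such a table in plain Python is shown below. The names chrom_sizes and format_sizes are hypothetical, and the sketch assumes the chromosome name is the first word of each header and that the wanted output is one "name<TAB>length" line per sequence.

```python
def chrom_sizes(lines):
    """Map each sequence name to its total length, in file order."""
    sizes = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = (line[1:].split() or [""])[0]
            sizes[current] = 0
        elif line and current is not None:
            sizes[current] += len(line)
    return sizes


def format_sizes(sizes):
    """Render the sizes as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes.items())
```

The string returned by format_sizes can be written straight to a file. If two records share the same name, the later one resets the count for that name, so unique headers are assumed.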

Fastest way to perform BLAST search using a multi-FASTA file against a remote database

I have a multi-FASTA file with ~125 protein sequences. I need to perform a BLASTP search against the remote nr database. I tried using NcbiblastpCommandline, but the issue is that it only accepts files as input….
