Tag: FASTA

Ubuntu Manpage: bamfillquery – fill query sequences into BAM files

Provided by: biobambam2_2.0.179+ds-1_amd64 NAME bamfillquery – fill query sequences into BAM files SYNOPSIS bamfillquery [options] <in.bam queries.fasta >out.bam DESCRIPTION bamfillquery reads a SAM/BAM/CRAM file and a FastA file, copies the sequences found in the FastA file into the query sequence field of the SAM/BAM/CRAM file and writes the resulting data…


Ubuntu Manpage: samtools targetcut – cut fosmid regions (for fosmid pool only)

Provided by: samtools_1.13-2_amd64 NAME samtools targetcut – cut fosmid regions (for fosmid pool only) SYNOPSIS samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] in.bam DESCRIPTION This command identifies target regions by examining the continuity of read depth, computes haploid consensus sequences of targets and…


Pooled shRNA Library Screening to Identify Factors that Modulate a Drug Resistance Phenotype

High-throughput RNA interference (RNAi) screening using a pool of lentiviral shRNAs can be a tool to detect therapeutically relevant synthetic lethal targets in malignancies. We provide a pooled shRNA screening approach to investigate the epigenetic effectors in acute myeloid leukemia (AML). The overall goal of the following video is to…


extendedSequences length is not the required for DeepCpf1 (34bp)

Hi, I’m using CRISPRseek dev v. 1.35.2, installed from github (hukai916/CRISPRseek). I wanted to calculate the CFD, and the grna efficacy of a Cas12 sgRNA (my_sgrna.fa file) using Deep Cpf1. my_sgrna.fa, TTTT (PAM) + sgRNA (20bp): >sgrna1 TTTTTGTCTTTAGACTATAAGTGC Command: offTargetAnalysis(inputFilePath = “my_sgrna.fa”, format = “fasta”, header = FALSE, exportAllgRNAs =…


Screen.seqs result varying – Commands in mothur

I have a data set of 2×150 reads of 54 pairs of 16S v4 metagenomic sequences from the NCBI SRA of gastritis patients. When I previously ran the sequences through mothur, screen.seqs after the SILVA alignment removed a sufficient number of sequences. mothur > screen.seqs(fasta=current, count=current, start=2, end=13426) Using Ulcer_Donors\stability.trim.contigs.count_table as input file…


man Bio::SeqIO::fasta (3): fasta sequence input/output stream

Bio::SeqIO::fasta(3) fasta sequence input/output stream SYNOPSIS Do not use this module directly. Use it via the Bio::SeqIO class. DESCRIPTION This object can transform Bio::Seq objects to and from fasta flat file databases. FEEDBACK Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules….


clustalw and muscle in Biopython

First, try installing Biopython 1.63 from here, it may solve some of your problems. Second, make sure you’re using the latest Python from python.org – you might want to run the installer again just to ensure that none of your files are corrupted, if you’re still getting the same error…


How do I find all Sequence Lengths in a FASTA Dataset without using the Biopython

You really don’t need regular expressions for this. header = None length = 0 with open(‘file.fasta’) as fasta: for line in fasta: # Trim newline line = line.rstrip() if line.startswith(‘>’): # If we captured one before, print it now if header is not None: print(header, length) length = 0 header…
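The excerpt above is cut off, so what follows is only a minimal sketch of the same line-by-line idea, not the original answer. The function name fasta_lengths and the in-memory example records are hypothetical, chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (header, sequence_length) for each record in a FASTA stream."""
    header = None
    length = 0
    for line in lines:
        line = line.rstrip()  # trim the newline and trailing whitespace
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, length
            header = line[1:]
            length = 0
        elif header is not None:
            # Sequence lines may be wrapped, so lengths are summed.
            length += len(line)
    # The last record has no following header to trigger the yield.
    if header is not None:
        yield header, length


# Hypothetical in-memory input; a real file works the same way:
# with open("file.fasta") as fasta: lengths = list(fasta_lengths(fasta))
example = [">seq1 demo\n", "ACGT\n", "AC\n", ">seq2\n", "GGG\n"]
lengths = list(fasta_lengths(example))
print(lengths)
```

Because the function takes any iterable of lines, it streams through large files without loading them into memory.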


Standard for aligning smallRNA to a reference human rRNA?

Hi, I need to label some smallRNA sequences that I know are rRNA fragments. I know that for mRNA these are discarded by aligning to the human genome and filtering out multimapped reads, but I need to try to pin…


Create a streamlit download_button to download a fasta file from a local Genbank file – Using Streamlit

Hi streamlit community, I’m building a streamlit app that allows users to upload a full-record GenBank file and explore its content (gene sequences, protein sequences, etc.) using Biopython. Everything works perfectly except when I try to create a st.download_button() to download the whole genome sequence or a sequence…


Detailed differences between sambamba and samtools

3 month , My first post in the new student group , The false-positive mutation appears because duplicates mark Not enough ?, Tells the story of supplementary read It won’t be GATK MarkDuplicates Marked as duplicates The problem of . after , In response to this question , I began…


Reverse complement of fasta file

Reading records separated by > is a nice idea as it gives you the whole chunk at a time. However, here you want to process and merge lines but not the header, thus distinguishing between lines. It is clearer to read line by line. The sequence-line is specific: all caps…
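As a rough illustration of the line-by-line approach described above, the sketch below collects the sequence lines of each record, leaves the header untouched, and reverse-complements the merged sequence. The names reverse_complement and revcomp_fasta are hypothetical, and the complement table assumes plain A/C/G/T/N DNA; other IUPAC ambiguity codes would pass through uncomplemented.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: only A, C, G, T and N (either case) need complementing.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the whole string."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, reverse-complemented sequence) per FASTA record."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, reverse_complement("".join(chunks))
            header = line
            chunks = []
        else:
            chunks.append(line)  # merge wrapped sequence lines
    if header is not None:
        yield header, reverse_complement("".join(chunks))


records = list(revcomp_fasta([">a\n", "AAGT\n", "CC\n", ">b\n", "acgn\n"]))
for header, seq in records:
    print(header)
    print(seq)
```

Merging the wrapped lines before reversing matters: reversing each line separately would give the wrong sequence for multi-line records.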


downloading human rRNA.fasta

I am trying to download the human rRNA.fasta file. Do you know where I can find this file? In one of the older posts in this forum, someone said this file can be found on UCSC, but I did not manage to find it.


BlastX through Biopython

I have an unknown gene segment in the Human_gene.txt file and I want to run blastx (translated nucleotide) using the blast module of Biopython, with an E-value threshold of 0.0001 and displaying the match result of 50 residues of query and subject. I am trying this…


java – Calculating physico-chemical properties of amino acids in Biojava

I need to calculate the number and percentages of polar/non-polar, aliphatic/aromatic/heterocyclic amino acids in this protein sequence that I got from UNIPROT, using BioJava. I have found in the BioJava tutorial how to read the Fasta files and implemented this code. But I have no ideas how to solve this…


Bioinformatics with basic local alignment search tool (BLAST) and fast alignment (FASTA)

Article, 2014. In: Journal of Bioinformatics and Sequence Analysis, ISSN 2141-2464, Volume 6, 1, Pages 1-6, 2014. DOI:10.5897/ijbc2013.0086. Abstract: Following advances in DNA and protein sequencing, the application of computational approaches in analysing biological data has become a very important aspect of biology. Evaluating similarities between biological sequences…


Recent questions tagged fasta – Q&A


FastQ_7 April 2022(1) – Copy.pptx – What is the FASTA format? The FASTA format is the “workhorse” of bioinformatics. It is used to represent sequence

The FASTA format is not “officially” defined – even though it carries the majority of data information on living systems. Its origins go back to a software tool called Fasta written by David Lipman (a scientist that later became, and still is, the director of NCBI) and William R. Pearson of the University of Virginia. The tool itself has (to some…


On a reference pan-genome model (Part II)

12 July 2019 I wrote a blog post on a potential reference pan-genome model. I had more thoughts in my mind. I didn’t write about them because they are immature. Nonetheless, a few readers raised questions related to my immature thoughts, so I decide to add this “Part II” as…


fasta MSA Sequence input/output stream

Bio::AlignIO::fasta(3) fasta MSA Sequence input/output stream SYNOPSIS Do not use this module directly. Use it via the Bio::AlignIO class. DESCRIPTION This object can transform Bio::SimpleAlign objects to and from fasta flat files. This is for the fasta alignment format, not for the FastA sequence analysis program. To process the alignments…


NcbiblastpCommandline alignment results are different from blast webpage

What you are trying to do is fairly simple, and you are complicating it by: 1) not providing your sequences so that someone can reproduce your attempt; 2) giving a result in a form that is impossible to read. Be honest, can you make any sense of the result you…


All vs All blast not self hit? Orthogroup clustering and single copy genome?

Hey guys Self hit I have this actually a bit weird question about blast. I’ve been doing some work around single copy genome construction using Reciprocal best blast hit (RBBH) method. As I have something like 100+ annotated genome, I concatenated all annotated CDS into one fasta and makeblastdb with…


Merge.file do not like CAP letters – mothur bugs

Hello, I ran into this problem while running mothur on a server. mothur > merge.files(input=saraCPERF.trim.contigs.unique.good.good.filter.unique .precluster.denovo.vsearch.pick.fasta-combinedphyto.good.filter.unique.precluste r.denovo.vsearch.fasta, output=combined_saraCPERF.fasta) Unable to open combinedphyto.good.filter.unique.precluster.denovo.vsearch.fasta. Trying mothur’s executable directory combinedphyto.good.filter.unique.precluste r.denovo.vsearch.fasta. Unable to open combinedphyto.good.filter.unique.precluster.denovo.vsearch.fasta. Unable to open ▒!q▒cod.filter.unique.precluster.denovo.vsearch.fasta. Trying mot hur’s executable directory ‘qod.filter.unique.precluster.denovo.vsearch.fasta. Unable to open ‘qod.filter.unique.precluster.denovo.vsearch.fasta. free(): double free detected…


Ubuntu Manpage: Bio::Tools::Seg – parse “seg” output

Provided by: libbio-perl-perl_1.7.2-2_all NAME Bio::Tools::Seg – parse “seg” output SYNOPSIS use Bio::Tools::Seg; my $parser = Bio::Tools::Seg->new(-file => ‘seg.fasta’); while ( my $f = $parser->next_result ) { if ($f->score < 1.5) { print $f->location->to_FTstring, ” is low complexity\n”; } } DESCRIPTION “seg” identifies low-complexity regions on a protein sequence. It is…


LOC125105370 sterile alpha motif domain-containing protein 1-like [Lutra lutra (Eurasian river otter)] – Gene

The following sections contain reference sequences that belong to a specific genome build. This section includes genomic Reference Sequences (RefSeqs) from all assemblies on which this gene is annotated, such as RefSeqs for chromosomes and scaffolds (contigs) from both reference and alternate assemblies. Model RNAs and proteins are also…


Qiime2 Exclude Seqs with FASTQ as query data.

Hello, I am working with FASTQ files and I want to filter them based on their alignment with reference sequences in FASTA format. I decided to use QIIME2 for this. So I imported both FASTA and FASTQ files to the required…


python – How are paths meant to be denoted on for Biopython on mac?

I am trying to run a basic biopython script to rename sequences within a fasta file. I have only ever run this on a server; I am trying to do it on my MacBook but I can’t work out what the correct path to the file should be. On the…


How to create a subset FASTA file of proteins of interest based on UniprotKB AC / Accession Numbers –

Hello, I am looking to create a subset FASTA file from an existing FASTA file. The subset file should only include entries with certain accession numbers. I have created a BioIndexed File with the correct number of entries, but I am unsure how to use the getsubset function in this…
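One way to picture the task described above, without relying on any particular library's subset function, is a plain-Python filter over the FASTA lines. This is a sketch under assumptions: the name subset_fasta is hypothetical, and headers are assumed to follow the UniProtKB style ">db|ACCESSION|ENTRY_NAME ...", with a fallback to the first whitespace-delimited token for other header styles.

```python
from typing import Iterable, Iterator, Set


def subset_fasta(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the lines of records whose accession is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split("|")
            if len(parts) >= 3:
                # Assumed UniProtKB layout: >sp|P12345|NAME_HUMAN description
                accession = parts[1]
            else:
                # Fallback: first token of the header, or empty if none.
                fields = parts[0].split()
                accession = fields[0] if fields else ""
            keep = accession in wanted
        if keep:
            yield line


# Hypothetical records for illustration only.
records = [
    ">sp|P12345|AAA_HUMAN desc\n", "MKT\n",
    ">sp|Q99999|BBB_HUMAN\n", "GGA\n", "LLV\n",
    ">tr|A0A000|CCC_MOUSE\n", "PPP\n",
]
subset = list(subset_fasta(records, {"Q99999"}))
print("".join(subset), end="")
```

Keeping the wanted accessions in a set makes each membership check constant-time, which helps when the list of accessions is long.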


Issues with searching Swissprot #25

Eddykay310 Hi @cruizperez Please help me understand the problem here and how I can fix it. I have successfully generated my DBs but I get this error during analysis. The .dmnd files do not exist in the folders as the error says but I don’t know how I can generate…


segregating sites calculation fails on gapped sequences #132

Cjfields Author Name: Jason Stajich (@hyphaltip) Original Redmine Issue: 3328, redmine.open-bio.org/issues/3328 Original Date: 2012-02-17 Original Assignee: Bioperl Guts I am Cheng-Ruei Lee, a graduate student in Duke Biology. I’m analyzing many DNA alignments of a plant species. I first used (Bio::PopGen::Utilities -> aln_to_population()) to read in the fasta format alignment,…


Questions tagged fasta – DevDreamz


Using Rsubread buildindex with GRCh37.p13.genome.fa.gz gives me an error

Hi, I am trying to build the human index using ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz I am using Rsubread 2.4.3 and it gives me the following error //================================= Running ==================================\ || || || Check the integrity of…


ClustalW on Ubuntu – DevDreamz

The section is copied from the BioPython documentation. >>> from Bio.Align.Applications import ClustalwCommandline >>> cline = ClustalwCommandline(“clustalw2″, infile=”opuntia.fasta”) >>> print(cline) clustalw2 -infile=opuntia.fasta If you run from Bio.Align.Applications import ClustalwCommandline cline = ClustalwCommandline(“clustalw2″, infile=”opuntia.fasta”) print(cline) it will do 3 things Import ClustalwCommandline module from BioPython Create a ClustalwCommandline object Print the object’s string…


Append assembly accession to nucleotide accession number in RefSeq Genbank file

Hi everyone, When I want to append the filename to the contig header in a multi-fasta file, I usually use for F in *.fasta; do N=$(basename $F .fasta) ; bbrename.sh in=$F out=${N}_mod.fasta prefix=$F addprefix=t ; done However, this…


biopython – How can i write only a specific elements of the sequences, that i downloaded using Entrez.efetch, to the file( id and sequence itself)

I’m still a beginner at this. I downloaded 20 sequences from NCBI and my task is to align them with each other, but I need to separate the data that I got using Entrez.efetch so I can use it for the alignment, and I couldn’t write only the specific elements (id and…


MitotoolPy , shown no error but no results

Has anyone used MitoToolpy (www.mitotool.org/mp.html), a python script related to mitochondrial haplogroup classification? The official documentation claims that it only takes 50 seconds for one fasta file to get the result, but I haven’t gotten the result after running for one…


Creating local nt blast database : bioinformatics

Hi all, I’m trying to create a local nt blast database. My eventual goal is to create a subset based on a taxonomic group to be used on a cluster with limited storage space; it seems the only way to do this, though, is to start with the whole database…


how to build index for cdna?

Hello, I can build index for Mus_musculus.GRCm38.dna_sm.toplevel.fa, but when build for Mus_musculus.GRCm38.cdna.all.fa, there is a bug: “rsem-extract-reference-transcripts Mus_musculus.GRCm38.cdna.all.fa 0 Mus_musculus.GRCm38.cdna.all.fa.gtf None 0 Mus_musculus.GRCm38.cdna.all” failed! Plase check if you provide correct parameters/options for the pipeline! Traceback (most recent call last): File “../indrops.py”, line 1770, in project.build_transcriptome(args.genome_fasta_gz, args.ensembl_gtf_gz, mode=args.mode) File “../indrops.py”, line…


moshi4/ANIclustermap: A tool for drawing ANI clustermap between all-vs-all microbial genomes using fastANI & seaborn


FastANI – BioGrids Consortium – Supported Software

FastANI Description: developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). Installation: Use the following command to install this title with the CLI client: $ biogrids-cli install fastani Primary Citation: C. Jain, L. M. Rodriguez-R, A. M. Phillippy, K. T. Konstantinidis, and S.…


What is ClustalW? Tutorial of How to Use ClustalW

ClustalW is a computer tool of significant importance in bioinformatics. Primarily, biologists and statisticians used it for multiple sequence alignment. Many versions of ClustalW over the development of the algorithm are available now. How to perform a search on ClustalW? ClustalW homepage 1. Go to…


Convert bedGraph to Homer tag directory?

Hi, I am new to ChIP-seq analysis. When taking published data in .bedGraph format (generated by Homer), is there any way to convert back to Homer tag directory? (other than aligning from the raw .fasta). I suppose extracting columns into .bed format and…


Using salmon in Galaxy

Hi everyone. I am executing Salmon in Galaxy in order to carry out gene quantification from mouse RNA-Seq data (6 samples). To do so, I am providing a reference genome (cDNA, in fasta format), the processed reads (in fastqsanger.gz format) of one of these samples (after executing Trim-Galore) and a…


Trimmomatic/ linux system

Hi all, I am trying to remove adapters and clean my RNA-seq.gz files using Trimmomatic, loaded on a Linux system (supercomputer server). Following the steps for paired-end reads, explained in the manual (www.usadellab.org/cms/?page=trimmomatic) java -jar trimmomatic-0.39.jar PE input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3…


bioinformatics – how to replace seqIDs in a fasta file with new seqIDs using biopython

I have a fasta file that reads like so: >00009c1cc42953fb4702f6331325c7cc TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGTTGTTAAGTCAGTGGTGAAATCGTGTGGCTCAACCATACGGAGCCATTGAAACTGGCGACCTTGAGTGTAAACGAGGTAGGCGGAATGTGACGTGTAGCGGTGAAATGCTTAGATATGTCACAGAACCCCGATTGCGAAGGCAGCTTACCAGCATACAACTGAC >000118a5e731455e942c61a82a40367a623088d0 AGAGTTTTATCCTGGCTCAGGATGAACGCTAGCGGCAGGCCTAATACATGCAAGTCGGACGGGATCTAAATTTAAGCTTGCTTAAGTTTAGTGAGAGTGGCGCACGGGTGCGTAACGCGTGAGCAACCTACCCATATCAGGGGGATAGCCCGAAGAAATTCGGATTAACACCGCATAACACAGCAATCTCGCATGAGATCACTGTTAAATATTTATAGGATATGGATGGGCTCGCGTGACATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGTCTAGGGGCTCTGAGAGGAGAATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTAAGGATTATTGGTCAATGGAGGGAACTCTGAACCAGCCATGCCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTGTCTGGGAATAAACCTTGATTCGTGAATCAAGCTGAATGTACCAGAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTATAAGTCAGAGGTGAAAGACGGCAGCTTAACTGTCGCAGTGCCTTTGATACTGTATAGCTTGAATATCGTTGAAGATGGCGGAATGAGACAAGTAGCGGTGAAATGCATAGATATGTCTCAGAACTCCGATTGCGAAGGCAGCTGTCTAAGCGGCAATTGACGCTGATGCACGAAAGCGTGGGGATCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGATAACTGGATGTTGGCGATACACAGTCAGCGTCTTAGCGAAAGCGTTAAGTTATCCACCTGGGGAGTACGCCCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAAGTTAGTGAATGCGACAGAGACGTCTCAGTCCTTCGGGACACGAAACTAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATGTTTAGTTGCCAGCATGTAATGATGGGGACTCTAAACAGACTGCCTGCGTAAGCAGCGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGTCCGGGGCTACACACGTGCTACAATGGATGGTACAGCGGGCAGCTACACAGCAATGTGATGCTAATCTCTAAAAGCCATTCACAGTTCGGATAGGGGTCTGCAACTCGACCCCATGAAGTTGGATTCGCTAGTAATCGCGTATCAGCAATGACGCGGT And I want to basically add microbial taxonomy to the seq IDs like so: d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; s__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc d__Bacteria; p__Bacteroidota; c__Bacteroidia; 
o__Sphingobacteriales; f__Sphingobacteriaceae; g__Sphingobacterium; s__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0 Where the original seqID is appended to the taxonomy…
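The relabelling described above can be sketched in plain Python as a header rewrite driven by a lookup table. Everything named here is an assumption for illustration: the function relabel_headers, the taxonomy dictionary (which in practice might be loaded from a taxonomy table), and the choice to keep the leading ">" and leave unmatched headers unchanged.

```python
from typing import Dict, Iterable, Iterator


def relabel_headers(lines: Iterable[str], taxonomy: Dict[str, str]) -> Iterator[str]:
    """Rewrite '>seqID' as '>taxonomy|seqID' when the ID is in the table."""
    for line in lines:
        if line.startswith(">"):
            seq_id = line[1:].strip()
            label = taxonomy.get(seq_id)
            if label is not None:
                yield f">{label}|{seq_id}\n"
            else:
                yield line  # no taxonomy known: keep the header as it was
        else:
            yield line  # sequence lines pass through untouched


# Hypothetical lookup table using one ID from the excerpt above.
taxonomy = {
    "00009c1cc42953fb4702f6331325c7cc":
        "d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; "
        "f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; "
        "s__uncultured_bacterium",
}
fasta = [">00009c1cc42953fb4702f6331325c7cc\n", "TACGGAGG\n", ">unknown_id\n", "AGAG\n"]
relabelled = list(relabel_headers(fasta, taxonomy))
print("".join(relabelled), end="")
```

One caveat worth keeping in mind: many FASTA parsers, Biopython included, treat only the text up to the first whitespace as the record ID, so a taxonomy string containing spaces may be split between ID and description downstream.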


Optimize a script that extract features from Fasta file using biopython

Hey, I have a script that extracts features from a large fasta file (1767 MB) using biopython. I am submitting it as a bash job to a remote server via ssh. The job has been running for two days now. Is there a way to optimize my script? I think maybe the problem…


subsample fasta to certain size

Hi there, Can anyone suggest a tool or method to extract a random 10 GB of reads with a minimum read length of 1000 bp from a huge 100 GB file? I have 50 different fa.gz files of varying size (20-100 GB) and I would like to subsample fasta with…


“No such file or directory: ‘test.xml”

Biopython NcbiblastpCommandline not working: “No such file or directory: ‘test.xml” 0 from Bio.Blast.Applications import NcbiblastpCommandline blastp=r”C:\NCBI\blast-BLAST_VERSION+\bin\blastp.exe” blastp_cline = NcbiblastpCommandline(blastp, query=r”C:/NCBI/blast-BLAST_VERSION+/bin/test.fasta”, db=r’C:/NCBI/blast-BLAST_VERSION+/bin/bos_protein.fasta’, outfmt=5, evalue=0.00001, out=r”C:/NCBI/blast-BLAST_VERSION+/bin/test.XML”) blastp_cline from Bio.Blast import NCBIXML with open(“test.XML”) as result_handle: E_VALUE_THRESH=0.01 blast_records = NCBIXML.parse(result_handle) blast_record = NCBIXML.read(result_handle) for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect…


How to check Fasta file ASCII characters and fix encoding errors?

I tried building a diamond database but got this error. Error: Error reading input stream at line 180825: Invalid character (ASCII 0) in sequence How can I fix it? Is there a tool that checks for this and…
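A quick way to locate the kind of byte the error above complains about is to scan the file in binary mode and report anything outside printable ASCII. The sketch below is illustrative only: find_bad_bytes is a hypothetical helper, the sample data is made up, and stripping NUL bytes is just one possible repair (the right fix depends on how the bytes got there).

```python
from typing import List, Tuple

# Assumption: tab (9) and carriage return (13) are tolerated;
# every other control byte and every non-ASCII byte is reported.
ALLOWED_CONTROL = (9, 13)


def find_bad_bytes(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (line_number, column, byte_value) for each suspicious byte."""
    problems = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            is_control = byte < 32 and byte not in ALLOWED_CONTROL
            if is_control or byte > 126:
                problems.append((lineno, col, byte))
    return problems


# Made-up sample; for a real file one would read it with
# open(path, "rb").read(), or process it in chunks if it is very large.
data = b">s1\nAC\x00GT\n>s2\nGG\n"
problems = find_bad_bytes(data)
print(problems)

# One possible repair: drop NUL bytes and re-check.
cleaned = data.replace(b"\x00", b"")
print(find_bad_bytes(cleaned))
```

Reading in binary mode avoids decoding errors, so the scan itself cannot fail on the very bytes it is trying to find.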


Low transcript quantification with Salmon using GRCm39 annotations

Hi everyone, first time working with mouse samples and unfortunately, there are fewer resources available for the latest mouse Ensembl genome than I was expecting. What I’ve done: I performed rRNA depletion on total RNA extracted from mouse tissue and created Illumina libraries using a cDNA synthesis kit with random…


Feature count is very low using htseq-count

Hello all, I performed bbmap on my RNA-seq paired sequence data using the following command: bbmap.sh in1=J2_R1.fastq in2=J2_R2.fastq out=output_J2.sam ref=im4.fasta nodisk The header of the generated sam file is @HD VN:1.4 SO:unsorted @SQ SN:k141_1006 LN:2503 @SQ SN:k141_5512 LN:5393 @SQ SN:k141_4772 LN:4387 @SQ SN:k141_3267 LN:4531…


Minimap2 options for Nanopore cDNA direct seq

Hello, I’m working with ONT RNA seq data and I used direct cDNA sequencing. I want to look for long deletions in mRNAs that are not spliced; for this, I want to use the splice option of…


Search for specific motif in MEME analysis

Hello! I am looking into using the MEME suite to answer some questions about VDR motifs in L1 genes. I am able to use MEME to search for motifs in my fasta data with the web-based tool, where the command would look…


BBTools – BioGrids Consortium – Supported Software

BBTools Description: a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf, fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving. Installation: Use the following command to…


need to add unique ids with accession number in multiple fasta refseq files

I need to add my unique IDs (that I have created) to accession numbers in fasta files. The unique set of IDs is given in a CSV file with column1 having unique IDs, column2 having fasta…


transcriptome – How to combine multiple .fasta files of primary assembly from Ensembl into one for sequence alignment?

I have some marmoset snRNA reads that I want to align with the reference transcriptome using cellranger. The primary assembly for marmoset is available here, which is broken down into 22 parts. However, cellranger mkref only accepts one .fa file to generate the transcriptome. I tried concatenating all the extracted…


How to extract fasta sequences from assembled transcripts generated by Stringtie

Hi all, I used STAR and stringtie for mapping reads to a reference genome and for assembly. As you know, the assembled transcripts generated by stringtie are in gtf format. Now, I want to have the fasta sequences of the assembled transcripts….


biopython – How to blastp with fasta file that contains ~50 sequences

I’m trying to blastp multiple amino acid sequences using biopython. I just can’t seem to get it right and I can’t figure out from the handbook how to do this. I have come up with the following: open(“proteins_PROT.fasta”,”r”) from Bio.Blast.Applications import NcbiblastpCommandline cline = NcbiblastpCommandline(query=”proteins_PROT.fasta”, db=”nr”, evalue=0.001, remote=True, ungapped=True) NcbiblastpCommandline(cmd=’blastp’, query=”proteins_PROT.fasta”,…


bedtools sample with fastq input and fewer input records than requested

I’m using bedtools sample to sample reads from fastq files. I’d like to submit two feature requests: If the number of requested records is larger than the input I get ERROR: Input file has fewer records than the requested number of output records. I guess this is intentional and not…


UMD Genome group

MaSuRCA assembler MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454, Pacbio and…


nf-core/circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data Introduction nf-core/circrna is a best-practice analysis pipeline for the quantification, miRNA target prediction and differential expression analysis of circular RNAs in paired-end RNA sequencing data. The pipeline is built using Nextflow, a workflow tool to run tasks across…


Using AnnoTree to Get More Assignments, Faster, in DIAMOND+MEGAN Microbiome Analysis

INTRODUCTION Next-generation sequencing (NGS) has revolutionized many areas of biological research (1, 2), providing ever-more data at an ever-decreasing cost. One such area is microbiome research, the study of microbes in their theater of activity using metagenomic sequencing (3). Here, deep short-read sequencing, and improving performance of long-read sequencing, are…

Continue Reading Using AnnoTree to Get More Assignments, Faster, in DIAMOND+MEGAN Microbiome Analysis

Clustal Processing Massive Dataset

Hello wonderful beings of bioinformatics! I’m new to this world and could use some help. My job is to run multiple sequence alignment on a large dataset. I am looking into the L1 family of genes and wanting to compare 7,525 elements of full length sequences. Each sequence is ~6,000…

Continue Reading Clustal Processing Massive Dataset

Butterfly eyespots evolved via cooption of an ancestral gene-regulatory network that also patterns antennae, legs, and wings

Although the hypothesis of gene-regulatory network (GRN) cooption is a plausible model to explain the origin of morphological novelties (1), there has been limited empirical evidence to show that this mechanism led to the origin of any novel trait. Several hypotheses have been proposed for the origin of butterfly eyespots,…

Continue Reading Butterfly eyespots evolved via cooption of an ancestral gene-regulatory network that also patterns antennae, legs, and wings

Fasta File Python

How do I go about extracting elements from a FASTA file? For example, if I want a list of all the IDs and then the length of each sequence in another list, how do I do that in base Python without using any libraries? for line in…
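One way to approach the question above in base Python is sketched here. It is only an illustration, not the answer from the original thread: the function name read_fasta_ids_and_lengths and the toy records are made up for this example, and the ID is assumed to be the first whitespace-delimited token after ">".

```python
def read_fasta_ids_and_lengths(lines):
    """Return (ids, lengths) for FASTA text given as an iterable of lines."""
    ids, lengths = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header line.
            parts = line[1:].split()
            ids.append(parts[0] if parts else "")
            lengths.append(0)
        elif ids:
            # Sequence lines may wrap, so add each line to the current record.
            lengths[-1] += len(line)
    return ids, lengths


# Toy input standing in for an open file handle.
example = """>seq1 first record
ACGT
ACG
>seq2
TTTTT
""".splitlines()

ids, lengths = read_fasta_ids_and_lengths(example)
print(ids, lengths)
```

With a real file, the same function can be given the open file object, since iterating over a text file yields its lines.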

Continue Reading Fasta File Python

Processing two lists of files with snakemake

I want to use snakemake to do bowtie2 mapping of split read files to a reference genome, and I’d like that rule to be integrated in the general workflow. For that purpose, I first defined a rule to create a bowtie index rule build_bowtie_index: input: referenceGenomeFasta output: expand("{name}.{index}.bt2", index=range(1,5), name…

Continue Reading Processing two lists of files with snakemake

Correction to: FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy | BMC Bioinformatics

Following publication of the original article [1], the authors identified that the affiliations of Giuseppe Cattaneo and Raffaele Giancarlo were interchanged. The correct affiliations are given below. The correct affiliation of Giuseppe Cattaneo is: 2Dipartimento di Informatica, Università di Salerno, Fisciano, Italy. The correct affiliation of Raffaele Giancarlo is: 3Dipartimento…

Continue Reading Correction to: FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy | BMC Bioinformatics

r – Displaying BIC and AICs from MSA nucleotide multisequence fasta file

I am trying to replicate MEGA (models/find best DNA protein models) using R. After reading the R AIC and BIC documentation I can’t understand how I can implement it. How can I implement AICs and BICs without having to specify the number of sequences in the fasta file (in case that…

Continue Reading r – Displaying BIC and AICs from MSA nucleotide multisequence fasta file

java – GATK: HaplotypceCaller IntelPairHmm only detecting 1 thread

I can’t seem to get GATK to recognise the number of available threads. I am running GATK (4.2.4.1) in a conda environment which is part of a nextflow (v20.10.0) pipeline I’m writing. For whatever reason, I cannot get GATK to see there is more than one thread. I’ve tried different…

Continue Reading java – GATK: HaplotypceCaller IntelPairHmm only detecting 1 thread

Trouble with bedtools getfasta

I am trying to extract sequences from a .fasta file based on a bed file using bedtools getfasta and I am getting the following error. The command run was the following: bedtools getfasta -fi genomic.fasta -bed bedfile.bed -fo output.fasta WARNING. chromosome (chr1) was not found…

Continue Reading Trouble with bedtools getfasta

Find Transposon Element insertions using long reads (nanopore), by alignment directly. (minimap2)

find_te_ins is designed to find Transposon Element (TE) insertions using long reads (nanopore), by alignment directly. (minimap2) Install $ git clone github.com/bakerwm/find_te_ins.git $ cd find_te_ins Change the following variables upon your condition: genome_fa and te_fa in line-10 and line-11; $ bash run_pipe.sh run_pipe.sh Prerequisite minimap2 – 2.17-r974-dirty, align long…

Continue Reading Find Transposon Element insertions using long reads (nanopore), by alignment directly. (minimap2)

Bioinformatics script using Python/Biopython/Clustalw using stdout to iterate over a directory of proteins

What exactly is the error you are seeing? You shouldn’t set sys.stderr and sys.stdout to string values (the clustalw_cline() function returns the clustal stderr and stdout as strings), as you won’t be able to write anything to stdout from python. I tried to clean up and correct your code below….

Continue Reading Bioinformatics script using Python/Biopython/Clustalw using stdout to iterate over a directory of proteins

Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

There are four types of methods to extract the longest transcript, or the longest CDS region with the longest transcript, from a transcripts fasta file or GTF file. 1. Extract longest transcript from gencode transcripts fasta file. 2. Extract longest transcript from gtf format annotation file based on gencode/ensembl/ucsc database. 3. Extract longest CDS region with longest…

Continue Reading Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

Replace sequences between files using Biopython

As you have written it, every time you write a new sequence, you’re overwriting the previous one. Try storing your records in a list and then writing out the list when the loop is completed. to_write = [] for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta"): for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta"): if seq2.id…
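The collect-then-write-once idea described above can be sketched without Biopython. This is only an illustration under stated assumptions: records are plain (id, sequence) tuples, replace_sequences and write_fasta are hypothetical helper names, and the toy nsp and wsp lists stand in for the two parsed files. The original answer uses SeqIO records instead.

```python
import io


def replace_sequences(targets, replacements):
    """Swap in the replacement sequence wherever the record IDs match."""
    lookup = dict(replacements)  # id -> replacement sequence
    to_write = []
    for record_id, seq in targets:
        # Keep the original sequence when no replacement exists.
        to_write.append((record_id, lookup.get(record_id, seq)))
    return to_write


def write_fasta(records, handle):
    """Write every record in one pass so nothing is overwritten."""
    for record_id, seq in records:
        handle.write(f">{record_id}\n{seq}\n")


# Toy records standing in for the two parsed FASTA files.
nsp = [("a", "AAAA"), ("b", "CCCC"), ("c", "GGGG")]
wsp = [("b", "TTTT")]

to_write = replace_sequences(nsp, wsp)
out = io.StringIO()  # a real script would open the output file once here
write_fasta(to_write, out)
fasta_text = out.getvalue()
print(fasta_text)
```

Building the lookup dictionary first also avoids re-reading the second file once per record of the first, which the nested loop would do.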

Continue Reading Replace sequences between files using Biopython

Mapping to multiple references using bbmap

So my question comes in two parts: First of all is what I’m trying to do within reason given the tools I am using? I am investigating the shuffling effects of a recombinase on a known reporter sequence which subsequently generates libraries of unique sequences. By simulating all of the…

Continue Reading Mapping to multiple references using bbmap

Ensembl VEP gnomAD annotated allele frequencies different from gnomAD browser

I’ve annotated some variants using VEP, and was looking at the minor allele frequencies. Some of the variants had very different MAFs in the annotation than I expected (I expected MAF < 1%, whereas some annotated MAFs were >50%). I looked up the same variants on the gnomAD v3 browser,…

Continue Reading Ensembl VEP gnomAD annotated allele frequencies different from gnomAD browser

Bioconductor on Microsoft Azure – Microsoft Tech Community

Co-authored by: Nitesh Turaga – Scientist at Dana Farber/Harvard, Bioconductor Core Team Erdal Cosgun – Sr. Data Scientist at Microsoft Biomedical Platforms and Genomics team Vincent Carey – Professor at Harvard Medical School, Bioconductor Core Team   Introduction   The Bioconductor project promotes the statistical analysis and comprehension of current and emerging…

Continue Reading Bioconductor on Microsoft Azure – Microsoft Tech Community

How to print the first few records using SeqIO from Biopython

There are numerous ways to do this. The most similar to your current structure would be to add a break when the index hits 19 (that is the 20th number since counting starts at 0): from Bio import SeqIO for index, record in enumerate(SeqIO.parse("e_coli_k12_dh10b.faa", "fasta")): print(record.description, len(record.seq)) if index ==…
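Another of the "numerous ways" is itertools.islice, which stops any iterator after a fixed number of items. The sketch below is an assumption-laden illustration rather than the thread’s own code: iter_fasta is a hypothetical stdlib-only generator standing in for SeqIO.parse, and the three toy records are invented.

```python
from itertools import islice


def iter_fasta(lines):
    """Yield (description, sequence) pairs lazily from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input; a real script would pass an open file handle instead.
lines = [">p1 protein one", "MKV", "LAA", ">p2", "MST", ">p3", "MQ"]

# islice stops after two records without reading the rest of the input.
first_two = list(islice(iter_fasta(lines), 2))
for description, seq in first_two:
    print(description, len(seq))
```

Because islice accepts any iterator, the same call shape should also work on the iterator returned by a FASTA parsing library.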

Continue Reading How to print the first few records using SeqIO from Biopython

How to identify different protein domains using HHpred

I am new to HHpred and to analyzing proteins, so bear with me. I have been given an uncharacterized protein, whose FASTA sequence I have obtained from Uniprot. I am looking to do the following by using HHpred. So far, I have done…

Continue Reading How to identify different protein domains using HHpred

fasta | HowToFix

Doing this with normal file operations will be complicated, since FASTA sequences can be a variable number of lines. It’s best to use a library to parse the files, such as pyfastx import pyfastx fa1 = pyfastx.Fastx('file1.fasta') fa2 = pyfastx.Fastx('file2.fasta') for index, ((name1, seq1, comment1), (name2, seq2, comment2)) in enumerate(zip(fa1,…

Continue Reading fasta | HowToFix

Petabase-scale sequence alignment catalyses viral discovery

Serratus alignment architecture Serratus (v0.3.0) (github.com/ababaian/serratus) is an open-source cloud-infrastructure designed for ultra-high-throughput sequence alignment against a query sequence or pangenome (Extended Data Fig. 1). Serratus compute costs are dependent on search parameters (expanded discussion available: github.com/ababaian/serratus/wiki/pangenome_design). The nucleotide vertebrate viral pangenome search (bowtie2, database size: 79.8 MB) reached processing rates…

Continue Reading Petabase-scale sequence alignment catalyses viral discovery

bio-alignment from masyagin1998 – Github Help

Implementation of Needleman-Wunsch, Smith-Waterman, Hirschberg and affine bioinformatics algorithms for aligning biological sequences. Tech Algorithm is coded in pure C89 without any dependencies. Installation bio-alignment requires only C89-compatible compiler and make utility. $ cd bio-alignment $ make $ ./bin/bio-alignment --help $ ./bin/bio-alignment -i data/in.fasta -o out.fasta -s blosum62 -g -5…

Continue Reading bio-alignment from masyagin1998 – Github Help

Convert DNAStringSet to a list of elements in R? (Error in seq[[1]][[“seq”]] : subscript out of bounds in R)

I have a bed file which contains DNA sequence information as follows: track name="194" description="194 methylation (sites)" color=0,60,120 useScore=1 chr1 15864 15866 FALSE 894 + chr1 534241 534243 FALSE 921 - chr1 710096 710098 FALSE 729 + chr1 714176 714178 FALSE 12 - chr1 720864 720866 FALSE 988 -…

Continue Reading Convert DNAStringSet to a list of elements in R? (Error in seq[[1]][[“seq”]] : subscript out of bounds in R)

FASTA Sequences for mutant alleles : bioinformatics

Background: I’m trying to run AlphaFold on an ACT1 allele in yeast e.g. www.yeastgenome.org/allele/act1-105. It has been sequenced, and it has two known amino-acid mutations (E311A, R312A).  My question is: Is there a database that has the .fasta sequence for such alleles, which include the mutations? I can get the fasta sequence…

Continue Reading FASTA Sequences for mutant alleles : bioinformatics

biopython – Help to create a dataframe in Python from a FASTA file

I want to create a dataframe in Python starting from a FASTA format file. Given the toy FASTA file that I am attaching, I built this program in Python that returns four columns corresponding to id, sequence length, sequence, animal name and rows corresponding to all the data available. However,…

Continue Reading biopython – Help to create a dataframe in Python from a FASTA file

Failure to detect mutations in U2AF1 due to changes in the GRCh38 reference sequence

Materials and Methods Genomic data was collected as part of the MDS National History Study or The Cancer Genome Atlas project and consented appropriately under those protocols 8 Sekeres M.A. Gore S.D. Stablein D.M. DiFronzo N. Abel G.A. DeZern A.E. Troy J.D. Rollison D.E. Thomas J.W. Waclawiw M.A. Liu J.J….

Continue Reading Failure to detect mutations in U2AF1 due to changes in the GRCh38 reference sequence

a Rust-backed Python library for DNA translation that is up to 100x faster than Biopython : bioinformatics

Background: I work at SecureDNA, where we use Biopython pretty extensively. It’s a great library, but often quite slow, and we’ve run into bottlenecks in our processing pipelines around Biopython’s translation speed. I wrote this library to augment Biopython — you can read your sequences out of FASTA files with…

Continue Reading a Rust-backed Python library for DNA translation that is up to 100x faster than Biopython : bioinformatics

Extracting organism and seq from fasta

Hi, I am trying to extract sequences from a fasta file from a database with a specific organism species keyword from a .txt file containing the relevant headers. Do you know how I can do this in python as the biopython guide I’ve…

Continue Reading Extracting organism and seq from fasta

Fasta file reading python

Answer by Aidan Golden: I think you can just use Biopython. It is indeed wrong today. I edited the answer since it has been possible to use str(sequence) for a long time now. Very useful answer from 7 years ago! FYI, in current version of biopython (1.69), fasta.seq.tostring() is obsolete, use str(fasta.seq) instead. Nicely…

Continue Reading Fasta file reading python

Sequence extraction

Hello, I have a fasta file that contains sequences of different lengths. I want to extract the sequences longer than 500 bp and shorter than 10,000 bp and write them to a new fasta file. What should I do? Thanks a lot if anyone can help.…
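A minimal sketch of the length filter asked about above, in plain Python. The helper names filter_by_length and format_fasta, the 60-character line width, and the toy records are all assumptions made for illustration; the strict bounds follow the question’s wording ("greater than 500 and less than 10000").

```python
def filter_by_length(records, min_len=500, max_len=10000):
    """Keep (name, sequence) records with min_len < length < max_len."""
    return [(name, seq) for name, seq in records if min_len < len(seq) < max_len]


def format_fasta(records, width=60):
    """Render records as FASTA text, wrapping sequences at `width` characters."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        for start in range(0, len(seq), width):
            out.append(seq[start:start + width])
    return "\n".join(out) + "\n" if out else ""


# Toy records sitting on and around the two length bounds.
records = [
    ("short", "A" * 500),    # exactly 500: excluded by the strict bound
    ("ok", "A" * 501),       # kept
    ("long", "A" * 10000),   # exactly 10000: excluded
]

kept = filter_by_length(records)
fasta_text = format_fasta(kept)
print([name for name, _ in kept])
```

If the bounds are meant to be inclusive, the comparison can be changed to `min_len <= len(seq) <= max_len`.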

Continue Reading Sequence extraction

Using Minimap2 with FMLRC2

Hello all, I am using FMLRC2 (github.com/HudsonAlpha/rust-fmlrc) to correct PacBio reads with Illumina reads for hybrid genome assembly. Since FMLRC2 only corrects reads (does not do any assembly) another program is needed. In the paper published on FMLRC minimap (now succeeded by minimap2, github.com/lh3/minimap2) was…

Continue Reading Using Minimap2 with FMLRC2

[lh3/minimap2] Memory leak when using Python and threads

The program align.py uses mappy to align reads in Python using multiple worker threads. After loading the index the memory usage jumps up quickly to >20Gb and then continues to climb steadily through 40Gb and beyond. This issue was first discovered in bonito and isolated to mappy. The data flow…

Continue Reading [lh3/minimap2] Memory leak when using Python and threads

Bwa on multiple processor

Hi Guys, When I try to run bwa mem on multiple processors, I get the following error: > mpirun -np 16 bwa mem hg19-agilent.fasta R1.fastq R2.fastq | samtools sort -o aln.bam [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read…

Continue Reading Bwa on multiple processor

processing in strelka2 with multiples bam file in directory

If I manually tell strelka2 to use these three bam files below, then I get the desired results of 3 individual genome files in results/variants. xxx_00.bam yyy_01.bam zzz_02.bam ${path_to_strelka}/bin/configureStrelkaGermlineWorkflow.py --bam xxx_00.bam --bam yyy_01.bam --bam zzz_02 --referenceFasta <fasta> --callRegions <.bed.gz> --runDir…

Continue Reading processing in strelka2 with multiples bam file in directory

MARS seq alingment

Hello everyone, new here and also new to the field. I was asked to create a pipeline for RNA-seq, and after two months of teaching myself how to work with each tool I’m stuck with the program STAR. What I’m trying to do for now…

Continue Reading MARS seq alingment

Secret BBMAP helper page – HRGV/Marmics_Metagenomics Wiki

How to map to the assembled scaffolds.fasta: bbmap is a powerful and highly flexible read mapper (jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/). For the upcoming analysis you are not interested in the typical mapping output but in statistics on the coverage of every scaffold; you can get them with scaffstats. We want to be specific…

Continue Reading Secret BBMAP helper page – HRGV/Marmics_Metagenomics Wiki

Alignment report

Hi Guys, I aligned R1 and R2 fastq files to a reference genome using bwa mem and got a bam file. Now I want to check whether the alignment was done correctly, along with the alignment percentage, coverage, etc. I ran the following command: bwa mem hg19.fasta R1.fastq R2.fastq | samtools…

Continue Reading Alignment report

makeblastdb creating multiple files of unexpectedly large sizes

I have a set of 100 amino acid sequences and I want to perform a BLASTP search against the refseq_protein database. Accordingly I had set up the standalone version of BLAST (Version 2.11.0+) and downloaded the refseq_protein database from NCBI using the following code wget ftp.ncbi.nlm.nih.gov/refseq/release/complete/*.faa.gz The database gets downloaded…

Continue Reading makeblastdb creating multiple files of unexpectedly large sizes

hg38 Import custom reference upload error

Our version of TS is 5.12.2. When trying to upload a new custom reference fasta (downloaded from NCBI, ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, gunzipped and renamed to hg38.fasta) through “Import custom reference” in the interface, an error occurs: “uploaded file size is incorrect” (to be honest the error was not shown in logs, because of TypeError…

Continue Reading hg38 Import custom reference upload error

sequence alignment – Help with MinION sequencing data species identification

Hi I’m new to bioinformatics and have just completed my first run on the MinION (long read sequencing Oxford Nanopore Technologies). I was hoping someone could direct me towards R packages, workflow, tutorials or guides that will help me identify species that are present in my sample mainly for fungi…

Continue Reading sequence alignment – Help with MinION sequencing data species identification