Tag: FASTA

bio-alignment from masyagin1998 – Github Help

Implementation of the Needleman-Wunsch, Smith-Waterman, Hirschberg and affine bioinformatics algorithms for aligning biological sequences. Tech: the algorithms are coded in pure C89 without any dependencies. Installation: bio-alignment requires only a C89-compatible compiler and the make utility. $ cd bio-alignment $ make $ ./bin/bio-alignment --help $ ./bin/bio-alignment -i data/in.fasta -o out.fasta -s blosum62 -g -5…
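
As a rough illustration of the first algorithm named above, here is a minimal Python sketch of Needleman-Wunsch global alignment scoring. It is not the repository's C89 code: the match/mismatch values are assumptions standing in for the BLOSUM62 matrix, the linear gap penalty of -5 mirrors what the `-g -5` option presumably sets, and the function name is hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-5):
    """Return the optimal global alignment score of a and b (linear gap penalty)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic-programming table are kept, which is enough for the score; recovering the alignment itself needs the full table or Hirschberg's divide-and-conquer variant.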

Convert DNAStringSet to a list of elements in R? (Error in seq[[1]][["seq"]] : subscript out of bounds in R)

I have a bed file which contains DNA sequence information as follows: track name="194" description="194 methylation (sites)" color=0,60,120 useScore=1 chr1 15864 15866 FALSE 894 + chr1 534241 534243 FALSE 921 - chr1 710096 710098 FALSE 729 + chr1 714176 714178 FALSE 12 - chr1 720864 720866 FALSE 988 -…

FASTA Sequences for mutant alleles : bioinformatics

Background: I’m trying to run AlphaFold on an ACT1 allele in yeast e.g. www.yeastgenome.org/allele/act1-105. It has been sequenced, and it has two known amino-acid mutations (E311A, R312A).  My question is: Is there a database that has the .fasta sequence for such alleles, which include the mutations? I can get the fasta sequence…

biopython – Help to create a dataframe in Python from a FASTA file

I want to create a dataframe in Python starting from a FASTA format file. Given the toy FASTA file that I am attaching, I built this program in Python that returns four columns corresponding to id, sequence length, sequence, animal name and rows corresponding to all the data available. However,…

Failure to detect mutations in U2AF1 due to changes in the GRCh38 reference sequence

Materials and Methods: Genomic data was collected as part of the MDS National History Study or The Cancer Genome Atlas project and consented appropriately under those protocols [8]…

a Rust-backed Python library for DNA translation that is up to 100x faster than Biopython : bioinformatics

Background: I work at SecureDNA, where we use Biopython pretty extensively. It’s a great library, but often quite slow, and we’ve run into bottlenecks in our processing pipelines around Biopython’s translation speed. I wrote this library to augment Biopython: you can read your sequences out of FASTA files with…

Extracting organism and seq from fasta

Hi, I am trying to extract sequences from a fasta file from a database with a specific organism species keyword from a .txt file containing the relevant headers. Do you know how I can do this in Python, as the Biopython guide I’ve…

Fasta file reading python

Answer by Aidan Golden: I think you can just use Biopython. It is indeed wrong today; I edited the answer, since it has been possible to use str(sequence) for a long time now. Very useful answer from 7 years ago! FYI, in the current version of Biopython (1.69), fasta.seq.tostring() is obsolete; use str(fasta.seq) instead. Nicely…

Sequence extraction

Hello, I have a fasta file that contains sequences of different lengths. I want to extract the sequences longer than 500 bp and shorter than 10000 bp and write them to a new fasta file. What should I do? Thanks a lot if anyone can help.
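
One way to approach the length filter asked about above is sketched below in plain Python, with no Biopython dependency. The function names, the strict bounds (longer than 500, shorter than 10000) and the 60-column output width are assumptions made for illustration, not part of the original thread.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=500, max_len=10000):
    """Keep records whose length is strictly between min_len and max_len."""
    for header, seq in records:
        if min_len < len(seq) < max_len:
            yield header, seq


def write_fasta(records, handle, width=60):
    """Write records to an open text handle, wrapping sequences at `width`."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Example usage (file names are placeholders):
# with open("in.fasta") as src, open("filtered.fasta", "w") as dst:
#     write_fasta(filter_by_length(read_fasta(src)), dst)
```

Because all three functions are generators or stream to a handle, only one record is held in memory at a time.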

Using Minimap2 with FMLRC2

Hello all, I am using FMLRC2 (github.com/HudsonAlpha/rust-fmlrc) to correct PacBio reads with Illumina reads for hybrid genome assembly. Since FMLRC2 only corrects reads (it does not do any assembly), another program is needed. In the paper published on FMLRC, minimap (now succeeded by minimap2, github.com/lh3/minimap2) was…

[lh3/minimap2] Memory leak when using Python and threads

The program align.py uses mappy to align reads in Python using multiple worker threads. After loading the index, the memory usage jumps up quickly to >20Gb and then continues to climb steadily through 40Gb and beyond. This issue was first discovered in bonito and isolated to mappy. The data flow…

Bwa on multiple processor

Hi guys, when I try to run bwa mem on multiple processors, I get this error: > mpirun -np 16 bwa mem hg19-agilent.fasta R1.fastq R2.fastq | samtools sort -o aln.bam [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read…

processing in strelka2 with multiples bam file in directory

If I manually tell strelka2 to use the three bam files below, then I get the desired result of 3 individual genome files in results/variants. xxx_00.bam yyy_01.bam zzz_02.bam ${path_to_strelka}/bin/configureStrelkaGermlineWorkflow.py --bam xxx_00.bam --bam yyy_01.bam --bam zzz_02 --referenceFasta <fasta> --callRegions <.bed.gz> --runDir…

MARS seq alignment

Hello everyone, new here and also new to the field. I was asked to create a pipeline for RNA-seq, and after two months of teaching myself how to interact with each tool, I’m stuck with the program STAR. What I’m trying to do for now…

Secret BBMAP helper page – HRGV/Marmics_Metagenomics Wiki

How to map to the assembled scaffolds.fasta: bbmap is a powerful and highly flexible read mapper (jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/). For the upcoming analysis you are not interested in the typical mapping output but in statistics on the coverage of every scaffold; you can get them with scaffstats. We want to be specific…

Alignment report

Hi guys, I aligned R1 and R2 fastq files to a reference genome using bwa mem and got a bam file. Now I want to check whether the alignment was done correctly, along with the alignment percentage, coverage, etc. I ran the following command: bwa mem hg19.fasta R1.fastq R2.fastq | samtools…

makeblastdb creating multiple files of unexpectedly large sizes

I have a set of 100 amino acid sequences and I want to perform a BLASTP search against the refseq_protein database. Accordingly, I set up the standalone version of BLAST (version 2.11.0+) and downloaded the refseq_protein database from NCBI using the following code: wget ftp.ncbi.nlm.nih.gov/refseq/release/complete/*.faa.gz The database gets downloaded…

hg38 Import custom reference upload error

Our version of TS is 5.12.2. When trying to upload a new custom reference fasta (downloaded from NCBI, ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, gunzipped and renamed to hg38.fasta) through “Import custom reference” in the interface, an error occurs: “uploaded file size is incorrect” (to be honest, the error was not shown in the logs because of a TypeError…

sequence alignment – Help with MinION sequencing data species identification

Hi I’m new to bioinformatics and have just completed my first run on the MinION (long read sequencing Oxford Nanopore Technologies). I was hoping someone could direct me towards R packages, workflow, tutorials or guides that will help me identify species that are present in my sample mainly for fungi…

Custom genetic database – Deepmind/Alphafold

It is possible, but only with a code change in data/pipeline.py: If the database is a FASTA file, you could add a new Jackhmmer searcher for that database. You can take a look at the jackhmmer_uniref90_runner and basically follow the same logic for your database. If the database is a…

Failed to instantiate plugin dbNSFP in VEP

Hi Team, my VEP (version 105, installed with perl INSTALL.pl) works well, but I have problems using the dbNSFP plugin (also installed with perl INSTALL.pl) with the VEP tool. My dbNSFP version 4.2a was installed with the following code without any warning…

What is RNAcentral? | RNAcentral

RNAcentral is a database of non-coding RNA sequences that aggregates ncRNA data from over 40 member resources known as Expert Databases. Non-coding RNAs: Similar to mRNAs, non-coding RNAs (ncRNAs) are transcribed from DNA but are not translated into proteins. NcRNAs are found in all organisms and have a broad range…

A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

BLEND is a mechanism that can efficiently find fuzzy seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applications: 1) finding overlapping reads and 2) read mapping. Finding fuzzy seed matches enables BLEND to find both 1) exact-matching seeds…

Dryad Data — Lichen fungi do not depend on the alga for ATP production

Lichen fungi live in a symbiotic association with unicellular phototrophs and have no known aposymbiotic stage. A recent study postulated that some of them have lost mitochondrial oxidative phosphorylation and rely on their algal partners for ATP. This claim originated from an apparent lack of ATP9, a gene encoding one subunit…

Benchmarking the NVIDIA Clara Parabricks germline pipeline on AWS

This blog post was contributed by Ankit Sethia, PhD, and Timothy Harkins, PhD, at NVIDIA Parabricks, and Olivia Choudhury, PhD,  Sujaya Srinivasan, and Aniket Deshpande at AWS. This blog provides an overview of NVIDIA’s Clara Parabricks along with a guide on how to use Parabricks within the AWS Marketplace. It…

samtools mpileup error – 1 samples in 1 input files

Hi All, I am relatively new to bioinformatics and have encountered an issue when trying to generate an mpileup file with samtools. I entered the following command: samtools mpileup -f /home/path_to_reference/nCoV_Jan31.fa.fasta sorted_sample1.sam > sample.mpileup The message returned is…

Novel bioinformatics pipeline for fast and scalable analysis of large viral phylogenies

A team of researchers recently developed a bioinformatics approach to analyze viral phylogenetic clusters and posted their findings to the bioRxiv preprint server. Study: ClusTRace, a bioinformatic pipeline for analyzing clusters in virus phylogenies. Background: Coronavirus disease 2019 (COVID-19)…

How to differentiate between 16S hypervariable regions using QIIME2? – User Support

M_F: “May I search the sequences on NCBI, for example, corresponding to the V4 domain?” No, NCBI probably would not have such sequences in an easily indexed form, but I could be wrong. Rather, grab some reference sequences (can be a random subsample, do not need all of them) and use…

Blast command line pipeline not working

Hello, I am now running a local BLAST pipeline on macOS. The goal here is to take the interval of the 5 best hits and then extract the SNP variants from multiple vcf.gz files. But I am facing an error which I cannot solve…

Padding out a GVCF file with 1000G exomes to get gatk VariantRecalibrator working with a small sample

I’ve got sequencing data for a small 500 bp amplicon from a few samples. GATK best principles suggest running VariantRecalibrator on the GVCF files I generate. I’m trying to get this working, but I get an error about “Found annotations with zero variances”. Reading the gatk manual and other posts…

Error with file guillaumeKUnitigsAtLeast32bases_all.fasta, kUnitigLengths.txt is of size 0, must be at least of size 1.

Hello, I am trying to run an assembly with MaSuRCA but am getting an error at the step “Computing super reads from PE”. Here’s the output with the error: [xxxx@vic Bovidae]$ cd Assembly_test/ [xxxx@vic Assembly_test]$ ls assemble.sh guillaumeKUnitigsAtLeast32bases_all.fasta.tmp masurca_assembly.o4302352 meanAndStdevByPrefix.pe.txt pe_data.tmp quorum_mer_db.jf work1 environment.sh guillaumeKUnitigsAtLeast32bases_all.jump.fasta masurca_config.txt pe.cor.fa pe.renamed.fastq super1.err ESTIMATED_GENOME_SIZE.txt masurca_assembly.e4302352…

Indexing with STAR

Hello, I am working with RNA-seq data and creating an index of the Gossypium hirsutum reference genome using STAR. STAR asks for the GTF annotation format, while my file is GFF3. According to the literature, in order to run a GFF file I need to remove --sjdbOverhang 50 and…

alphafold2: HHblits failed – githubmemory

I’ve tried using the standard alphafold2 setup via docker (converted to a singularity container) via the setup described at github.com/kalininalab/alphafold_non_docker, and both result in the following error: […] E1210 12:01:01.009660 22603932526400 hhblits.py:141] – 11:49:18.512 INFO: Iteration 1 E1210 12:01:01.009703 22603932526400 hhblits.py:141] – 11:49:19.070 INFO: Prefiltering database E1210 12:01:01.009746 22603932526400 hhblits.py:141]…

how to add reference alleles to VCF?

I’m converting gVCFs to VCF, but the reference alleles are missing. An example below: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 180525_FD02929177 1 97547947 . T . . . DP=31 GT:DP:RGQ 0/0:31:81 1 97915614 . C . . . DP=40…

gatk VariantRecalibrator positional argument error

I’m trying to recalibrate my VCF using gatk VariantRecalibrator, but keep getting the error “Illegal argument value: Positional arguments were provided”. I don’t know what this means, or how to correct it! Here’s my call: gatk VariantRecalibrator -R "/Volumes/Seagate Expansion Drive/refs/hg38/gatk download/Homo_sapiens_assembly38.fasta" -V "$OUT"/results/variants/"$SN".norm.vcf.gz -AS --resource hapmap,known=false,training=true,truth=true,prior=15.0: "/Volumes/Seagate…

HISAT2 question, index generation.

Hello everyone, I have a question. I am running a basic RNA-seq analysis workflow. A question arose when I generated the index in HISAT2 from the reference genome FASTA file: what does the information that HISAT2 prints at the end mean? E.g.…

Using reverse-complement or just complement of SILVA database to filter rRNA from metatranscriptomics

I want to use the SILVA database to filter rRNA from metatranscriptomics data. The nucleobases in the SILVA fasta file are A, U, C and G. Therefore, before indexing the SILVA fasta file, should I get the reverse-complement…

get rRNA FASTA file for a particular bacteria

Hey all, I was trying to find a way to get all rRNA (5S, 16S and 23S) FASTA sequences for a particular bacterium (B. thetaiotaomicron VPI-5482, which is the type strain). I wanted this file so that I could use something…

Can AlphaFold2 be used to predict DNA binding from fasta sequences?

I am using AlphaFold2 to predict the 3D structure of proteins from fasta sequences. I would like to evaluate whether this predicted structure binds to DNA or other ligands. Is it possible to evaluate this kind of binding…

Get data from KEGG Brite

Hi, I would like to retrieve all the interactions between ligands and target proteins from the KEGG BRITE database. Ideally, each entry will contain a protein name, a list of interacting ligands, its FASTA sequence and the SDF or MOL2 coordinates of the ligand,…

How to retrieve fasta sequence after local blast?

Hello, I have created a BLAST database using a reference genome. Then I performed a local BLAST search on the command line using a gene of interest. I obtained some hits with the usual BLAST information. Now, I want to…

Getting sequence from any fasta based on coordinates

I have received coordinates of several genes (not annotated) and was told that the origin is TAIR10. I wanted to extract these sequences based only on this information, but ran into several doubts. I know it seems trivial, but I am curious whether…

Biopython: Bio.SeqUtils.molecular_weight for a fasta file

I must write a function that, given a file_name, calculates the molecular weight of only the unambiguous sequences and returns each sequence ID with the corresponding molecular weight. I tried to use Bio.SeqUtils.molecular_weight to calculate the molecular weight, but I couldn’t do it since SeqUtils.molecular_weight works with…

Running SortMeRNA on Multiple Files

Hi all, I am VERY new to SortMeRNA (I’m a PhD student taking a bioinformatics class that has been very poorly taught). I have 27 paired samples for a total of 54 files named like this: SRR13711719_1_val_1.fq SRR13711719_2_val_2.fq. So the format is _1_val_1.fq and…

How to find the longest orf from a transcriptome

Hello, good day. Sorry for the simplicity, but I have a super basic question. I have been trying to identify the longest ORF of a transcriptome for a long time, but from the previous failed attempts it seems that it…

How to write out an ID and AA sequence from a SWISS PROT database file into a new file in a specific order using python?

I have a Swiss-Prot database file that contains several Swiss-Prot Files. They are copied and pasted underneath each other. Therefore there is one Swiss-Prot entry after another listed in the same file. I want to write the ID into another file as the header. Immediately underneath, I want to write…

FastTree error while constructing tree

Hey All, I am trying to infer a phylogeny from a multiple sequence alignment using FastTree program, however the program is giving me an error when I run it over the multiple sequence alignment and I can not figure out what the error is saying (not really that informative). My…

NCBI datasets bulk protein fasta download

Hi, I want to download protein fasta files for a set of bird species. I have the genome assembly accessions in a file. I feel like every time I need to bulk download fasta files I’ve forgotten how I did it last time…

Aligning large sets of sequences

Hello folks, I am seeking advice on a multiple sequence alignment that I am trying to build. The fasta file I am trying to align contains SARS-CoV-2 Spike protein sequences; it has nearly 600k sequences ranging from 1270 to 1275 aa. I have aligned them with clustalo and mafft with default parameters…

Making a FASTA file from a segment of a DNA sequence

Hello everyone, I have copied a segment of a known DNA sequence and I want to turn this segment of DNA into a FASTA file in order to BLAST it against a custom-made database. I mostly work…

SnpEff does not create htmlStats

SnpEff does not create htmlStats with the below command: $ snpEff eff -Xmx20G LAB330 LabUsa16cWild01-20_L-Q.vcf | head ##fileformat=VCFv4.0 ##filedate=20210414 ##source=SGSautoSNP ##reference=NbLab330.genome.softmasked.fasta ##phasing=allhomozygote ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples"> ##INFO=<ID=PL,Number=0,Type=String,Description="Panel"> ##SnpEffVersion="5.0e (build 2021-03-09 06:01), by Pablo Cingolani" ##SnpEffCmd="SnpEff LAB330 LabUsa16cWild01-20_L-Q.vcf " ##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation…

igBLAST query/options error

When I try to run this command: igblastn -germline_db_V $GERMLINE_DB"/human_gl_HV" -germline_db_J $GERMLINE_DB"/human_gl_HJ" -germline_db_D $GERMLINE_DB"/human_gl_HD" -organism human -domain_system imgt -query $WORKDIR"/"$FILE".fasta" -auxiliary_data $IGBLASTDIR"/optional_file/human_gl.aux" -outfmt 7 -num_threads 4 -num_alignments_V 5 -out $FILE"_tab.igblast" I get this error: BLAST query/options error: Germline annotation database human/human_V could not be found in…

increasing word size extremely slows down the search

Hello, I need to blastp one genome (15,000 seqs) against another genome (12,000 seqs) using Biopython. I decided to use local BLAST and query the genome 1 fasta file against a genome 2 database (made by the makeblastdb command with the second genome…

how to create a custom database (GTDB) ?

Hello, I was asked to create a custom database from GTDB. I just need to incorporate some metagenome-assembled genomes (MAGs) into the GTDB database; the issue is that I don’t know how to do that. The GTDB file “gtdbtk_data.tar.gz” from…

pfam_scan.pl can’t find the pfamdb

I am trying to run the pfam_scan.pl script, which keeps generating the error below even though both the Pfam-A.hmm and Pfam-A.hmm.dat files are in /pfamdb. Can someone please help me identify the errors and resolve this? perl /media/owner/b45f8e7a-003c-4573-8841-bcb5f76f281f/sn/rgaugury/PfamScan/pfam_scan.pl -fasta Hannuus_494_r1.2.protein.fa -dir /media/owner/b45f8e7a-003c-4573-8841-bcb5f76f281f/sn/rgaugury/database/pfamdb FATAL: can’t find “Pfam-A.hmm” and/or…

replacing fasta headers

Hi, I would like to modify the fasta headers in a file. I would like to change: >A0A0F2M4U6|A0A0F2M4U6_SPOSC Endoplasmic reticulum chaperone BiP OS=Sporothrix schenckii 1099-18 OX=1397361 GN=SPSK_04019 PE=3 SV=1 to >A0A0F2M4U6 Thanks in advance!
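
The header change requested above can be sketched in a few lines of Python. This assumes the accession is always the text between `>` and the first `|` (or the first whitespace, if there is no `|`); the function name is hypothetical.

```python
def shorten_header(line):
    """Reduce '>ACC|NAME description' to '>ACC'; return sequence lines unchanged."""
    if not line.startswith(">"):
        return line
    body = line[1:].strip()
    # Keep the text before the first '|', then drop anything after whitespace.
    parts = body.split("|", 1)[0].split()
    accession = parts[0] if parts else ""
    # Preserve the trailing newline if the input line had one.
    newline = "\n" if line.endswith("\n") else ""
    return ">" + accession + newline


# Example usage (file names are placeholders):
# with open("in.fasta") as src, open("renamed.fasta", "w") as dst:
#     for line in src:
#         dst.write(shorten_header(line))
```

If two records share an accession, the shortened headers will collide, so it may be worth checking the output for duplicates.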

find positions of a short sequence in a genome

Here’s a demo Python script you can modify for your use, which suggests the rough principle: #!/usr/bin/env python import sys import re bed = """chr1\t0\t10\tABCDEFGHIJ chr1\t5\t15\tFGHIJABCDO chr1\t10\t20\tABCDOPABCD""" string_to_match = sys.argv[1] pattern = re.compile(string_to_match) for line in bed.split("\n"): (chr, start, stop, id) = line.split("\t") for match in pattern.finditer(id): sys.stdout.write("\t".join([chr, str(int(start) +…

How to cut a 15000 sequence file into multiple files of 1000nt each and save it in new files like F1,F2 and so on?

I have a file with a sequence of more than 15000 nt and I want it to be separated into new files of 1000 nt each, like F1, F2, … Example INTO…
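
A minimal Python sketch of the splitting described above follows. It assumes the sequence has already been read into a single string (header removed, line breaks joined); the `.fasta` extension, the coordinate header, and both function names are illustrative choices rather than anything stated in the question.

```python
from pathlib import Path


def split_sequence(seq, size=1000):
    """Return consecutive pieces of seq, each at most `size` characters long."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def write_chunks(seq, out_dir=".", size=1000, prefix="F"):
    """Write each piece to <prefix><n>.fasta (n starts at 1) and return the paths."""
    out_dir = Path(out_dir)
    paths = []
    for n, chunk in enumerate(split_sequence(seq, size), start=1):
        path = out_dir / f"{prefix}{n}.fasta"
        # The header records the 1-based coordinates of the piece in the input.
        start = (n - 1) * size + 1
        end = start + len(chunk) - 1
        path.write_text(f">{prefix}{n} {start}-{end}\n{chunk}\n")
        paths.append(path)
    return paths
```

The last file is simply shorter when the sequence length is not a multiple of the chunk size.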

Index of /~psgendb/doc/local/zhangju/eclipse_project/Pl_MGCB2/Module_Dir

Name Last modified Size Description Parent Directory   –   Cazy_gbk_protein.pl 2015-11-09 14:56 2.5K   DataRetrieving.pl 2015-11-09 14:56 6.7K   GOshell.pl 2015-11-09 14:56 24K   ParseGffFile.pl 2015-11-09 14:56 17K   ParseKEGG.pl 2015-11-09 14:56 17K   Parse_Fasta.pl 2015-11-09 14:56 34K   Parse_Fastq.pl 2015-11-09 14:56 1.2K   Parse_Genbank.pl 2015-11-09 14:56 86K  …

Gene Expression Prediction from DNA sequences

Hi everyone! I am a university student working on my Master’s thesis. I worked on a paper called Xpresso, which aims to predict gene expression levels from DNA sequences using deep learning techniques. Now, my lecturers have asked me…

Conserved regions in multiple fasta files (multiple species)?

Dear biostars members, I have multiple fasta files containing different numbers of transcripts from different species (one file for each species). I want to find conserved regions in sequences between the species. I’m completely new to this topic. I’ve performed some…

Clustal Omega Output Not Correct

Hello, I am having an issue with my Biopython program. My project is due soon and I can’t figure out what’s going on. I am running this code based on a tutorial, and I’m new to Python. Here is my code: from Bio import…

How to count fastq reads

'wc' is faster than awk: #yourfile.fastq echo $(cat yourfile.fastq|wc -l)/4|bc #yourfile.fastq.gz echo $(zcat yourfile.fastq.gz|wc -l)/4|bc for fasta files: grep -c "^>" file.fasta for fastq files: grep -c "^@" file.fastq for fastq files: awk '{s++}END{print s/4}' file.fastq Here is the fancy script in bash: #!/bin/bash…
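
The line-count approach quoted above can also be written as a small Python function, sketched here under the assumption that every record occupies exactly four lines (no wrapped sequences). The function name is hypothetical. Counting lines is generally safer than counting lines that start with `@`, because `@` is also a valid quality character and can open a quality line.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a four-line-per-record FASTQ file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

The check on the remainder catches truncated files, which the shell one-liners silently round down.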

window size

If I have a fasta sequence and I want to trim the sequences based on a window size with Cys at the center, and, if the window size is less than the residue number, I want to fill it with some letter: sp|P39688|FYN_MOUSE Tyrosine-protein kinase Fyn…

Bioinformatics Algorithms In Perl, Front end, Cedric Notredame 2001

Perl Implementation of some basic sequence comparison algorithms. PLEASE NOTE: These algorithms are only meant to be used as pedagogic support. They are simplified versions of the real stuff and are not meant to be used for research purposes. Program Description Usage hello_World.pl Hello World… hello_world bubble_sort.pl sorts number in…

Adding repeats in a genome fasta at a particular location without messing up the annotations?

I want to add a bunch of expanded repeats to a genome fasta file, e.g. 100 ATTs at a particular location such as Chr1-1:2. How do I do that and at the same time update…

How to find sequence patterns in genome?

Hi, I want to find a sequence pattern in a genome. Let’s say I want to find the pattern (G4N(1-10))5, which translates to 4 guanines followed by 1 to 10 bases of either A or T or G or C and then this…
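
The pattern notation above maps fairly directly onto a regular expression. The Python sketch below reads the trailing 5 as "the unit repeats five times", which the truncated excerpt does not confirm; it searches one strand only, and the names are hypothetical.

```python
import re

# Four G's followed by 1-10 bases of A, C, G or T, with the whole unit repeated 5 times.
G4_PATTERN = re.compile(r"(?:G{4}[ACGT]{1,10}){5}")


def find_g4_runs(seq):
    """Return (start, end, match) tuples for non-overlapping hits, 0-based, end-exclusive."""
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq.upper())]
```

Two caveats: `finditer` reports non-overlapping matches only, and the final spacer is greedy, so a hit can extend up to 10 bases past the last G-run. Scanning the reverse complement as well would be needed to cover the opposite strand.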

find the desired AA sequence location in Protein fasta file

I am working with protein FASTA files. I want to locate a desired AA sequence in every clone in the protein FASTA file using Python. records=SeqIO.parse("protein.fasta", "fasta") #to extract protein sequences from FASTA file for record in records:…

PDBe Download Service for bulk download of data

The PDBe pages provide a number of tools and visualisations to support analysis and understanding of PDB structures, however sometimes it is still necessary for users to have access to the data files. We have now introduced a file download service, allowing users to easily download coordinates and related data…

Creating BLASTN database for viruses

Hello guys, I need to create a local BLASTN db containing the fasta RefSeq sequences of viruses. I would like to download the fasta sequences from www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus using the browser, but it takes too long and the connection fails (I am in China)…

Problems with consensus fasta

Hi everyone! I’m new to bioinformatics and also to informatics, so I’m struggling a bit trying to learn on my own. At the moment I’m having some difficulties trying to generate a fasta file from a bam file of a complete human genome. I’ve read on the internet that a…

Continue Reading Problems with consensus fasta

consensus fasta

Hi everyone! I’m new to bioinformatics and also to informatics, so I’m struggling a bit trying to learn on my own. At the moment I’m having some difficulties trying to generate a fasta file from a bam file of a complete human genome. I’ve read on the internet that a…

Continue Reading consensus fasta

FASTA file on a macbook air : bioinformatics

Hi, I’m having trouble converting a plain text file to a fasta file. I switched the option from rich text to plain text because even though it does save as fasta, it adds on words and letters, but… when I change .txt to .fasta, it automatically reverts to…

Continue Reading FASTA file on a macbook air : bioinformatics

linux – BWA Alignment in Linux command line

I am trying to align a sequence using the BWA tool. The idea is to obtain a sam file with the alignment. I used the reference genome of S.cerevisiae downloaded in a fasta file. First I indexed the file (bwa index S.cerevisiae.fasta). After that I proceeded to align the reference…

Continue Reading linux – BWA Alignment in Linux command line

Error when using featurecounts

I am doing some RNA analysis and am having issues trying to generate count data. I mapped my reads to a reference genome fasta file (GenBank fasta file from NCBI) using bbmap, with .sam files as the output. I am now trying to use featurecounts to generate count data but…

Continue Reading Error when using featurecounts

STAR index generation for bacterial genome

Hi, I’m trying to analyze RNA-Seq data for a bacterium, Mycobacterium tuberculosis. I used the FASTA and GTF files from NCBI to create the index, and set --genomeSAindexNbases to 8 based on this previous post. The bash script I used is:…

Continue Reading STAR index generation for bacterial genome

How to extract specific samples (by ID) from Fasta file to new fasta file in R

I have a question concerning the extraction of sequences from a multi-FASTA file with sequence headers. I have been playing around and looking all over the internet to find a solution…

Continue Reading How to extract specific samples (by ID) from Fasta file to new fasta file in R

Genome assembly

Hello, this is probably a very basic question, yet I struggle to find an answer to it. My lab ordered a whole genome sequence from a commercial firm, and not long ago we received from them a few .fasta files with many short sequences in them. As I…

Continue Reading Genome assembly

STAR Genome indexing (Homo_sapiens_assembly38.fasta vs. GRCh38.primary_assembly.genome.fa)

I have a query regarding STAR alignment. I used the following commands to generate the genome index (Homo_sapiens_assembly38.fasta): STAR --runMode genomeGenerate --genomeDir /home/bsh/BC_MCFcellLine_WTS/result/STAR_indexing/ --genomeFastaFiles /data1/database/ftp.broadinstitute.org/bundle/hg38_210610_download/Homo_sapiens_assembly38.fasta --sjdbGTFfile /home/bsh/BC_MCFcellLine_WTS/gencode.v27.annotation.gtf I used the following commands for mapping, and the bam file was successfully generated: STAR --runThreadN 4 --outFilterType BySJout --outFilterMismatchNmax 999 --outFilterMultimapNmax 10…

Continue Reading STAR Genome indexing (Homo_sapiens_assembly38.fasta vs. GRCh38.primary_assembly.genome.fa)

biopython – Updating the GFF3 + Fasta to GenBank code

I’m trying to convert gff3 and fasta into a gbk file for usage in Mauve. I’ve found a solution but the code is outdated: """Convert a GFF and associated FASTA file into GenBank format. Usage: gff_to_genbank.py <GFF annotation file> <FASTA sequence file> """ import sys import os from Bio import…

Continue Reading biopython – Updating the GFF3 + Fasta to GenBank code

help with filtering sequences to make a phylogenetic tree

Hi everyone, I have a fasta with around 400 sequences, and I have to make a phylogenetic tree. Before that I have to eliminate the duplicate sequences. I was thinking of doing it manually with an alignment and a distance…

Continue Reading help with filtering sequences to make a phylogenetic tree
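If "duplicate" in the excerpt above means sequences that are identical letter for letter, they can be filtered without an alignment. The sketch below makes exactly that assumption (near-identical sequences are kept), works on already parsed (header, sequence) pairs, and uses a hypothetical function name and toy data.

```python
def drop_duplicate_sequences(records):
    """Keep the first record for each distinct sequence (case-insensitive).

    `records` is an iterable of (header, sequence) pairs; input order is kept.
    """
    seen = set()
    unique = []
    for header, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, sequence))
    return unique


if __name__ == "__main__":
    # Hypothetical toy records standing in for the ~400 sequences.
    toy = [("s1", "ACGT"), ("s2", "acgt"), ("s3", "ACGA")]
    for header, sequence in drop_duplicate_sequences(toy):
        print(">" + header)
        print(sequence)
```

Keeping the first header seen for each sequence is an arbitrary choice; a variant could instead collect all headers that share a sequence.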

bash if statement skipping python command

Hi all, I’m writing a new script where I want to iterate over fasta files based on an array and, if a fasta file is present, make a directory, move the fasta to the directory, align the fasta with mafft to then…

Continue Reading bash if statement skipping python command

Has anyone here worked with CNCI before? EXHAUSTED

Has anyone here worked with CNCI before? I’m just about exhausted trying to figure out what I’m doing wrong. So, I tried the following: python CNCI.py candidate_lncs.gtf -g -o test -m ve -p 16 -d ./dbase GRCm38.primary_assembly.genome.fa and I received the…

Continue Reading Has anyone here worked with CNCI before? EXHAUSTED

Consensus sequence for phased variant calls

I’ve got paired-end sequencing data from a ~500 bp amplicon. I’ve aligned the data and called variants using gatk to phase the variants, as follows. The phasing information is now under the PGT tag. gatk HaplotypeCaller -R $REF -I "$BAM" -O "$DIR"/variants/${SN}_HaplotypeCallerPGT.vcf…

Continue Reading Consensus sequence for phased variant calls

Low Mapping/Alignment rates with bowtie2 in mouse reduced genome

Hello, I want to perform mapping in two ways. One way is to run bowtie2 on my mouse data against the mm10 reference, and in this case everything works correctly. The other way is to map my…

Continue Reading Low Mapping/Alignment rates with bowtie2 in mouse reduced genome

AlphaFold2 | DGX GPU Cluster

AlphaFold2 from DeepMind has been released as an open source application. At UNC Research Computing Center, we are able to run AlphaFold2 on our machines to provide a protein 3D structure from a chain of amino acids. Following the steps below, we will be able to invoke AlphaFold2 on the Longleaf cluster….

Continue Reading AlphaFold2 | DGX GPU Cluster

How to replace/fill “Ns” in fasta with reference file having same coordinates

Dear community, I hope you are doing great. As asked in the title, please advise whether there is any way to fill or replace N or Ns in a fasta file with the help of a reference file. For example Fasta…

Continue Reading How to replace/fill “Ns” in fasta with reference file having same coordinates
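When the query and the reference really do share coordinates, as the title above states, filling the gaps can be done position by position. The sketch below assumes two sequences of equal length that are already aligned one-to-one. The function name `fill_ns` and the toy strings are illustrative only.

```python
def fill_ns(query, reference):
    """Replace every N (or n) in `query` with the reference base at that position."""
    if len(query) != len(reference):
        raise ValueError("query and reference must have the same length")
    return "".join(
        ref_base if base in ("N", "n") else base
        for base, ref_base in zip(query, reference)
    )


if __name__ == "__main__":
    # Hypothetical toy sequences with identical coordinates.
    print(fill_ns("ACNNT", "ACGTT"))
```

If the two files differ in length or contain indels relative to each other, this positional approach no longer applies and an alignment would be needed first.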

How to assess structural variation in your genome, and identify jumping transposons

Prerequisites
Data: an annotated genome, long reads, repeat annotation
Software: minimap2, samtools, bedtools (for comparisons only), tabix (for visualization only)
Installation: /work/gif/remkv6/USDA/04_TEJumper conda create -n svim_env --channel bioconda svim source activate svim_env
Map your long reads to your genome with minimap. My directory locale…

Continue Reading How to assess structural variation in your genome, and identify jumping transposons

how to get core genes FASTA from a mammal species? : bioinformatics

Hello, I have been asked to build a database including a set of sea lion species (Otaria flavescens) “core genes”. What does “core genes” actually mean? I wanted to build the database using the complete genome of that species, but the complete genome doesn’t exist and there are only FASTAs for some…

Continue Reading how to get core genes FASTA from a mammal species? : bioinformatics

Can’t get all coding sequences from list of protein IDs [Entrez Direct]

I have a list of protein IDs in number format (e.g. 25121878) that I want to retrieve the coding sequences for. This E-direct command is working for some, but for others it’s giving me ‘NO RESULT’: efetch…

Continue Reading Can’t get all coding sequences from list of protein IDs [Entrez Direct]

Replace fasta ID with value from TSV file (sed with special characters)

Command: sed -i "s/^>.*$/>$fastaid/g" output.fasta The desired output is to replace the entire fasta ID with everything stored in the variable $fastaid. The problem is that $fastaid looks like 12456789.AB.25/12/21, and it throws errors due to the special…

Continue Reading Replace fasta ID with value from TSV file (sed with special characters)
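The ID quoted in the excerpt contains slashes, and `/` is also the delimiter in the quoted sed expression, which is a plausible source of the errors. One way to sidestep delimiter and escaping issues altogether is to treat the ID as a literal string. The Python sketch below mirrors what the sed command does (every header line gets the same new ID). The function name `replace_headers` and the toy lines are assumptions.

```python
def replace_headers(lines, new_id):
    """Return FASTA lines with every header replaced by '>' + new_id.

    The ID is used as a literal string, so characters such as '/' or '.'
    need no escaping. Lines are expected without trailing newlines.
    """
    return [">" + new_id if line.startswith(">") else line for line in lines]


if __name__ == "__main__":
    # Hypothetical toy record; the ID format is the one quoted in the excerpt.
    toy = [">old_name some description", "ACGTACGT"]
    for line in replace_headers(toy, "12456789.AB.25/12/21"):
        print(line)
```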

how to replace fasta header

Hi! I have two files, ba.fasta and ba.header. ba.fasta contains >bas12 MDNAVGYH.. >bas13 MLSRTEQR.. and ba.header contains >seq1 MDNAVGYH.. >seq2 MLSRTEQR.. So basically I want to replace the headers of ba.fasta with those of ba.header while keeping the sequences. Is there an easy command for this, please? Tnks in…

Continue Reading how to replace fasta header

Trouble indexing a .vcf.gz file

Hello everyone, I am trying to index a .vcf.gz file in order to get a fasta consensus with bcftools. This is the simple command I give: tabix myFile.vcf.gz and I get the following error: [E::get_intv] failed to parse TBX_VCF, was wrong -p [type]…

Continue Reading Trouble indexing a .vcf.gz file

Snakemake: MissingInputException

Hello, I am trying to create a simple Snakemake workflow and I am having some issues. My file looks like this: ARCHIVE_FILE = 'output.tar.gz' **a single output file** OUTPUT_FILE = 'output/{species}.out' **a single input file** INPUT_FILE = 'proteins/{species}.fasta' **Build the list of input files.** INP =…

Continue Reading Snakemake: MissingInputException

Compare two protein FASTA files and give a excel that show header with the same sequence

Dear all, I have two files, file1.fasta and file2.fasta. Both contain some identical sequences but with different headers. I want to know the correspondence between the headers of the two fasta files and may…

Continue Reading Compare two protein FASTA files and give a excel that show header with the same sequence
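The header correspondence asked about above can be built by indexing one file by sequence and looking up the other. The sketch below assumes the records are already parsed into (header, sequence) pairs and that "same sequence" means an exact, case-insensitive match. It writes CSV text, which a spreadsheet program can open. All names and the toy data are hypothetical.

```python
import csv
import io


def match_headers_by_sequence(records_a, records_b):
    """Return (header_a, header_b) pairs whose sequences are identical."""
    by_sequence = {}
    for header, sequence in records_b:
        by_sequence.setdefault(sequence.upper(), []).append(header)
    pairs = []
    for header, sequence in records_a:
        for other in by_sequence.get(sequence.upper(), []):
            pairs.append((header, other))
    return pairs


def pairs_to_csv(pairs):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file1_header", "file2_header"])
    writer.writerows(pairs)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical toy records standing in for file1.fasta and file2.fasta.
    file1 = [("a1", "MKT"), ("a2", "AAA")]
    file2 = [("b1", "MKT"), ("b2", "CCC")]
    print(pairs_to_csv(match_headers_by_sequence(file1, file2)), end="")
```

A sequence that appears several times in the second file produces one row per matching header, so many-to-many correspondences are preserved.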

Consensus sequence calling with normalisation of indels

I’m following the workflow suggested by samtools here to produce a fasta with the consensus sequence: samtools.github.io/bcftools/howtos/consensus-sequence.html The workflow goes like this: # call variants bcftools mpileup -Ou -f reference.fa alignments.bam | bcftools call -mv -Oz -o calls.vcf.gz bcftools index calls.vcf.gz #…

Continue Reading Consensus sequence calling with normalisation of indels

Trimming only custom adapter sequences

If you’ve already made a custom FASTA file for your adapters, can you post it, or a portion of it? Another question: do you get any output at all from the command you supplied? Your command looks OK as far as I can tell. Perhaps it doesn’t like your supplied…

Continue Reading Trimming only custom adapter sequences

Benchmarking different approaches for Norovirus genome assembly in metagenome samples | BMC Genomics

Assembly Raw data obtained from eight human Norovirus samples passed FASTQC (v0.11.5, Babraham Bioinformatics) quality filters regarding the parameters per base sequence quality, per sequence average quality, N content and adapter sequences after the trimming steps described in the methods section. Mean read length was 100 bp as expected from library…

Continue Reading Benchmarking different approaches for Norovirus genome assembly in metagenome samples | BMC Genomics

R: Predicting G quadruplexes

gquad {gquad}, R Documentation. Description: This function predicts G quadruplexes in ‘x’ (nucleotide sequence(s)). The nucleotide sequence can be provided in raw or fasta format or as GenBank accession number(s). An internet connection is needed to reach the GenBank database if accession number(s) is given as…

Continue Reading R: Predicting G quadruplexes

contigs.fasta – Genome – Assembly

##Taxonomic-Update-Statistics-START## This Genome (query)::GCA_901542325.1 Current Name::Bacillus altitudinis Previous Name::Bacillus xiamenensis Date Updated::2020-07-31 Analysis Type::Average Nucleotide Identity (ANI) Analysis 1 (A1)::Query vs. TYPE genome for current name A1 Genome (subject)::GCA_000691145.1 A1 Name::Bacillus altitudinis A1 ANI::98.48% A1 Query Coverage::92.69% A1 Subject Coverage::94.38% Analysis 2 (A2)::Query vs. TYPE genome for previous name A2…

Continue Reading contigs.fasta – Genome – Assembly