Tag: CHR

where to obtain hg19 reference transcriptome in gencode

where to obtain hg19 reference transcriptome in gencode 1 Hi, I want to use salmon to generate an index and later use it again to get a table of counts that I can use for differential expression analysis with the hg19 version. For the hg38 reference genome I have had…

Continue Reading where to obtain hg19 reference transcriptome in gencode

impute tool /imputing GTs _ generated imputed file has unexpected no. of cols

Hello, I’m trying to use impute tool to impute more snps using ref panels with my GT data, i followed impute2 tutorial, its example/toy data works fine, then I used the faster impute4 and impute5 tools for mydata as the run gets killed on impute2 with my data. anyways, the…

Continue Reading impute tool /imputing GTs _ generated imputed file has unexpected no. of cols

r – Error adding significance levels in ggplot boxplot after changing the stats table

I am trying to add significance levels to a geom_boxplot by using stats_pvalue_manual. Since I have a 12-level factor I ordered the data by mean size. I only keep the significant comparisons that are closest to each other. If I use the stats table everything works (but looks horrible). Here…

Continue Reading r – Error adding significance levels in ggplot boxplot after changing the stats table

Phylogenomic analysis, cryptic species discovery, and DNA barcoding of the genus Cibotium in China based on plastome data

Introduction Traditional Chinese herbal medicine plays an indispensable role in the treatment of multiple diseases in China and other developing countries (Newman et al., 2008). Apart from the traditional usage, many medicinal plants, such as Artemisia annua L. (artemisinin, Tu, 2016), Huperzia javanica (Sw.) C. Y. Yang (Huperzine A, Zangara, 2003;…

Continue Reading Phylogenomic analysis, cryptic species discovery, and DNA barcoding of the genus Cibotium in China based on plastome data

How to calculate TPM from featureCounts output

How to calculate TPM from featureCounts output 0 I would like to find the TPM counts for the GSE102073 study. When i downloaded the raw data from GEO, the raw data are featureCounts output. First part of the file: # Program:featureCounts v1.4.3-p1; Command:”/data/NYGC/Software/Subread/subread-1.4.3-p1-Linux-x86_64/bin/featureCounts” “-s” “2” “-a” “/data/NYGC/Resources/ENCODE/Gencode/gencode.v18.annotation.gtf” “-o” “/data/analysis/LevineD/Project_LEV_01204_RNA_2014-01-30/Sample_JB4853/featureCounts/Sample_JB4853_counts.txt” “/data/analysis/LevineD/Project_LEV_01204_RNA_2014-01-30/Sample_JB4853/STAR_alignment/Sample_JB4853_Aligned.out.WithReadGroup.sorted.bam”…

Continue Reading How to calculate TPM from featureCounts output

r – ggplot – reorder_within – Order month

I am plotting a bar graph in RSTUDIO. However, the axes x show in different month order. Using the command reoder_within, doesn’t work for the purpose. Follow below the ME. ggplot(de, aes(fill=Cidade, y = Leitura , x = Mes ))+geom_bar(position=’dodge’, stat=”identity”) Generate the follow plot: Plot My purpose is modify…

Continue Reading r – ggplot – reorder_within – Order month

How to extract haplotype data from phased bcf files

How to extract haplotype data from phased bcf files 1 Hello, I have filtered/processed phased bcf files from wgs. I would like to extract the haplotype data per sample, so that I have a tab delim file which looks like this: Sample Chr Pos hap1 hap2 AW23 chr1 1234 A…

Continue Reading How to extract haplotype data from phased bcf files

How to determine the exact version of hg38 if I have only the FASTA file

How to determine the exact version of hg38 if I have only the FASTA file 1 I have a FASTA file which contains hg38 assembly. It contains the primary contigs, alt contigs, decoy, HLA, mito. How do I determine the exact version of hg38 based on the FASTA? Here some…

Continue Reading How to determine the exact version of hg38 if I have only the FASTA file

What Are The Most Common Stupid Mistakes In Bioinformatics?

Forum:What Are The Most Common Stupid Mistakes In Bioinformatics? 78 While I of course never have stupid mistakes…ahem…I have many “friends” who: forget to check both strands generate random genomic sites without avoiding masked (NNN) gaps confuse genome freezes and even species but I’m sure there are some other very…

Continue Reading What Are The Most Common Stupid Mistakes In Bioinformatics?

A score file with chr & positions instead of SNP IDs

Hello, I am calculating PRSs using the command:  plink2 \–bfile plink_file \–score score_file 1 2 3 header cols=+scoresums,-scoreavgs \–out PRS_sum Currently,  the columns in my score_file are: SNP IDs, effect allele, beta weights. I was wondering if possible to give instead of SNP IDs in the first column, chrs and…

Continue Reading A score file with chr & positions instead of SNP IDs

problems with MAF for MutSigCV (vcf2maf)

I am trying to run MutSIgCV and got stuck with this error: MutSigCV allsamples.md.tc.ir.br.pr.ug.dbsnp.vep.maf \ “$anno”exome_full192.coverage.txt \ “$anno”gene.covariates.txt \ my_results \ “$anno”mutation_type_dictionary_file.txt \ “$anno”chr_files_hg19 ====================================== MutSigCV v1.4 (c) Mike Lawrence and Gaddy Getz Broad Institute of MIT and Harvard ====================================== MutSigCV: PREPROCESS ——————– Loading mutation_file… Error using MutSigCV>MutSig_preprocess (line 246)…

Continue Reading problems with MAF for MutSigCV (vcf2maf)

DanMAC5: a browser of aggregated sequence variants from 8,671 whole genome sequenced Danish individuals | BMC Genomic Data

Demographics Data from three studies were included: Dan-NICAD: 1,649 individuals with symptoms of obstructive coronary artery disease, predominantly chest pain, undergoing coronary computed tomography angiography. In total, 52% were females, the mean age was 57 years (+/- 9 SD), median coronary artery calcium score were 0 [0–82] and 24% of…

Continue Reading DanMAC5: a browser of aggregated sequence variants from 8,671 whole genome sequenced Danish individuals | BMC Genomic Data

ExomeDepth error in getBamCounts when adding a fasta reference

ExomeDepth error in getBamCounts when adding a fasta reference 1 Whenever I try to add a reference fasta file for computing the GC content in the GetBamCounts function: my.countsV6 <- getBamCounts(bed.frame =AgilentV6, bam.files = BAMFiles, include.chr = TRUE, referenceFasta = “data/hg19.fa” ) I get an error like this: Reference fasta…

Continue Reading ExomeDepth error in getBamCounts when adding a fasta reference

How to plot combined bar graph in R

How to plot combined bar graph in R 1 After running the following function count_table <- table(Sample.@meta.data$seurat_clusters, Sample@meta.data$orig.ident, Sample@meta.data$singlr_labels) View(count_table) str(count_table) ‘table’ int [1:15, 1:2, 1:20] 0 0 0 0 0 0 2 0 273 0 … – attr(*, “dimnames”)=List of 3 ..$ : chr [1:15] “0” “1” “2” “3”…

Continue Reading How to plot combined bar graph in R

PhaeoEpiView: an epigenome browser of the newly assembled genome of the model diatom Phaeodactylum tricornutum

Despite being an established model, the genome and annotation of P. tricornutum are not concordant and the existing epigenetic resources generated so far lack a well-defined framework for accurate and user friendly utilization. To address this limitation, we sought to establish a coherent resource rendering the multiple genomic and epigenomic…

Continue Reading PhaeoEpiView: an epigenome browser of the newly assembled genome of the model diatom Phaeodactylum tricornutum

Allele frequency visualization

Allele frequency visualization 2 Hello dears all, Actually I have a .frq file (allele frequency file) which It is generated from .vcf file through VCFtools and it is included by five populations (The vcf file is generated by Stacks pipeline). But the point is, I don’t know how can I…

Continue Reading Allele frequency visualization

How to add conditional annotation from other column of R dataframe in ggplot2

I’m trying to make a plot from a dataframe of over 300k rows where the peaks and valleys will be annotated from another column rather than the x and y. How can I do that..!! Dataframe : chr start end.x StoZ.x Hscore end.y StoZ.y Tier Gene 1 chr1 1 10000…

Continue Reading How to add conditional annotation from other column of R dataframe in ggplot2

r – How to customize y-axis intervals and marks in ggplot with limited intervals and 40 unit increments?

I am trying to make a ggplot, but I am not sure how to change the y-axis intervals or show specific intervals on the y-axis. How do you make the y-axis of a ggplot go up by 40, with the min being 40 and the max being 200, but only…

Continue Reading r – How to customize y-axis intervals and marks in ggplot with limited intervals and 40 unit increments?

count protein-coding genes per contig

count protein-coding genes per contig 1 If you have an annotation file, as for example, the following GTF from human: 1 havana gene 11869 14409 . + . gene_id “ENSG00000223972”; gene_version “5”; gene_name “DDX11L1”; gene_source “havana”; gene_biotype “transcribed_unprocessed_pseudogene”; 1 havana transcript 11869 14409 . + . gene_id “ENSG00000223972”; gene_version “5”;…

Continue Reading count protein-coding genes per contig

Mbedtls PK – Loading the private and public key into the same context – Generic

Hello, Is there any hack that makes it possible to load a private key into a pk context where you’ve only loaded a public key (or vise versa)? I can see, that it is possible to instantiate a single context with both, but I cannot parse both PEM files into…

Continue Reading Mbedtls PK – Loading the private and public key into the same context – Generic

Edit and re-head BAM file

Edit and re-head BAM file 0 Hi there I have a BAM file which needs to be edited and re-headed. Now, I’m aware of how to do so the problem is that for some reason the sed command I’m using does not catch the sequence I have to remove… Below,…

Continue Reading Edit and re-head BAM file

Removing indels in VCF file

Hello, I am trying to do something very simple, but running into confusing behaviour. I have a VCF file of multiple samples and want to remove all indels so that I can generate sequences with identical coordinates with bcftools consensus. I removed indels by specifying bcftools view –include ‘TYPE=”snp” ||…

Continue Reading Removing indels in VCF file

PLINK: Whole genome data analysis toolset – dbSNP

1. Introduction 2. Basic information 3. Downloads and general notes 4. Command reference table 5. Basic usage/data formats 6. Data management 7. Summary stats 8. Inclusion trim 9. Population stratification 10. IBS/IBD estimation 11. Bond 12. Family-based community 13. Permutation process 14. LD calculations 15. Multimarker tests 16. Conditional haplotype…

Continue Reading PLINK: Whole genome data analysis toolset – dbSNP

–update-name not working – possibly because of comma delim?

Hi there, I have two separate sets of gwas data (‘A’ and ‘B’) with a file per chr.  I am running a targeted gene/pathway analysis so I have filtered out snps per each gene location (209 snps = 209 filtered gene-specific files). For dataset A I ran the following commands:…

Continue Reading –update-name not working – possibly because of comma delim?

KING kinship inference error

KING kinship inference error 2 Hello, I’m trying to use KING for estimating kinship between individuals. I have a vcf file that resulted from a de novo assembly of ddrad seq data. First, I tried to convert my vcf file to .bed file using Plink2.0, but it failed saying that…

Continue Reading KING kinship inference error

Genetic immune escape landscape in primary and metastatic cancer

Inference of HLA-I tumor status with LILAC Inference of the correct HLA-I tumor status is fundamental to identifying GIE alterations (Fig. 1a), to estimate the neoepitope repertoire and burden and to predict the response to immune checkpoint inhibitors14,15 (ICIs). We have developed LILAC, a framework that performs HLA-I typing for…

Continue Reading Genetic immune escape landscape in primary and metastatic cancer

A draft human pangenome reference

Sample selection We identified parent–child trios from the 1KG in which the child cell line banked within the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical Research was listed as having zero expansions and two or fewer passages, and rank-ordered representative individuals as follows. Loci…

Continue Reading A draft human pangenome reference

Finding TF motifs enriched in series of ATAC-seq peaks (using fimo?)

Finding TF motifs enriched in series of ATAC-seq peaks (using fimo?) 1 Hi all, I have a series of peaks located in a .txt file (chr / start / end) and would like to know if there are tf motifs enriched in each of the individual peaks. For eg, I…

Continue Reading Finding TF motifs enriched in series of ATAC-seq peaks (using fimo?)

How can I create more accurate admixture plot like this?

How can I create more accurate admixture plot like this? 0 I made simple vcf file processing as plink –vcf Asp_multicall2.vcf.gz –double-id –allow-extra-chr \ –set-missing-var-ids @:# \ –set-missing-var-ids @:# \ for K in 1 2 3 4 5 do admixture –cv plink.bed $K | tee log${K}.out done As you can…

Continue Reading How can I create more accurate admixture plot like this?

Annotate vcf file using GNOMAD

Hi, I use a loop for that. Something like this to inspire you: # Enter folder where gnomAD data are here: gnomAD=”/path/to/gnomAD/database/release/3.1.2/gnomad.genomes.v3.1.2.sites.” # Enter the folder where your results are and will be annotated further cd /path/to/your/results/folder/ # Enter the name of the final results’ file from SnpSift ann=”results.ann.gnomAD.genomes.v3.1.2.vcf” #…

Continue Reading Annotate vcf file using GNOMAD

Will this method for genomic population structure analysis with VCF file help good for my work?

Will this method for genomic population structure analysis with VCF file help good for my work? 0 I was finding method for handling multi-calling VCF file and convert format for analyzing population genomics for fungi, which means dividing species or strains by genotype difference. So I found this site’s instruction….

Continue Reading Will this method for genomic population structure analysis with VCF file help good for my work?

Jobs lost as assets of biotech company are sold off

Administrators confirmed 30 jobs have been lost at Leeds-headquartered biotech business, 4D Pharma. Interpath, which has been handling the administration of the group, confirmed the intellectual property of 4D Pharma, including its patent library, pipeline assets, and bioinformatics platform technology, was sold to CJ Bioscience Inc (CJ). CJ is an independent…

Continue Reading Jobs lost as assets of biotech company are sold off

Introduction to Computational Evolutionary Biology

R is a very flexible programming language, and it allows developers to create their own data structures (called classes) for their packages. Over the years, some packages have become so popular that the classes they use to store data are now used the “standard” representations for particular types of data….

Continue Reading Introduction to Computational Evolutionary Biology

Technical Enzymes Market Revenue Analysis, Company Revenue Share, Global Forecast till 2032

Reports And Data Technical Enzymes Market report also sheds light on supply chains and the changes in the trends of the upstream raw materials and downstream distributors. NEW YORK, NY, UNITED STATES, May 8, 2023 /EINPresswire.com/ — The global market size for technical enzymes was USD 12.1 billion in 2022…

Continue Reading Technical Enzymes Market Revenue Analysis, Company Revenue Share, Global Forecast till 2032

r – unable to plot sd bars in ggplot

I was trying to insert the sd bars in geom_point in a data of decomposition rates of different time observations and species using the code below. My data have two factors time and specie and the varable “rem”. I used the aggregate function to group species and time and I…

Continue Reading r – unable to plot sd bars in ggplot

How to read large size tab separated multiple files in R exe using SLURM?

I’m attempting to write a function that reads several tab-separated files (thousands) in SLURM via sbatch using R exe code, does some tasks, and then outputs tab-separated files. The program works effectively with files smaller than 2 GB but fails to execute when the file size exceeds 2 GB. However,…

Continue Reading How to read large size tab separated multiple files in R exe using SLURM?

High number of duplicates and low percentage properly paired

High number of duplicates and low percentage properly paired 0 I have some paired end sequencing data that I have trimmed using cutadapt. It was sequenced on an illumina novaseq 6000 and is low coverage RADseq data (2-3x). My cutadapt script used forward and reverse adapters from illumina : cutadapt…

Continue Reading High number of duplicates and low percentage properly paired

Tophat: Genome indexing error

Hi everyone, I have 2 samples, paired end RNA sequence from illumina. I have done the QC and trimming. now after triiming with trimgalore I have A_R1.val_fq A_R2.val_fq B_R1.val_fq B_R2.val_fq Now I performed tophat for genome mapping. I ran this command tophat -p4 -G GRCh38_latest_genomic.fna -o tophat_result bowtie_index A_R1_val.fq but…

Continue Reading Tophat: Genome indexing error

Obtaining TPM values from STAR alignment and counts with featurecounts using R’s tidyverse syntax (dplyr and tidyr)

Hello! I have a table of counts that I got by aligning rna seq samples with STAR and using featureCounts, and my goal is to get TPM values for each gene of the table. As a first step, I imported my table into R and modified it a bit to…

Continue Reading Obtaining TPM values from STAR alignment and counts with featurecounts using R’s tidyverse syntax (dplyr and tidyr)

Low SNP Overlap with Michigan 1KG and TopMed reference panel

I extracted three samples (HG02024 – HG02026) from the 1000 Genomes Project’s 30x alignment files, employing the Genome Analysis Toolkit (GATK) best practice pipeline. This process involved performing base quality score recalibration, identifying and removing duplicate reads, utilizing the HaplotypeCaller to generate a genomic VCF (gVCF) file, and calling variants…

Continue Reading Low SNP Overlap with Michigan 1KG and TopMed reference panel

r – Add label at the top af a barplote using a diferent column – ggplot2

I have a dataframe in R #creating a dataset codigo <- paste0(“COD”, sprintf(“%03d”, 1:4)) vgene <- paste0(“VG”, LETTERS[1:8]) set.seed(123) codigo <- sample(codigos, 40, replace = TRUE) vgene <- sample(vgene, 40, replace = TRUE) datos_vgene <- data.frame(codigo, vgene) codigo vgene 1 COD001 VGG 2 COD002 VGG 3 COD003 VGC 4 COD004…

Continue Reading r – Add label at the top af a barplote using a diferent column – ggplot2

r – How to add multiple simulation plot lines to plot generated with ggplot?

A few changes to make this happen: Fix a typo, “Confidence Interval” -> “…vals” I recommend named values in scale_color_manual instead of values and labels; it should be the same, I’ve found myself in do-loops in the past where the resolution was to do this. Pivot/reshape the simPaths_df data so…

Continue Reading r – How to add multiple simulation plot lines to plot generated with ggplot?

r – ggplot x-axis labels with all x values, multiple lines on geom_line

I’m plotting ggplot with geom_line. The x-axis is the YEAR, and y-axis is a continuous variable N. How can I ggplot all and of the years on the x-axis? Applying factor(YEAR) on the aesthetic doesn’t seems to work because I’m splitting by type on linetype = STATUS Also the usual…

Continue Reading r – ggplot x-axis labels with all x values, multiple lines on geom_line

Can’t convert .ped .map into .bim .fam .bed. Why?

(PLINK): Can’t convert .ped .map into .bim .fam .bed. Why? 0 Hi. I have a .ped and .map that I have previously created from a .vcf. Note that this VCF was NOT created by aligning reads to a reference genome, but rather by de novo (using the “STACKS” pipeline). I…

Continue Reading Can’t convert .ped .map into .bim .fam .bed. Why?

Bioconductor – Bioconductor 3.17 Released

Home Bioconductor 3.17 Released April 26, 2023 Bioconductors: We are pleased to announce Bioconductor 3.17, consisting of 2230 software packages, 419 experiment data packages, 912 annotation packages, 27 workflows and 3 books. There are 79 new software packages, 7 new data experiment packages, no new annotation packages, 2 new workflows,…

Continue Reading Bioconductor – Bioconductor 3.17 Released

error in Genome Mepping by BWA tools in Linux

$ gmap_build -D:\btau8refflat.gtf Unknown option: D:btau8refflat.gtf -k flag not specified, so building main hash table with default 15-mers -j flag not specified, so building regional hash tables with default 6-mers gmap_build: Builds a gmap database for a genome to be used by GMAP or GSNAP. Part of GMAP package, version…

Continue Reading error in Genome Mepping by BWA tools in Linux

Potential segfault bug in featureCounts using long read data

Hi, I think I might have found a bug in featureCounts from Rsubread (v2.12.3). I am trying to find reads overlapping exon junctions from a personalised reference, using Nanopore long read BAMs. I am afraid I cannot share fully reproducible code as I am using my own reference, but this…

Continue Reading Potential segfault bug in featureCounts using long read data

Merge the same gene and get percentage of gene from the output file of mosdepth tool

Merge the same gene and get percentage of gene from the output file of mosdepth tool 0 I have get this output by using mosdepth. It give me coverage at 1X, 5X,10X,15X ,20X , 30X. In some regions genes are repeated as you can see in the example. Can be…

Continue Reading Merge the same gene and get percentage of gene from the output file of mosdepth tool

Should you specify “-p” for paired end reads using featureCounts?

Should you specify “-p” for paired end reads using featureCounts? 0 I’m trying to understand whether or not I should be using the -p flag for featureCounts. Here’s the explanation: -p If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads;…

Continue Reading Should you specify “-p” for paired end reads using featureCounts?

Gnu Parallel – Parallelize Serial Command Line Programs Without Changing Them

Article describing tool (for citations): O. Tange (2011): GNU Parallel – The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47. Author’s website for obtaining code: www.gnu.org/software/parallel/ All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many…

Continue Reading Gnu Parallel – Parallelize Serial Command Line Programs Without Changing Them

Illumina HumanHT-12 V3.0 expression beadchip reading data

Edit November 28, 2020: Further reproducible code: A: GPL6883_HumanRef-8_V3_0_R0_11282963_A (illumina expression beadchip) — Most Illumina ‘chip’ studies that I have seen on GEO do not contain the raw data IDAT files. You can start with the tab-delimited file, but will also require the annotation file (contained in the *_RAW.tar file),…

Continue Reading Illumina HumanHT-12 V3.0 expression beadchip reading data

Highly variable hearing loss due to POU4F3 (c.37del) is revealed by longitudinal, frequency specific analyses

ADSNHL is caused by a novel variant in POU4F3 (DFNA15) Targeted Sanger sequencing for known DFNA/B pathogenic variants in the NL population yielded no hits. Genome-wide genotyping and linkage analysis on one branch of the North American kindred yielded three equally positive “logarithm of the odds” scores (LOD = 2.08) suggestive of…

Continue Reading Highly variable hearing loss due to POU4F3 (c.37del) is revealed by longitudinal, frequency specific analyses

plink produces does pruning according to log file but prune.in is full of dots

Hi, I’m running a big SNP database (160GB vcf file) I pruned with plink; plink –vcf SNPs_clean1.vcf –allow-extra-chr –indep-pairwise 50 10 0.1 –out SNP_50_10_01 It starts running producing the temporary files and the output files prune.in and prune.out). In the log file it shows having filtered the variants, 49595998 of…

Continue Reading plink produces does pruning according to log file but prune.in is full of dots

What is the best way to retrieve gene name from variant mutation?

What is the best way to retrieve gene name from variant mutation? 0 Hi all! I was wondering if any of you may suggest a clean a fast way to retrieve gene name from a list of variants that look like this: CHR_POS_REF_ALT > chr1_33333_A_C Of course I can generate…

Continue Reading What is the best way to retrieve gene name from variant mutation?

Manifest Reference Strand

Manifest Reference Strand 0 Hi I am trying to generate a vcf from SNP array data (genomestudio). So far I can convert to plink format from genomestudio and create a vcf via plink2 commands: plink2 \ –pedmap gs-out-plink \ –make-pgen \ –sort-vars \ –merge-x \ –out test-1 then: plink2 \…

Continue Reading Manifest Reference Strand

chromosome identifier not found in FASTA file, prompting ‘skipping’ error, when chr identifier is there

Hi, I have a FASTA file that looks like this: >D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:6:1214:4434:71826 TGATCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG >D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2203:12900:37094 ACTTTGGTCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGG >D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:1:2213:1470:83635 Which I created through the code below. The .bam file in question is an aDNA file from a Neanderthal (cdna.eva.mpg.de/neandertal/Chagyrskaya/BAM/). samtools bam2fq chr22.rh.bam > chag.22.fq bioawk -c fastx ‘{print “>”$name”\n”$seq}’ chag.22.fq > chag.22.fa I also…

Continue Reading chromosome identifier not found in FASTA file, prompting ‘skipping’ error, when chr identifier is there

How to generate a .raw file from vcf in plink

How to generate a .raw file from vcf in plink 1 Hi all. I am trying to convert my vcf file to a readable format in R using plink. Specifically, I want a .raw file to follow up the workflow of a recent paper. I have tried: plink –vcf myfile.vcf.gz…

Continue Reading How to generate a .raw file from vcf in plink

Help in replicating LDSC heritability estimates

Hi, I am trying to replicate the heritability estimates based on the insomnia GWAS summary statistics using LDSC. However, I have encountered a problem as my estimates seem to be only about half of the original estimates listed in Table S1. Despite my efforts to locate the error, I have…

Continue Reading Help in replicating LDSC heritability estimates

adding transcripts ID to the REVEL score

adding transcripts ID to the REVEL score 0 I am annotating variants for clinical diagnosis. One of the data required by my team is REVEL. We work with specific transcripts so when clinicians do variant interpretation, they need to check if our preferred transcript is among the transcripts considered. REVEL…

Continue Reading adding transcripts ID to the REVEL score

Convert VCF files to input file format requiered by XP-CLR1.0 software

Convert VCF files to input file format requiered by XP-CLR1.0 software 0 I attempt to do a selective sweep analysis by using the software XP-CLR1.0. My genotypic data is VCF format so wonder if someone with experience on this package could recommend any tool to convert from VCF to the…

Continue Reading Convert VCF files to input file format requiered by XP-CLR1.0 software

divide coordinates into equal number of bins

First you need to convert to GRanges. gr <- paste0(coords$Chr,’:’,coords$Start,’-‘,coords$End) This should be built in to the standard GRanges package: web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/GenomicRanges/html/tile-methods.html gr <- GRanges( seqnames=Rle(c(“chr1”, “chr2”, “chr1”, “chr3”), c(1, 3, 2, 4)), ranges=IRanges(1:10, end=11), strand=Rle(strand(c(“-“, “+”, “*”, “+”, “-“)), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13)) # split every…

Continue Reading divide coordinates into equal number of bins

GATK g CNV code Aborts Before Running Main Code

GATK g CNV code Aborts Before Running Main Code – Bioinformatics Stack Exchange …

Continue Reading GATK g CNV code Aborts Before Running Main Code

ggplot2 – R: ggplot is giving a wrong and obscure graph

Your code to convert the %m-$d-%Y dates into Date format is not working the way you intended. dates <- c(“1-2-2018”, “1-3-2018”) for (i in 1:length(dates)) { dates[[i]] <- as.Date(dates[[i]], format = “%m-%d-%Y”) } str(dates) #chr [1:2] “17533” “17534” Compare to dates <- c(“1-2-2018”, “1-3-2018”) dates <- as.Date(dates, format = “%m-%d-%Y”)…

Continue Reading ggplot2 – R: ggplot is giving a wrong and obscure graph

Getting chromosome of unusual chromosome names e.g. ‘CHR_HSCHR8_8_CTG1’

Getting chromosome of unusual chromosome names e.g. ‘CHR_HSCHR8_8_CTG1’ 0 I made a biomaRt query: library(biomaRt) mart = useMart(‘ensembl’, dataset=”hsapiens_gene_ensembl”) genes = getBM(attributes = c(“chromosome_name”,”start_position”, “hgnc_symbol”, “uniprot_gn_symbol”, “uniprot_gn_id”), mart = mart, values = list(“protein_coding”,c(1:22))) Most of the chromosome_name values are regular numbers 1 to 22. However, some are unusual, such as…

Continue Reading Getting chromosome of unusual chromosome names e.g. ‘CHR_HSCHR8_8_CTG1’

rs3750846 RefSNP Report – dbSNP

ALFA Allele FrequencyThe ALFA project provide aggregate allele frequency from dbGaP. More information is available on the project page including descriptions, data access, and terms of use. Release Version: 20201027095038 Help Frequency tab displays a table of the reference and alternate allele frequencies reported by various studies and populations. Table lines,…

Continue Reading rs3750846 RefSNP Report – dbSNP

Reading BGEN file using Hail on a Spark cluster results in corrupted matrix table – Hail Query & hailctl

I have a BGEN file that I obtained from a PGEN file via PLINK2. The command I used is as follows: plink2 –pfile my_file –export bgen-1.2 bits=8 –out /some/path –output-chr chrM and I used hl.index_bgen() to create an idx2 index. Now when I have the BGEN file and the idx2…

Continue Reading Reading BGEN file using Hail on a Spark cluster results in corrupted matrix table – Hail Query & hailctl

Which ATAC-seq SAM/BAM fields are required for bigwig generation by deeptools bamCoverage?

Which ATAC-seq SAM/BAM fields are required for bigwig generation by deeptools bamCoverage? 0 Hi, I plan to convert SAM file into BED file using bedops sam2bed tool and perform “chr start and end coordinates” data points manipulation to break di-nucleosome and tri-nucleosome data points into mono-nucleosome data points, then convert…

Continue Reading Which ATAC-seq SAM/BAM fields are required for bigwig generation by deeptools bamCoverage?

Annotated dendogram (tree) from tbl_graph

Annotated dendogram (tree) from tbl_graph 0 @ee934b4b Last seen 15 hours ago France Hello, I have a problem because I have little experience with the tbl_graph format and how to use it well with ggtree (or other script to draw a tree). I have created a tbl_graph (from clustree) of…

Continue Reading Annotated dendogram (tree) from tbl_graph

10x Cellranger SN RNA-seq BAM file output looks off

I’m working with SN RNA-seq data that I processed using Cellranger v. 7.1.0 (which uses STAR for alignment with default settings). Using the samtools flagstat command to look at a BAM file output: 284890368 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0…

Continue Reading 10x Cellranger SN RNA-seq BAM file output looks off

Understanding how cohesin makes DNA loops in the human genome and its role in Cornelia de Lange syndrome

NIPBL KD affects cohesin-STAG1 and cohesin-STAG2 in opposite ways. a Asynchronously growing HeLa cells mock transfected (control) or transfected with siRNA against NIPBL (NIPBL KD) were analyzed by flow cytometry 72 h later. Contour plots of the indicated proteins in control (gray plots) and NIPBL KD cells (colored plots) were overlapped…

Continue Reading Understanding how cohesin makes DNA loops in the human genome and its role in Cornelia de Lange syndrome

Life Expectancy Data from Kaggle v1

The data-set related to life expectancy and health factors for 193 countries has been collected from the same WHO data repository website and its corresponding economic data was collected from United Nation website. It was collected from WHO and United Nations website with the help of Deeksha Russell and Duan…

Continue Reading Life Expectancy Data from Kaggle v1

SNP ID (rsID) to Chr no. and Position

SNP ID (rsID) to Chr no. and Position 0 Hi, I have SNP arrays with rs ID, log R and BAF values, but I don’t have the ID’s chr. number and position. The platform that were used are: GPL6433 Illumina HumanHap550 Genotyping BeadChip v1 GPL6434 Illumina HumanHap550 Genotyping BeadChip v3…

Continue Reading SNP ID (rsID) to Chr no. and Position

Michigan Imputation server failed job

Michigan Imputation server failed job 0 This is the first time I use Michigan Imputation Server. My imputation worked as it should on all chromosomes (1 to 22) apart from chromosome 9, where the job starts and then it fails and I don’t know why and how to fix this…

Continue Reading Michigan Imputation server failed job

how to get genomic range for hi-c datasets in hdf5 file

how to get genomic range for hi-c datasets in hdf5 file 0 Hello everyone, i have hi-c contact map in hdf5 file. I stored the data for chr9 in a variable. But it doesn’t have genomic range. Does anyone know how to add colnames and rownames with respect to genomic…

Continue Reading how to get genomic range for hi-c datasets in hdf5 file

Something wrong with my code concerning bsseq visualization

Here is my code: #### library(bsseq) library(bsseqData) ## —-showData—————————————————————– data(BS.cancer.ex) BS.cancer.ex <- updateObject(BS.cancer.ex) BS.cancer.ex pData(BS.cancer.ex) ## —-smooth,eval=FALSE——————————————————– # BS.cancer.ex.fit <- BSmooth( # BSseq = BS.cancer.ex, # BPPARAM = MulticoreParam(workers = 1), # verbose = TRUE) ## —-showDataFit————————————————————– data(BS.cancer.ex.fit) BS.cancer.ex.fit <- updateObject(BS.cancer.ex.fit) BS.cancer.ex.fit ## —-cpgNumbers————————————————————— ## The average coverage of CpGs…

Continue Reading Something wrong with my code concerning bsseq visualization

TSV with multi-allelic sites to VCF

TSV with multi-allelic sites to VCF 0 Dear all, I am trying to convert a TSV output of some tools to a valid VCF. This is what a sample of the file looks like: mutID Chr Position Ref Alt nANNO N_eps N_PSI tAD 474 chr2 50282084 GT G mappable 0.293935566…

Continue Reading TSV with multi-allelic sites to VCF

UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x8b in position 1: invalid start byte

I use python package to load and analysis vcf.gz files, for ch in chs: vcf_to_1240K_hdf(in_vcf_path = f”/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/GLIMPSE_ligated/merged_chr{ch}.vcf.gz”, path_vcf = f”/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.vcf”, path_h5 = f”/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.h5″, marker_path = f”/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/filters/snps_bcftools_ch{ch}.csv”, map_path = f”/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k.snp”, af_path = f”/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k_AF_ch{ch}.tsv”, col_sample_af = “”, buffer_size=20000, chunk_width=8, chunk_length=20000, ch=ch) but get this error below: UnicodeDecodeError Traceback (most recent call last)…

Continue Reading UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x8b in position 1: invalid start byte

vegan – Input distance matrix from QIIME2 to R

I am trying to coerce a distance matrix from QIIME2 to class dist in R. Upstream processing up to the generation of the distance matrix was performed as described in the QIIME2 document “Moving Pictures”. When I call as.dist, I get the following warning messages: imported_dist_matrix <- as.dist(dist_bray_df) ## Warning…

Continue Reading vegan – Input distance matrix from QIIME2 to R

ggplot2 – ggplot: “No non-missing arguments to min/max; returning Inf”

I’m attempting to recreate this plot (my version: lat/lon by year), but keep getting these warnings after running the ggplot code: sms2 |> mutate(fCYR = factor(CYR)) |> ggplot(aes(x = Longitude, y = Latitude, fill = est, group = fCYR)) + geom_raster(aes(x = Longitude, y = Latitude, fill = est, group…

Continue Reading ggplot2 – ggplot: “No non-missing arguments to min/max; returning Inf”

How to plot coverage and depth statistics of a bam file

This is the same problem I recently had, but I was not able to find a good solution. I was finally able to work a method which I am sharing here. While this question is 2 years old but I am hoping there are many individuals trying to work out…

Continue Reading How to plot coverage and depth statistics of a bam file

qiime2 distance matrix as input for R – General Discussion

Hello, I’m trying to analyse skin Microbiota from human samples. (V3-V4 Illumina Miseq.) I finished the moving pictures tutorial up to the point of creating the distance matrices, everything looks neat. Now I’d like to take these results into R to create PCoA Plots. 1.) I exported the Bray-Curtis-Distance-Matrix from…

Continue Reading qiime2 distance matrix as input for R – General Discussion

r – ggplot reorder y-axis by value of x-axis

I have this data set: bizdev company date category <chr> <chr> <date> <chr> 1 Jerry Harding Jones, Figueroa and Moon 2020-04-06 Mid-tier 2 Steve Vargas Stanley, Porter and Mccoy 2020-04-24 Indie 3 Steve Vargas Gardner and Sons 2020-04-19 Indie 4 Jerry Harding Parks-Chen 2020-04-11 Indie 5 Jerry Harding Bond Group…

Continue Reading r – ggplot reorder y-axis by value of x-axis

TxDB.Hsapiens.UCSC.hg38.knownGene with locateVariants() identifying SNPs from various chromosome being part of the same gene

I am trying to annotate a list of SNPs using the hg38 genome (knownGene) and locateVariants(). The program is able to successfully run and provide “GeneIDs” for several of the loci. However, some GeneIDs are applied to SNPs in completely different regions and on completely different chromosomes. When I cross…

Continue Reading TxDB.Hsapiens.UCSC.hg38.knownGene with locateVariants() identifying SNPs from various chromosome being part of the same gene

ChrX allele frequency in males and females

ChrX allele frequency in males and females 0 Hi, I have a question about the allele frequency output files (.frq) from VCFtools. My question is specific to chromosome X variants. I am writing out the allele frequencies for males and females separately to identify the variants with different allele frequencies….

Continue Reading ChrX allele frequency in males and females

[Solved] I need support for my R Studio class. I have a file as cvs and I…

I need support for my R Studio class. I have a file as cvs and I need to make an x-y scatter plot of date on the x-axis and hind-foot length on the y-axis, with different colors for the different genera. Image transcription text 25 surveys$date = ymd(paste(surveys$year, surveys$month, surveys$day,…

Continue Reading [Solved] I need support for my R Studio class. I have a file as cvs and I…

How to get human cDNA sequences together with UTR regions?

How to get human cDNA sequences together with UTR regions? 2 Dear all, I have downloaded the human genome and gtf files from Gencode. Based on these two files I want to generate a fasta file that has cDNA sequences including 5′ and 3′ UTRs for protein-coding genes only. What…

Continue Reading How to get human cDNA sequences together with UTR regions?

Find genes which are related to SNPs

Find genes which are related to SNPs 0 Hi Dears Is there any chance to get SNPEff or some files which work like that in NCBI or other dataset?? Actually I just want to find genes related to SNPs which are in a map file based on Oar_v4.0 or Oar_v3.1…

Continue Reading Find genes which are related to SNPs

Batch correction and Deseq2

Hi, I would like to ask that the deta input fort Deseq2 is required a count metric. However, after I look into my count data, I would that there is a batch effect in my count data. How should I normalized my read count data before applying in Deseq2 and…

Continue Reading Batch correction and Deseq2

r – Cloropleth using ggplot

id field in GeoJSON is not handled by broom::tidy() as-is, so there are no State codes in the resulting data frame and join will not work as intended. Though the column named id is still there, thus left_join() will not throw an error. If you check your spdf_Brasil_MVI, you may…

Continue Reading r – Cloropleth using ggplot

How to create a graph that shows pairwise Fst values at single SNP loci across the all chromosomes in R

this is not a complete answer, but it may help to empower you to create your own graphs. i try to keep track of a lot of different visualization related tools (cmdcolin.github.io/awesome-genome-visualization/), but indeed, just hand-coding the solution in R is a good approach. I get my mind blown by…

Continue Reading How to create a graph that shows pairwise Fst values at single SNP loci across the all chromosomes in R

samtools idxstats versus samtools view command

samtools idxstats versus samtools view command 1 Hi, I have mapped RNA-seq data to the human genome concatenated with a viral genome (26 chromosomes in total) with bowtie and need to get some numbers to calculate FPKM values manually for one viral gene, to retrieve the “total number of reads”…

Continue Reading samtools idxstats versus samtools view command

Realigning BAM files to new reference

Hi, I am looking to create a panel of normals for a somatic variant caller. For normals, I have been provided with a set of WES bam files that have been preprocessed according to GATK best practices. However, they have been aligned to another reference genome than my case samples…

Continue Reading Realigning BAM files to new reference

Failure working with the tmp directory

Nextflow Gatk DepthOfCoverage: Failure working with the tmp directory 0 I have nextflow workflow for which the process DepthOfCoverage failed to work with the defined tmp directory –tmp-dir tmp process pf_read_depth { tag “tag” scratch true publishDir … input: tuple val(pair_id), path(pf_bam) path refdir output: file(“final_${pair_id}.tsv”) script: “”” samtools index…

Continue Reading Failure working with the tmp directory

Python pgenlib assertion error when reading pgen file with flipped alleles

Hi Chris I want to use –ref-allele to create a .pgen file with a new set of reference alleles and use pgenlib in Python to read the data. I directly generated the .pgen file using  plink2 \    –pfile /n/scratch3/users/j/jz286/imp_geno/ukb_imp_chr${CHROM}_v3\    –ref-allele ${REF_FILE} 10 3\    –make-pgen \    –out…

Continue Reading Python pgenlib assertion error when reading pgen file with flipped alleles

allele frequency from vcf

I am trying to figure out why the allele frequencies I get from Plink and vcftools are different from what I have on the vcf file. My vcf file was produced by GATK jointgenotyping and filtered using vcftools. I have 448 samples in there. Here is a line from the…

Continue Reading allele frequency from vcf

Extract reads within given region, and their mates

Extract reads within given region, and their mates 1 Hi there, I want to extract all the reads located inside a given location. I would like to extract also the mates of those reads, because I’m working with PE data. I know that I can extract the reads using samtools…

Continue Reading Extract reads within given region, and their mates

data wrangling – Expanding dataframe in R

An option in tidyverse would be to expand the data with uncount (similar to rep(…) in base R), then convert the ‘Dummy_N’ column to binary by creating a logical vector with the sequence (row_number()) and its value, grouped by ‘Group’ column library(dplyr)# version >= 1.1.0 library(tidyr) df %>% uncount(Total_N, .remove…

Continue Reading data wrangling – Expanding dataframe in R

line 1 did not have 7 elements. –Putting bismark extractor files into methimpute

Error: line 1 did not have 7 elements. –Putting bismark extractor files into methimpute 0 file <- system.file(“extdata”,”file.txt”, package=”methimpute”) data(arabidopsis_chromosomes) arabidopsis_chromosomes$chromosome <- sub(‘chr’, ”, arabidopsis_chromosomes$chromosome) bismark.data <- importBismark(file, chrom.lengths=arabidopsis_chromosomes) I am trying to interpret my bismark results into methimpute but it gives me an error of : line 1 did…

Continue Reading line 1 did not have 7 elements. –Putting bismark extractor files into methimpute

Creating a trendline on a bar plot with ggplot2 in R

The discrete fill scale of your plot shows that the round variable you use for x and fill is not numeric, preventing geom_smooth from working properly. Reproduction of the issue library(dplyr) library(ggplot2) mydf <- iris %>% mutate(round = as.character(as.numeric(as.factor(Species)))) %>% group_by(round) %>% summarise(means = mean(Sepal.Length), se = sd(Sepal.Length)/sqrt(n())) %>% ungroup()…

Continue Reading Creating a trendline on a bar plot with ggplot2 in R

Admixture error in removing loci with no genotypes

I am attempting to run ADMIXTURE on plink derived .bed files and continue to get the error “Error: detected that all genotypes are missing for a SNP locus. Please apply quality-control filters to remove such loci.” I have already performed mutiple quality control steps prior to this point. E.g: Filtered…

Continue Reading Admixture error in removing loci with no genotypes