The Biostar Herald for Monday, April 17, 2023

The Biostar Herald publishes user submitted links of bioinformatics relevance. It aims to provide a summary of interesting and relevant information you may have missed. You too can submit links here.

This edition of the Herald was brought to you by contributions from Istvan Albert
and was edited by Istvan Albert.


What are the advantages of using the T2T as a reference vs GRCh38 today? (www.biostars.org)

The best summary of the reasons for using the T2T reference.

submitted by: Istvan Albert


Variant calling and benchmarking in an era of complete human genome sequences | Nature Reviews Genetics (www.nature.com)

We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations.

submitted by: Istvan Albert


GitHub – marbl/ModDotPlot (github.com)

Mod.Plot is a novel dot plot visualization tool used to view tandem repeats, similar to StainedGlass. Mod.Plot utilizes modimizers to compute the Jaccard coefficient in order to estimate sequence identity. This significantly reduces the computational time required to produce these plots, enough to update in real time!
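The idea in the blurb above can be sketched in a few lines. This is only a toy illustration, not the tool's actual code: the function names, the BLAKE2 hash, the k-mer size, and the modulus are all assumptions made for the example, and the Jaccard-to-identity conversion is the Mash-style formula, which may differ from what the tool uses internally.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def modimizers(seq: str, k: int = 8, modulus: int = 4) -> set[int]:
    # Keep only k-mers whose hash is divisible by `modulus`,
    # which samples roughly 1/modulus of all k-mers.
    hashes = (kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h % modulus == 0}


def jaccard(a: set[int], b: set[int]) -> float:
    # Size of the intersection over size of the union; 0.0 if both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def identity_from_jaccard(j: float, k: int) -> float:
    # Mash-style conversion (an assumption for this sketch):
    # identity ~ 1 + ln(2j / (1 + j)) / k, clamped at 0.
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    # Two made-up sequences that differ at a single position.
    a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATCGATTACGGATCCA"
    b = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATCGTTTACGGATCCA"
    k = 8
    j = jaccard(modimizers(a, k), modimizers(b, k))
    print(f"Jaccard: {j:.3f}  estimated identity: {identity_from_jaccard(j, k):.3f}")
```

Because only a fraction of the k-mers is compared, each window-versus-window comparison in a dot plot gets cheaper, which is the speedup the blurb refers to.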

submitted by: Istvan Albert


[2303.13528] Many bioinformatics programming tasks can be automated with ChatGPT (arxiv.org)

Using 184 programming exercises from an introductory-bioinformatics course, we evaluated the extent to which one such model — OpenAI’s ChatGPT — can successfully complete basic- to moderate-level programming tasks. On its first attempt, ChatGPT solved 139 (75.5%) of the exercises. For the remaining exercises, we provided natural-language feedback to the model, prompting it to try different approaches. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises.

submitted by: Istvan Albert


Not all exons are protein coding: Addressing a common misconception: Cell Genomics (www.cell.com)

Exons are regions of DNA that are transcribed to RNA and retained after introns are spliced out. However, the term “exon” is often misused as synonymous to “protein coding,” including in some literature and textbook definitions. In contrast, only a fraction of exonic sequences are protein coding (<30% in humans). Both exons and introns are also present in untranslated regions (UTRs) and non-coding RNAs. Misuse of the term exon is problematic, for example, “whole-exome sequencing” technology targets <25% of the human exome, primarily regions that are protein coding. Here, we argue for the importance of the original definition of an exon for making functional distinctions in genetics and genomics. Further, we recommend the use of clearer language referring to coding exonic regions and non-coding exonic regions. We propose the use of coding exome sequencing, or CES, to more appropriately describe sequencing approaches that target primarily protein-coding regions rather than all transcribed regions.

submitted by: Istvan Albert


Avoiding false discoveries: Revisiting an Alzheimer’s disease snRNA-Seq dataset | bioRxiv (www.biorxiv.org)

Here, we correct these issues with best-practice approaches to snRNA-Seq processing and differential expression, resulting in 892 times fewer differentially expressed genes at a false discovery rate (FDR) of 0.05.

submitted by: Istvan Albert


Comparison of transformations for single-cell RNA-seq data | Nature Methods (www.nature.com)

The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well or better than the more sophisticated alternatives. This result highlights limitations of current theoretical analysis as assessed by bottom-line performance benchmarks.
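For readers who want to see what the "rather simple approach" looks like in practice, here is a minimal sketch using NumPy. It is not the paper's code: the function name, the choice of size factors (each cell's total count divided by the mean total), the pseudo-count of 1, and the cells × genes orientation (the transpose of the genes × cells table described above) are assumptions made for the example. It also assumes every cell has at least one count.

```python
import numpy as np


def shifted_log_pca(counts, n_components=2, pseudo_count=1.0):
    """Size-factor normalization, log(x + pseudo_count), then PCA.

    counts: cells x genes array of non-negative counts.
    Returns a cells x n_components array of principal-component scores.
    """
    counts = np.asarray(counts, dtype=float)

    # Adjust for variable sampling efficiency: one size factor per cell.
    totals = counts.sum(axis=1, keepdims=True)
    size_factors = totals / totals.mean()
    normalized = counts / size_factors

    # Logarithm with a pseudo-count to stabilize the variance.
    logged = np.log(normalized + pseudo_count)

    # PCA via SVD of the gene-centered matrix.
    centered = logged - logged.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


if __name__ == "__main__":
    # Simulated Poisson counts, purely for illustration.
    rng = np.random.default_rng(0)
    toy_counts = rng.poisson(lam=5.0, size=(50, 20))
    embedding = shifted_log_pca(toy_counts, n_components=2)
    print(embedding.shape)
```

The resulting scores can then be passed to whatever downstream step follows, such as clustering or nearest-neighbor graph construction.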

submitted by: Istvan Albert


