scrnaseq – Cells with zero expression of a given gene

This has been debated for a few years now as the “dropout” problem, which is actually a mixture of different issues.

On one hand, you are sequencing a relatively low input at a depth that does not reach saturation (i.e. you don’t measure all there is to measure). This is partly because, in droplet-based data such as 10X, you never use the same sequencing depth per cell that you would use for a bulk measurement: you produce far fewer sequencing reads per cell.

Additionally, the way a library is generated within a droplet with such a low input implies that transcripts will compete for a relatively limited amount of enzymes and reagents.

This means that a highly expressed gene (i.e. one for which a single cell has produced many transcript molecules) will be more likely to be detected than one with fewer molecules (there are other biases, but on their own they do not explain the phenomenon). Missing an expressed gene in this way would hardly ever happen in a bulk RNA sequencing dataset produced at an appropriate depth, and it is why we regard count data from scRNA-seq as “sparse” (with many 0’s).
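To make the dependence on expression level concrete, here is a minimal sketch. It assumes that each transcript molecule is captured independently with the same probability, which is a simplification of real library preparation; the function name and the 10% capture rate are made up for illustration.

```python
def dropout_probability(n_molecules: int, capture_rate: float) -> float:
    """Probability that none of the molecules of a gene is captured.

    Assumes each molecule is captured independently with probability
    `capture_rate` (a simplifying assumption, not a measured value).
    """
    return (1.0 - capture_rate) ** n_molecules


# Illustrative capture rate; real efficiencies vary by protocol and by cell.
for n in (1, 5, 20, 100):
    print(f"{n:>3} molecules -> P(zero count) = {dropout_probability(n, 0.1):.4f}")
```

Under this toy model a gene with a handful of molecules per cell is missed most of the time, while a gene with a hundred molecules is almost always detected.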

On the other hand, there is a third issue to consider that is more “biological” than “technical”: the way we used to think about transcription was rather mechanistic (I hope others will excuse the word).

What I mean by that is that you would think of transcription as an ordered set of steps whereby a transcription factor binds a promoter, aided by other cofactors and chromatin complexes, recruits the RNA polymerase and lets it do its job for as long (or for as many molecules) as needed. However, more biophysically motivated models, built on both in vitro and in vivo lines of evidence, have painted a different picture, in which transcription is bursty, has restarts and early stops, and does not necessarily work like putting a needle on a vinyl record. DNA loci may be transcribed at low levels due to phenomena that don’t immediately depend on transcriptional regulation, such as cell cycle dynamics. This is a big source of intercellular variability. If you think about it, these are large molecules operating in solution and associating with or dissociating from other molecules according to statistical thermodynamics. The “vinyl” view of transcription was good enough to make sense of most bulk RNA-seq results, where you are indeed observing the average expression of genes in a cell population or tissue, but once we get to the single-cell level such a model does not necessarily hold.

You know that GAPDH is a “housekeeping” gene because on a population average measured by RT-qPCR or Western Blot you never* see its levels fluctuate significantly, and they stay high compared to many other genes. At the single cell level, however, this need not be true at all times and for all cells. Hence a proportion of your zero-count genes will come from detection limits and low input (“technical” dropouts) and another proportion will come from the stochastic nature of transcription itself.
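The two sources of zeros can be illustrated with a small simulation. This is only a sketch under assumptions of my own: a gene is either “on” or “off” in a given cell, an “on” cell holds a Poisson-distributed number of molecules, and each molecule is captured independently. All names and parameter values are hypothetical and not taken from any dataset.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw from a Poisson distribution (Knuth's method, fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_zeros(n_cells=10_000, p_on=0.3, burst_mean=10.0,
                   capture_rate=0.1, seed=0):
    """Count zeros with no molecules present vs. zeros due to failed capture."""
    rng = random.Random(seed)
    tally = {"biological_zero": 0, "technical_zero": 0, "detected": 0}
    for _ in range(n_cells):
        # Bursty transcription, crudely: the gene is "on" in a fraction of cells.
        molecules = sample_poisson(rng, burst_mean) if rng.random() < p_on else 0
        # Library preparation: each molecule is captured independently.
        captured = sum(rng.random() < capture_rate for _ in range(molecules))
        if molecules == 0:
            tally["biological_zero"] += 1
        elif captured == 0:
            tally["technical_zero"] += 1
        else:
            tally["detected"] += 1
    return tally


print(simulate_zeros())
```

In the observed count matrix both kinds of zero look identical, which is exactly why telling them apart requires a model.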

Now the important aspect that the field has reflected upon is how to model these dropouts, or whether we even need to. The goal of these models is to understand variability and noise in the signal, allowing us to extract useful biological information. Should we use a specific distribution of counts that accounts for an excess of 0’s in the data (zero inflation)? Should we model dropouts entirely separately? Are distributions used in bulk RNA-seq analysis sufficient to model the data in an actionable and computationally efficient way? What about modelling burstiness and other biophysical properties?
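One simple way to approach the zero-inflation question is to compare the fraction of zeros you observe for a gene with the fraction a plain count distribution would predict at the same mean. The sketch below uses the standard zero probabilities of the Poisson and negative binomial distributions; the function names, the toy counts and the dispersion value are assumptions chosen for illustration.

```python
import math


def observed_zero_fraction(counts):
    """Fraction of cells in which the gene has a count of zero."""
    return sum(c == 0 for c in counts) / len(counts)


def poisson_zero_fraction(mean):
    """P(count = 0) under a Poisson distribution with the given mean."""
    return math.exp(-mean)


def negative_binomial_zero_fraction(mean, size):
    """P(count = 0) under a negative binomial with the given mean and
    size (inverse dispersion); approaches the Poisson value as size grows."""
    return (size / (size + mean)) ** size


# Toy counts for one gene across ten cells (made up for illustration).
counts = [0, 0, 0, 0, 0, 0, 1, 2, 5, 12]
mean = sum(counts) / len(counts)

print("observed:         ", observed_zero_fraction(counts))
print("Poisson expected: ", poisson_zero_fraction(mean))
print("NB expected:      ", negative_binomial_zero_fraction(mean, size=0.5))
```

If an overdispersed distribution such as the negative binomial already predicts about as many zeros as you observe, an explicit zero-inflation component may not be needed for that gene. In practice the dispersion would be estimated from the data rather than fixed by hand as it is here.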

I believe it can be argued that there is no “actual/correct statistical model” because models are choices and results should be interpreted in light of those choices and their assumptions. If you are looking to cluster your cells and find “markers” to characterize the composition of a tissue, a simple model may be sufficiently equipped to let you draw some interesting conclusions. If instead you want to look more in depth at dynamical properties (transitions between cell states, “RNA velocity”, splicing, etc.) you may need to use more rigorous, biophysically motivated models that are also more complex in their theoretical justifications and their implementation.

Here are some recommended readings if you want to dive deeper into the issue of dropouts and count modelling. The list is not exhaustive, but I hope it helps.

*More precise measurements in different tissues and cell lines have shown housekeeping genes to be variable in bulk as well, but that’s another issue.
