Hello,

I am trying to perform a t-test in python to estimate if the expression of a gene is statistically different between two clusters of cells in an scRNA-seq experiment.

When I do the following:

```
from scipy import stats as st
st.ttest_ind(counts[gene_1], counts[gene_2])
```

I get a p-value for some of my comparisons. In others, however, I get the following:

```
Ttest_indResult(statistic=nan, pvalue=nan)
```

A nan instead of a p-value.

Please correct me if I am wrong, but I think a t-test requires the data to be normally distributed. I have log-transformed my gene counts, which should help with that, but I still get a very zero-inflated distribution. Find below histograms of the expression of a gene in two clusters of cells for which a t-test returns a nan value:

And here is a histogram of the expression of another gene in two clusters of cells for which a t-test returns a valid p-value:

It looks like that the reason is the low expression level of the cells that express some of the gene. But I don’t understand why this leads to a nan p-value.

I have a number of questions:

Am I doing right using a t-test in this case? Is there another test I should be using instead? What should I do with these nan p-values, or how should I avoid them? Any help understanding what is going on would be so useful.

Thanks so much and my apologies in advance for my poor knowledge in statistics.

Read more here: Source link