Which covariates and factors to include in model matrix

Hi all,

I am running a human whole-blood transcriptomics experiment comparing two groups of patients, each with a different disease. The goal is to gain insight into how the gene expression profiles differ between the two groups. I have read the edgeR manual and other documentation online and on PubMed, but I am still not sure how to construct a valid design matrix.

Of course I have included the conditions in the design matrix, but I am wondering which other factors or covariates I should correct for. It seems that you should correct for factors and covariates that could influence gene expression. In online tutorials and articles these are mostly things like batch effects, specific time points or different treatment conditions. The question I keep asking myself is: do I have to correct for covariates such as age, white blood cell differential, or other laboratory measurements (NT-proBNP, kidney function, etc.), or for factors such as sex and comorbidities (for example hypertension, diabetes, COPD)?

The answer seems to be that you should correct for them IF you expect them to significantly alter gene expression between the groups. But how do I establish that those factors do in fact alter gene expression? Moreover, if the gene expression differences between the groups are driven by some of those factors, aren't those exactly the differences I want to find?

I have constructed the design matrix with the model.matrix function as follows:

design <- model.matrix(~ 0 + types, data = y$samples)
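
To make clear what this call produces, here is a minimal sketch on made-up data. The `samples` data frame and the level names `diseaseA` and `diseaseB` are hypothetical stand-ins for `y$samples` and the real group labels; only base R is used.

```r
# Hypothetical stand-in for y$samples: 6 samples, 3 per disease group
samples <- data.frame(
  types = factor(rep(c("diseaseA", "diseaseB"), each = 3))
)

# No-intercept (group-means) parameterization: one indicator column per group
design <- model.matrix(~ 0 + types, data = samples)
print(colnames(design))  # "typesdiseaseA" "typesdiseaseB"

# With this parameterization the group comparison is not a single coefficient;
# it is a contrast between the two group columns (diseaseB minus diseaseA)
contrast <- c(typesdiseaseA = -1, typesdiseaseB = 1)
print(contrast)
```

With this layout each coefficient is a per-group expression level, so the disease comparison has to be requested explicitly as a contrast.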

I would also like to know whether I have to include an intercept term when adding a covariate. As a basic example, should it be:

design <- model.matrix(~types + age, data = y$samples)

or is the right way:

design <- model.matrix(~ 0 + types + age, data = y$samples)
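
To make the two options concrete, here is a minimal sketch on made-up data. The `samples` data frame, its ages, the level names and the simulated response `resp` are all hypothetical, and an ordinary least-squares fit with `lm.fit` stands in for the edgeR model purely for illustration. It suggests that the two formulas describe the same model and differ only in how the group columns are coded.

```r
# Hypothetical stand-in for y$samples: two disease groups plus an age covariate
samples <- data.frame(
  types = factor(rep(c("diseaseA", "diseaseB"), each = 3)),
  age   = c(54, 61, 47, 66, 58, 70)
)

design_int   <- model.matrix(~ types + age, data = samples)
design_noint <- model.matrix(~ 0 + types + age, data = samples)

print(colnames(design_int))    # "(Intercept)" "typesdiseaseB" "age"
print(colnames(design_noint))  # "typesdiseaseA" "typesdiseaseB" "age"

# Fit both designs to the same made-up response with ordinary least squares
set.seed(1)
resp <- rnorm(nrow(samples))
fit_int   <- lm.fit(design_int, resp)
fit_noint <- lm.fit(design_noint, resp)

# Both designs span the same column space, so the fitted values agree
same_fit <- isTRUE(all.equal(unname(fit_int$fitted.values),
                             unname(fit_noint$fitted.values)))

# The age coefficient is the same in both parameterizations
same_age <- isTRUE(all.equal(unname(fit_int$coefficients["age"]),
                             unname(fit_noint$coefficients["age"])))

# The group effect from the intercept design equals the B-minus-A contrast
# of the two group coefficients in the no-intercept design
diff_noint <- fit_noint$coefficients["typesdiseaseB"] -
  fit_noint$coefficients["typesdiseaseA"]
same_group <- isTRUE(all.equal(unname(fit_int$coefficients["typesdiseaseB"]),
                               unname(diff_noint)))

print(c(same_fit = same_fit, same_age = same_age, same_group = same_group))
```

In this sketch the choice between the two formulas changes which quantity each coefficient represents, not the fit: with an intercept, `typesdiseaseB` is already the group difference adjusted for age, while without one the difference is formed as a contrast between the two group columns.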
