I am new to the bioinformatics world and, due to circumstances, I have been given complete responsibility for a human transcriptomics data analysis, without a bioinformatics background and, for now, without supervision.
This whole project feels like a big challenge, where I find one puzzle piece at a time.
It is a human transcriptomics analysis where we have QuantSeq data for 600 human patients with a certain condition which is similar in certain aspects but different in others. So we have 300 patients in one group and 300 patients in the other and all the data is from one time point. The data is preprocessed and I already have the unique read counts per sample in a table.
I have a couple of questions and I hope you guys can help me:
As far as I can tell, we do not have biological replicates: I have the unique counts per sample for about 70K genes, and every column in my R dataframe corresponds to one patient (so not two columns per patient, as I would expect with biological replicates). Am I right to assume that we do not have biological replicates?
The steps I have taken so far are (using the edgeR manual as a guide):
-Loading the dataset into R.
-Made a dataframe where the columns correspond with the samples and the rows with the genes.
-Made a DGE Object with the right condition per patient
-Filtered out lowly expressed genes with a raw count <10
-Performed normalization with the built-in TMM normalization method
Are the steps I took logical, and did I miss anything?
I would also like to know how to proceed from here, as I am expected to perform a differential expression and pathway analysis.
But I get the sense that not having biological replicates might be a big problem. Does anyone have tips on how to proceed?
Any help is greatly appreciated and earns you a digital cappuccino!
The steps you have taken so far are sound, and following the edgeR guide for your differential expression analysis is a good way to stay on track. The concern about biological replicates does not really apply here. Having a single sample per patient is by far the most common situation, since patients are not usually profiled multiple times. A differential expression analysis estimates the variability within each biological condition and uses it to compare the conditions, whether that is a KO vs WT experiment in a cell line or patient groupA vs groupB as in your case. In this situation the samples within each group act as “biological replicates” of each other and help you answer the question: “in what way does the biology (transcriptome) of groupA differ from that of groupB?”. There will of course be a high level of heterogeneity among samples of the same group, but with a cohort this large you should have good statistical power to compare the two groups. In a cell line KO experiment, by contrast, you want true replicates, because a single sample could carry unwanted, nonspecific effects of the experiment.
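To make the next steps concrete, here is a minimal sketch of how the workflow could continue from your normalized object to a two-group differential expression test with edgeR's quasi-likelihood pipeline. It runs on simulated counts, not your data: the matrix size, the group labels "A" and "B", and the object names (`counts`, `group`, `y`, `top`) are all assumptions for illustration. It also uses `filterByExpr()` for the filtering step, which is a swap-in for your raw count <10 rule rather than a reproduction of it.

```r
library(edgeR)

set.seed(1)

# Simulated stand-in for the real count table (assumption, not real data):
# rows are genes, columns are patients, one column per patient.
n_genes <- 2000
n_per_group <- 20
counts <- matrix(
  rnbinom(n_genes * 2 * n_per_group, mu = 50, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("patient", seq_len(2 * n_per_group)))
)

# One condition label per patient; patients in the same group serve as
# each other's biological replicates.
group <- factor(rep(c("A", "B"), each = n_per_group))

# Build the DGEList, filter lowly expressed genes, and TMM-normalize.
y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y)                 # uses the group stored in y$samples
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y, method = "TMM")

# Two-group design: the second coefficient is the B vs A difference.
design <- model.matrix(~ group)

# Estimate dispersions, fit the quasi-likelihood model, test B vs A.
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)

# Full results table, sorted by p-value, with FDR-adjusted values.
top <- topTags(res, n = Inf)$table
head(top)
```

With real patient data you would probably want to extend `design` with known covariates (for example sex, age or sequencing batch, if you have them), since patient cohorts are much more heterogeneous than cell lines. The ranked table in `top` is also the usual starting point for a pathway analysis.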