Hi! I’m new to bioinformatics and I’m working with 6 different RNA-seq (high-throughput) studies from GEO. Three of the GSEs contain gene expression data for tumors and the other three contain healthy tissue.
I’m going to do batch correction, and I’m wondering: do I merge all the datasets first and then do normalization and batch correction on everything together? Or do I merge the 3 tumor GSEs and do normalization and batch correction on that merged dataset, do the same separately for the 3 healthy-tissue GSEs, and then merge the two results, if that makes sense?
No, you don’t. You cannot collect arbitrary datasets and then expect some stats magic to make them comparable. You need identical wet-lab processing for a fair comparison; otherwise batch effects obscure the results. You cannot correct for them here because each batch (= each study) is nested within a single condition (tumor or normal), so the study effect and the condition effect cannot be separated. This is a very common problem, and the only way around it is to either find a study that produced cases and controls in one go, or generate the data yourself with a proper study design. With the data above you have a fully confounded design, and there is nothing you can do about it.
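If it helps to see why, here is a minimal sketch in Python with NumPy. It builds the design matrix for a model with one condition term plus one term per study, and checks its rank. The sample counts (two per study) and the helper names `design_matrix` and `is_confounded` are made up for illustration; they are not taken from your data or from any batch-correction package.

```python
import numpy as np

def design_matrix(batch, condition):
    """Intercept + condition + batch indicators (first batch is the reference)."""
    batch = np.asarray(batch)
    condition = np.asarray(condition, dtype=float)
    intercept = np.ones(len(batch))
    n_batches = batch.max() + 1
    # One 0/1 column per batch, skipping batch 0 to avoid the trivial
    # redundancy with the intercept.
    batch_dummies = (batch[:, None] == np.arange(1, n_batches)).astype(float)
    return np.column_stack([intercept, condition, batch_dummies])

def is_confounded(batch, condition):
    """True if the condition and batch effects cannot be estimated separately."""
    X = design_matrix(batch, condition)
    return np.linalg.matrix_rank(X) < X.shape[1]

# Hypothetical layout like the one in the question: 6 studies, 2 samples each,
# studies 0-2 are tumor only and studies 3-5 are healthy only.
nested_batch = np.repeat(np.arange(6), 2)
nested_condition = (nested_batch < 3).astype(int)  # 1 = tumor, 0 = healthy

# Hypothetical alternative: 3 studies that each contain tumor AND healthy samples.
crossed_batch = np.repeat(np.arange(3), 4)
crossed_condition = np.tile([1, 1, 0, 0], 3)

print("nested design confounded: ", is_confounded(nested_batch, nested_condition))
print("crossed design confounded:", is_confounded(crossed_batch, crossed_condition))
```

In the nested layout the condition column is an exact linear combination of the study columns, so the matrix loses rank and no model can tell "tumor vs healthy" apart from "these studies vs those studies". In the crossed layout every study contributes both conditions, the matrix has full rank, and a batch term can be estimated alongside the condition term.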
Oh, I see you asked this before and the answer was the same: