Bad news first: what you want can’t be done well. If you are doing this to learn the process, then it doesn’t matter what kind of data you have. But if you are doing this to make a clinically relevant model (or for research), your data is not sufficient.

You have what is commonly known as an **underdetermined system**, which in plain terms means that you have too many variables (in your case, genes) and not enough equations (in your case, samples). These kinds of systems either don’t have a solution (which is actually not bad), or have an infinite number of solutions (which is bad because it leads to overfitting).
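To make the "infinite number of solutions" point concrete, here is a minimal sketch on made-up data. The sizes (20 samples, 2000 genes) and the random values are assumptions for illustration, not your dataset. It shows that plain least squares can fit pure noise exactly, and that a second, different coefficient vector fits just as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far more variables (genes) than equations (samples).
n_samples, n_genes = 20, 2000
X = rng.normal(size=(n_samples, n_genes))
y = rng.normal(size=n_samples)  # pure noise, unrelated to X

# Minimum-norm least-squares solution of the underdetermined system.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.max(np.abs(X @ coef - y))

# Any vector from the null space of X can be added without changing the fit.
_, _, vt = np.linalg.svd(X, full_matrices=True)
null_vec = vt[-1]  # orthogonal to every row of X
coef2 = coef + 5.0 * null_vec
residual2 = np.max(np.abs(X @ coef2 - y))

print("max training error, solution 1:", train_residual)
print("max training error, solution 2:", residual2)
print("solutions identical:", np.allclose(coef, coef2))
```

Both training errors should be essentially zero even though `y` is random, which is exactly why a perfect fit on this kind of data tells you nothing about how the model will behave on new samples.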

Two ways out of this predicament: get more samples (in your case a lot more), or reduce the number of variables (which seems to be your choice). Now, reducing 2000 genes to 1000 or 500 would not be a problem, but you need to get them down to 10 or even below. If it were that easy to find only 10 genes responsible for cancer progression (or the lack of it), someone would have done it already.

Last piece of advice: complex models (random forests would qualify) tend to overfit terribly on underdetermined problems. Your only chance of avoiding overfitting (and not a great one given your particular setup) is to model this using simple, linear methods. Lasso would work because it uses L1 regularization, which will squeeze many regression coefficients down to zero and effectively remove many variables. Still, I am not sure that even lasso can reliably eliminate as many variables as your data demands. If you still want to give it a try, a Python solution is to run lasso in cross-validation mode (LassoCV in scikit-learn), which also finds the optimal value of alpha by fitting across a range of values. If you decide to try it, I suggest setting the number of folds equal to your actual number of samples, which essentially becomes leave-one-out cross-validation (LOOCV).

scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
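For orientation, a rough sketch of lasso with LOOCV in scikit-learn might look like the following. The data here is synthetic (30 samples, 200 "genes", three of which truly matter), and the sizes, the variable names, and the scaling step are all my assumptions, so substitute your own expression matrix and outcome. It also assumes a continuous outcome; a binary one (progression yes/no) would call for an L1-penalized logistic regression instead.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix (assumed sizes, not real data).
n_samples, n_genes = 30, 200
X = rng.normal(size=(n_samples, n_genes))
true_coef = np.zeros(n_genes)
true_coef[:3] = [3.0, -2.0, 1.5]  # only three genes carry signal
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Standardize features, then let LassoCV pick alpha with leave-one-out CV
# (one fold per sample).
model = make_pipeline(
    StandardScaler(),
    LassoCV(cv=LeaveOneOut(), max_iter=10000),
)
model.fit(X, y)

lasso = model.named_steps["lassocv"]
selected = np.flatnonzero(lasso.coef_)  # indices of genes with nonzero weight
print("chosen alpha:", lasso.alpha_)
print("genes kept:", selected.size, "of", n_genes)
```

The genes that survive are the ones with nonzero entries in `lasso.coef_`. On real data with few samples, expect that list to change noticeably if you drop or add a sample, which is another symptom of the problem described above.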

One last time: I want to stress that most likely you don’t have enough data to make a reliable model no matter what kind of data wrangling is employed.
