I want to use scikit-learn logistic regression to train a model on a labelled single-cell RNA-seq sample and then apply this model to new, unlabelled single-cell RNA-seq samples to annotate their cells. The sample I use for training has about 23,000 genes, but the unlabelled samples have different numbers of genes. The trained model expects an input of 23,000 genes, so I want to ask what the best approach would be.
I could alter the count table of each unlabelled sample, adding the missing genes with a value of 0 for all cells, and apply the model to the result. But this would introduce false information.
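To make this first option concrete, here is a minimal sketch of what I mean, assuming the count tables are pandas DataFrames with cells as rows and gene names as columns. The gene names, cell names and the helper `align_to_training_genes` are made up for illustration. Note that genes absent from the training sample are dropped as a side effect.

```python
import pandas as pd


def align_to_training_genes(counts, training_genes):
    """Return a cells x genes table with exactly the training genes, in training order.

    Genes missing from `counts` are added as all-zero columns;
    genes not in `training_genes` are dropped.
    """
    return counts.reindex(columns=training_genes, fill_value=0)


# Hypothetical gene list of the labelled (training) sample.
training_genes = ["GeneA", "GeneB", "GeneC"]

# Hypothetical unlabelled sample: lacks GeneA and GeneC, has an extra GeneD.
new_counts = pd.DataFrame(
    {"GeneB": [3, 0], "GeneD": [1, 5]},
    index=["cell1", "cell2"],
)

aligned = align_to_training_genes(new_counts, training_genes)
print(aligned)
```

The aligned table has the shape the trained model expects, so it could be passed to `model.predict`, but the zeros for GeneA and GeneC are exactly the false information I am worried about.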
I could take the intersection of the genes across all samples and train the model on this set of common genes. But then the model depends heavily on the specific samples being analysed.
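The second option would look roughly like the sketch below, again assuming pandas DataFrames with cells as rows and genes as columns. The data are random toy counts, and the labels, gene names and the helper `common_genes` are hypothetical. The model has to be retrained whenever the set of samples (and hence the shared gene set) changes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def common_genes(tables):
    """Genes present in every table, kept in the column order of the first table."""
    shared = set(tables[0].columns)
    for table in tables[1:]:
        shared &= set(table.columns)
    return [gene for gene in tables[0].columns if gene in shared]


rng = np.random.default_rng(0)

# Toy labelled sample: 20 cells x 4 genes, with made-up cell type labels.
labelled = pd.DataFrame(
    rng.poisson(2.0, size=(20, 4)),
    columns=["GeneA", "GeneB", "GeneC", "GeneD"],
)
labels = np.array(["T cell", "B cell"] * 10)

# Toy unlabelled sample: different gene set and column order.
unlabelled = pd.DataFrame(
    rng.poisson(2.0, size=(5, 3)),
    columns=["GeneD", "GeneB", "GeneE"],
)

# Restrict both tables to the shared genes, in one consistent order.
genes = common_genes([labelled, unlabelled])

model = LogisticRegression(max_iter=1000).fit(labelled[genes], labels)
predicted = model.predict(unlabelled[genes])
print(genes, predicted)
```

Selecting `labelled[genes]` and `unlabelled[genes]` with the same list keeps the column order identical for training and prediction, which matters because the model matches features by position.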
So neither of these options seems correct to me.
I have successfully trained the model on the labelled data and I am now struggling to find the right strategy to proceed.
I would very much appreciate any input!