Python scikit-learn: prediction on dataset with text and numeric variables

The issue is that you are using TfidfTransformer, which transforms a count matrix into a normalized tf or tf-idf representation, instead of TfidfVectorizer, which converts a collection of raw documents into a matrix of TF-IDF features. A column of raw strings such as 'Project Title' therefore needs TfidfVectorizer.
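
To make the difference concrete, here is a minimal sketch (the `docs` list simply reuses the example titles from this answer) showing that TfidfTransformer only works once the text has been turned into counts, and that CountVectorizer followed by TfidfTransformer should give the same result as TfidfVectorizer when default settings are used:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ['hello stackoverflow', 'text column', 'scikit learn', 'machine learning projects']

# Two-step route: raw text -> token counts -> tf-idf weights.
# TfidfTransformer expects the count matrix, not the raw strings.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text -> tf-idf weights.
one_step = TfidfVectorizer().fit_transform(docs)

# With default parameters both routes produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same, one_step.shape)
```

So if you want to keep TfidfTransformer, put a CountVectorizer in front of it; otherwise TfidfVectorizer does both steps at once.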

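import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({'Project Title': ['hello stackoverflow', 'text column', 'scikit learn', 'machine learning projects']})
vect = TfidfVectorizer(ngram_range=(1, 2))  # use unigrams and bigrams
tfidf = vect.fit_transform(X['Project Title'])  # sparse tf-idf matrix, one row per title
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
X_tfidf = pd.DataFrame(tfidf.toarray(), columns=vect.get_feature_names_out())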
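
Since the question is about predicting from both text and numeric variables, one possible way to combine them is a ColumnTransformer inside a Pipeline. The sketch below is only an illustration: the numeric column `Budget`, the labels `y`, and the choice of StandardScaler and LogisticRegression are assumptions, not taken from the original data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    'Project Title': ['hello stackoverflow', 'text column', 'scikit learn', 'machine learning projects'],
    'Budget': [100.0, 250.0, 75.0, 400.0],  # hypothetical numeric column
})
y = [0, 1, 0, 1]  # hypothetical target labels

preprocess = ColumnTransformer([
    # A plain string selects the column as 1-D text, which TfidfVectorizer requires.
    ('title', TfidfVectorizer(ngram_range=(1, 2)), 'Project Title'),
    # A list selects a 2-D block, which StandardScaler requires.
    ('numeric', StandardScaler(), ['Budget']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression()),  # assumed classifier; any estimator would do
])

model.fit(X, y)
predictions = model.predict(X)

# Width of the combined feature matrix: tf-idf columns plus the scaled numeric column.
n_features = model.named_steps['preprocess'].transform(X).shape[1]
print(len(predictions), n_features)
```

Keeping the vectorizer inside the pipeline means the same vocabulary learned during fit is reused at predict time, so new rows are transformed consistently.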
