python – predictions on datasets

We were given 3 datasets: X_public, y_public and X_eval. We are supposed to train a model on the public ones and generate predictions on X_eval. Our success will be tested on y_eval, which we don't have.

So I wrote some code to train my model and make predictions on the public data, and I got a y_predict. But now I have no idea how to make predictions on X_eval, mostly because of an error saying that the sizes do not match. Initially, X_public is 600×200 and X_eval is 200×200. Also, when I changed test_size to 0.3333, there wasn't an error, but the prediction scores were really low.
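To make the shape question concrete: as far as I understand, a fitted scikit-learn transformer only needs the same number of columns it was fitted on, while the number of rows can differ. Here is a minimal sketch with random stand-in data (the `_demo` arrays are synthetic and hypothetical, not my actual files):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_public_demo = rng.normal(size=(600, 200))  # synthetic stand-in for X_public
X_eval_demo = rng.normal(size=(200, 200))    # synthetic stand-in for X_eval

# Fit on 600 rows, transform 200 rows: fine, because both have 200 columns
scaler_demo = StandardScaler().fit(X_public_demo)
X_eval_scaled = scaler_demo.transform(X_eval_demo)
shape_ok = X_eval_scaled.shape == (200, 200)

# A different column count is what triggers a size-mismatch error
try:
    scaler_demo.transform(X_eval_demo[:, :150])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

So a 600-row training set and a 200-row evaluation set should not conflict by themselves. A mismatch would have to come from the column count changing somewhere in the preprocessing.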

How do I make that prediction on X_eval with a model built from X_public?

Code:

get_ipython().magic('reset -sf')

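import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

y = np.load('y_public300.npy', allow_pickle=True)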
X = np.load('X_public300.npy', allow_pickle=True)
X_eval = np.load('X_eval300.npy', allow_pickle=True)


# ONE HOT ENCODING
ohe = OneHotEncoder(sparse=False)  # on scikit-learn >= 1.2, use sparse_output=False instead

# dataset X_public
ohe_coded = ohe.fit_transform(X[:,180:200])
X = np.delete(X, slice(180, 200), 1)
X = np.concatenate((X, ohe_coded), axis=1)

ohe_coded2 = ohe.transform(X_eval[:,180:200])
X_eval = np.delete(X_eval, slice(180, 200), 1)
X_eval = np.concatenate((X_eval, ohe_coded2), axis=1)


X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, random_state=0)


# SimpleImputer
simp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train = simp.fit_transform(X_train)
X_test = simp.transform(X_test)

X_eval = simp.transform(X_eval)

# StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)           

# Scale X_eval with the fitted scaler (not the imputer, which was already applied above)
X_eval_std = scaler.transform(X_eval)


# PCA
pca = PCA(n_components=300)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

X_eval_pca = pca.transform(X_eval_std)


# PIPELINE
pipe_svc = Pipeline([('clf', SVC(random_state=1))])
clf = SVC(random_state=1)


param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range, 
               'clf__kernel': ['linear']},
                 {'clf__C': param_range, 
                  'clf__gamma': param_range, 
                  'clf__kernel': ['rbf']}]

#GridSearch
gs = GridSearchCV(estimator=Pipeline([('clf', SVC(random_state=1))]), 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  cv=10,
                  n_jobs=-1)


gs.fit(X_train_pca, y_train)


print("*****Best Score*****")
print(gs.best_score_)

svc = SVC(kernel="rbf", gamma="scale" ,probability=True ,random_state=0)
svc.fit(X_train_pca, y_train)
y_predict = svc.predict(X_test_pca)


# Metrics expect the true labels first, then the predictions
print("Classification accuracy ", accuracy_score(y_test, y_predict))
print("ROC AUC ", roc_auc_score(y_test, y_predict))

I was thinking of doing y_predict = svc.predict(X_eval_pca), but that is what gives me the error.
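One way to keep the preprocessing identical for both datasets might be to put every step into a single Pipeline, fit it once on the public data, and then call predict on X_eval. The following is only a sketch under assumptions: the data is synthetic, the split into numeric columns 0-179 and categorical columns 180-199 mirrors the slicing in my code, `PCA(n_components=50)` is an arbitrary choice, and all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_demo(n_rows):
    # Synthetic stand-in: 180 numeric columns (some NaN) + 20 categorical code columns
    num = rng.normal(size=(n_rows, 180))
    num[rng.random(num.shape) < 0.05] = np.nan
    cat = rng.integers(0, 4, size=(n_rows, 20)).astype(float)
    return np.hstack([num, cat])

X_demo = make_demo(600)                                  # plays the role of X_public
y_demo = (np.nan_to_num(X_demo[:, 0]) > 0).astype(int)   # made-up binary labels
X_eval_demo = make_demo(200)                             # plays the role of X_eval

numeric_cols = list(range(0, 180))
categorical_cols = list(range(180, 200))

preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())]), numeric_cols),
        # handle_unknown="ignore" avoids errors on categories unseen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # force dense output so PCA can consume it
)

model = Pipeline([
    ("prep", preprocess),
    ("pca", PCA(n_components=50)),
    ("clf", SVC(kernel="rbf", gamma="scale", random_state=0)),
])

# Fit every step on the public data only, then reuse the fitted steps on X_eval
model.fit(X_demo, y_demo)
y_eval_pred = model.predict(X_eval_demo)
```

With this layout, each transformer is fitted exactly once and applied in the same order to X_eval, so there is no chance of calling the wrong fitted object on the evaluation data. The same `model` could presumably also be passed to GridSearchCV as the estimator, using parameter names such as `clf__C`.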
