Scikit-Learn Cheat Sheet (2022), Python for Data Science

The absolute basics for beginners learning Scikit-Learn in 2022


Scikit-learn is a free, open-source machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, along with efficient tools for data mining and data analysis. It’s built on NumPy, SciPy, and Matplotlib.

Sections:
1. Basic Example
2. Loading the Data
3. Training and Test Data
4. Preprocessing the Data
5. Create Your Model
6. Model Fitting
7. Prediction
8. Evaluate Your Model’s Performance
9. Tune Your Model

Basic Example:

The code below demonstrates the basic steps of using scikit-learn to create and run a model on a set of data.

The steps in the code include: loading the data, splitting into train and test sets, scaling the sets, creating the model, fitting the model on the data, using the trained model to make predictions on the test set, and finally evaluating the performance of the model.

>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X,y = iris.data[:,:2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X,y)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
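
The final call returns a single float between 0 and 1: the fraction of test samples the KNN classifier labeled correctly. Since only the first two iris features are used here, expect a decent but not perfect score.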

Loading the Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that convert to numeric arrays, such as pandas DataFrames, are also acceptable.

>>> import numpy as np
>>> X = np.random.random((3,2))
>>> X
array([[0.21069686, 0.33457064],
       [0.23887117, 0.6093155 ],
       [0.48848537, 0.62649292]])
>>> y = np.array(['A','B','A'])
>>> y
array(['A', 'B', 'A'], dtype='<U1')

Training and Test Data

Splits the dataset into training and test sets for both the X and y variables.

>>> from sklearn.model_selection import train_test_split
>>> X_train,X_test,y_train,y_test = train_test_split(X,y, random_state = 0)
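
By default, train_test_split holds out 25% of the samples for testing. A hedged sketch of two commonly used options, test_size (controls the split size) and stratify (preserves class proportions across the splits), assuming a dataset with several samples per class, such as the iris X and y from the basic example:

>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 0)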

Preprocessing The Data

Getting the data ready before the model is fitted.

Standardization

Standardizes the features by removing the mean and scaling to unit variance.

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)
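
Each preprocessor also offers fit_transform as a one-step shortcut when only the transformed training data is needed (keep the fitted scaler object if the test set must be scaled the same way):

>>> standardized_X = StandardScaler().fit_transform(X_train)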

Normalization

Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm equals one.

>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Binarization

Binarize data (set feature values to 0 or 1) according to a threshold.

>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold = 0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Encoding Categorical Features

Encodes target labels with values between 0 and n_classes-1.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit_transform(y_train)
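
Note that LabelEncoder is meant for the target vector y. For encoding categorical input features, OneHotEncoder is the usual tool; a minimal sketch, where X_categorical is a hypothetical 2-D list of string-valued features (not part of the data above):

>>> from sklearn.preprocessing import OneHotEncoder
>>> X_categorical = [['red', 'S'], ['blue', 'M'], ['red', 'L']]
>>> enc = OneHotEncoder(handle_unknown = 'ignore')
>>> X_onehot = enc.fit_transform(X_categorical)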

Imputing Missing Values

Imputation transformer for completing missing values.

>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values = 0, strategy = 'mean')
>>> imp.fit_transform(X_train)
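
Setting missing_values = 0 treats zeros as missing; with real-world data the gaps are more often marked as np.nan, which is also SimpleImputer's default. A sketch assuming NaN-marked missing entries:

>>> import numpy as np
>>> imp_nan = SimpleImputer(missing_values = np.nan, strategy = 'median')
>>> imp_nan.fit_transform(X_train)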

Generating Polynomial Features

Generate a new feature matrix consisting of all polynomial combinations of the features with degrees less than or equal to the specified degree.

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
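
A degree of 5 grows the feature count very quickly. A lighter, commonly used starting point is degree 2 with the constant bias column dropped; a sketch:

>>> poly2 = PolynomialFeatures(degree = 2, include_bias = False)
>>> poly2.fit_transform(X)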

Create Your Model

Creation of various supervised and unsupervised learning models.

Supervised Learning Models

  • Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
  • Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel="linear")
  • Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
  • K-Nearest Neighbors (KNN)
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors = 5)

Unsupervised Learning Models

  • Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components = 0.95)
  • K-Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters = 3, random_state = 0)
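
Passing a float between 0 and 1 as n_components tells PCA to keep however many components are needed to explain that fraction of the variance (here 95%), rather than a fixed number of components.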

Model Fitting

Fitting supervised and unsupervised learning models to the data.

Supervised learning

  • Fit the model to the data
>>> lr.fit(X, y)
>>> knn.fit(X_train,y_train)
>>> svc.fit(X_train,y_train)

Unsupervised learning

  • Fit the model to the data
>>> k_means.fit(X_train)
  • Fit to data, then transform it
>>> pca_model = pca.fit_transform(X_train)

Prediction

Predicting test sets using trained models.

  • Supervised Estimators
>>> y_pred = lr.predict(X_test)
  • Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)
  • Estimate probability of a label
>>> y_pred = knn.predict_proba(X_test)

Evaluate Your Model’s Performance

Classification, regression, and clustering metrics that measure how well a model performs on a test set.

Classification Metrics

>>> knn.score(X_test,y_test)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test,y_pred)
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test,y_pred))
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test,y_pred))

Regression Metrics

>>> from sklearn.metrics import mean_absolute_error
>>> mean_absolute_error(y_test,y_pred)
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test,y_pred)
>>> from sklearn.metrics import r2_score
>>> r2_score(y_test, y_pred)

Clustering Metrics

>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_test,y_pred)
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_test,y_pred)
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_test,y_pred)

Cross-Validation

  • Evaluate a score by cross-validation
>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
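
By default, cross_val_score uses the estimator's own score method (accuracy for classifiers). A sketch passing an explicit scoring string instead, here macro-averaged F1 for the same KNN model:

>>> print(cross_val_score(knn, X_train, y_train, cv = 4, scoring = 'f1_macro'))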

Tune Your Model

Finding correct parameter values that will maximize a model’s prediction accuracy.

Grid Search

Exhaustive search over specified parameter values for an estimator. The example below searches for the number of neighbors (n_neighbors) and the distance metric that maximize the KNN model’s accuracy.

>>> from sklearn.model_selection import GridSearchCV
>>> params = {'n_neighbors': np.arange(1,3), 'metric':['euclidean','cityblock']}
>>> grid = GridSearchCV(estimator = knn, param_grid = params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
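
Because GridSearchCV refits a copy of the best model on the full training set by default, the winning parameter combination and the refit estimator can be used directly:

>>> print(grid.best_params_)
>>> y_pred = grid.best_estimator_.predict(X_test)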

Randomized Parameter Optimization

Randomized search on hyperparameters. In contrast to Grid Search, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {'n_neighbors':range(1,5), 'weights':['uniform','distance']}
>>> rsearch = RandomizedSearchCV(estimator = knn, param_distributions = params, cv = 4, n_iter = 8, random_state = 5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)

Scikit-learn is an extremely useful library for building various machine learning models. The sections above provided a basic step-by-step process for creating, fitting, and evaluating different models. If you want to learn more, check out the Scikit-Learn documentation, as there are still plenty of useful functions to explore.

Join my email list with 5k+ people to get “The Complete Python for Data Science Cheat Sheet Booklet” for FREE
