SVM with Univariate Feature Selection in Scikit Learn

Support Vector Machines (SVMs) are a powerful family of machine learning algorithms used for classification and regression. An SVM looks for the decision boundary that maximizes the margin between the classes. However, training an SVM can be computationally expensive on large datasets, and the result is sensitive to the choice of features: irrelevant or redundant features can make the model more complex and harder to interpret.

Univariate feature selection is a method for selecting the most informative features in a dataset. Each feature is scored individually on its relationship with the target variable using a statistical test, and the best-scoring features are kept according to a defined criterion, such as the k highest scores or a significance threshold.

In univariate feature selection, the focus is on individual features and their contribution to the target variable, rather than considering the relationships between features. This method is simple and straightforward, but it does not take into account any interactions or dependencies between features.
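The limitation about interactions can be illustrated with a small sketch. The toy data below is hypothetical and not part of the examples in this article: the label is the XOR of two binary features, so the label depends entirely on the pair of features, yet neither feature is informative on its own.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Hypothetical toy data: all four combinations of two binary features,
# repeated 10 times. The label is the XOR of the two features.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

# Score each feature on its own against the label.
scores, pvalues = f_classif(X, y)
print("F-scores:", scores)
print("p-values:", pvalues)
```

Each feature has the same mean in both classes, so a univariate test gives it no credit even though the two features together determine the label exactly. In a case like this, a univariate selector could discard features that a model would need.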

Univariate feature selection is useful when working with a large number of features and the goal is to reduce the dimensionality of the data and simplify the modeling process. It is also useful for feature selection in cases where the relationship between the target variable and individual features is not complex and can be understood through a simple statistical analysis.

Syntax of SelectKBest():

Select features according to the k highest scores.

sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)

score_func : A function that takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores. Common choices are f_classif, f_regression, chi2, and mutual_info_classif. The default is f_classif, which is intended for classification data.

k : An integer giving the number of top features to select, or “all” to keep every feature. The default value is 10.
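As a minimal sketch of these two parameters, the snippet below runs SelectKBest on synthetic data. The dataset from make_classification and the choice of k=3 are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 100 samples, 8 features,
# of which 3 are informative.
X, y = make_classification(n_samples=100, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep the 3 features with the highest ANOVA F-values.
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape :", X_new.shape)
print("Selected column indices:", selector.get_support(indices=True))

# k="all" keeps every feature but still computes the scores,
# which is handy for inspecting them before choosing k.
all_scores = SelectKBest(score_func=f_classif, k="all").fit(X, y).scores_
print("Scores for all features:", all_scores)
```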

ANOVA stands for Analysis of Variance and is a statistical technique used to determine the relationship between a dependent variable (label) and one or more independent variables (features). It measures the variability between different groups of data and helps to identify which independent variable has a significant impact on the dependent variable.

In machine learning, ANOVA can be used for univariate feature selection: each feature is tested against the label, which helps to identify the features that are most strongly related to the target variable.

Univariate statistical tests evaluate one feature at a time against the target variable. The goal is to determine whether the feature varies significantly with the target. The F-test described next is the basis of two of the scoring functions used in this article.

The F-score, also known as the F-statistic, is a ratio of two variances used in ANOVA. It is calculated as the ratio of the variance between the groups to the variance within the groups. The F-score is used to test the hypothesis that the means of the groups are equal.

Formula:
The F-score can be calculated as follows:
F = (MSB / MSW)
where:
MSB = Mean Square Between (variance between groups)
MSW = Mean Square Within (variance within groups)

The F-score is used to test the null hypothesis, which states that the means of the groups are equal. If the calculated F-score is larger than the critical value from the F-distribution, the null hypothesis is rejected, and it is concluded that there is a significant difference between the means of the groups.
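To make the formula concrete, here is a short sketch that computes F = MSB / MSW by hand for each feature of the iris dataset and compares the result with f_classif. The helper name anova_f is hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)


def anova_f(x, y):
    """Return F = MSB / MSW for one feature x grouped by the labels y."""
    groups = [x[y == label] for label in np.unique(y)]
    n_total, n_groups = len(x), len(groups)
    grand_mean = x.mean()
    # Mean Square Between: between-group sum of squares / (groups - 1).
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    msb = ssb / (n_groups - 1)
    # Mean Square Within: within-group sum of squares / (samples - groups).
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    msw = ssw / (n_total - n_groups)
    return msb / msw


manual_scores = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
sklearn_scores, _ = f_classif(X, y)

print("Manual F-scores :", manual_scores)
print("f_classif scores:", sklearn_scores)
```

The two arrays should agree up to floating-point rounding, since f_classif performs a one-way ANOVA F-test per feature.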

scikit-learn provides several scoring functions that can be passed as score_func:

f_classif: Used for classification problems. It computes the ANOVA (analysis of variance) F-value between each feature and the target variable, and SelectKBest keeps the k features with the highest F-values. For example, SelectKBest(f_classif, k=2) keeps the two highest-scoring features.

f_regression: Used for regression problems. It computes an F-value for each feature from its linear correlation with the target variable, and SelectKBest keeps the k features with the highest F-values. For example, SelectKBest(f_regression, k=3) keeps the three highest-scoring features.

chi2: Used for classification problems with non-negative features, such as counts or frequencies. The chi-squared test compares the observed frequencies with the frequencies expected under independence to determine whether a feature and the class label are associated.
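The sketch below shows chi2 used as the scoring function. It relies on the iris measurements being non-negative; the mean-centered copy of the data is an assumption added only to show what happens when that requirement is violated.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all measurements are non-negative

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi2 scores:", selector.scores_)
print("Selected column indices:", selector.get_support(indices=True))

# chi2 rejects negative values, e.g. features that were mean-centered.
X_centered = X - X.mean(axis=0)
try:
    chi2(X_centered, y)
    raised = False
except ValueError as err:
    raised = True
    print("chi2 failed on negative input:", err)
```

If the features can be negative, f_classif or mutual_info_classif is usually the safer choice, or the features can be rescaled to a non-negative range first.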

Example 1:

In this example, we will use the iris dataset from the scikit-learn library and apply univariate feature selection to the data before training an SVM. The iris dataset contains 150 samples of iris flowers, with four features: sepal length, sepal width, petal length, and petal width. The goal is to use an SVM to classify the iris flowers into three different species based on their features.

Step 1: Load the iris dataset and split the data into training and test sets:

Python3

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris data as a pandas DataFrame
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(['target'], axis=1)
y = df['target']

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

Step 2: Univariate Feature Selection

We will use the SelectKBest class from the sklearn.feature_selection module to perform univariate feature selection, with f_classif as the scoring function because this is a classification problem.

We will set the k parameter to 2, which means that we will keep the two features with the highest F-values. The selector is fit on the training set only, so the test set does not influence which features are chosen.

Python3

from sklearn.feature_selection import SelectKBest, f_classif

# Score every feature with the ANOVA F-test and keep the best two
selector = SelectKBest(f_classif, k=2)
selector.fit(X_train, y_train)

print('Number of input features:', selector.n_features_in_)
print('Input features Names  :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())

Output:

Number of input features: 4
Input features Names  : ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
 'petal width (cm)']
Input features scores : [ 84.80836804  41.29284269 925.55642345 680.77560309]
Input features pvalues: [1.72477507e-23 2.69962606e-14 1.93619072e-72 3.57639330e-65]
Output features Names : ['petal length (cm)' 'petal width (cm)']

Now we will apply selector.transform to the training and test features so that only petal length and petal width are kept.

Python3

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
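By default, transform returns a plain NumPy array, so the column names are lost. If keeping the names is useful, one possible approach (a sketch, not the only option) is to index the original DataFrame with the boolean mask from get_support(). The variable names ending in _df are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

df = load_iris(as_frame=True).frame
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

selector = SelectKBest(f_classif, k=2).fit(X_train, y_train)

# get_support() returns a boolean mask over the input columns.
mask = selector.get_support()
X_train_selected_df = X_train.loc[:, mask]
X_test_selected_df = X_test.loc[:, mask]

print(X_train_selected_df.columns.tolist())
print(X_train_selected_df.shape, X_test_selected_df.shape)
```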

Step 3: Apply the Support Vector Machine Classifier to train the model.

Now that we have selected the best two features, we will train an SVM classifier using these features:

Python3

from sklearn.svm import SVC

# Train a linear SVM on the two selected features
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train_selected, y_train)

Step 4: Evaluate the performance of the SVM classifier

Finally, we will evaluate the performance of the SVM classifier by calculating its accuracy on the test set:

Python3

from sklearn.metrics import accuracy_score

# Predict on the held-out test set and measure accuracy
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:

Accuracy: 1.0

This means that the SVM classifier classified 100% of the test samples correctly using only two features. By reducing the number of features, we have made the model simpler and more interpretable while still achieving good performance. Note that the test set contains only 30 samples, so a single split gives only a rough estimate of accuracy.
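Because the test split holds only 30 samples, a cross-validated estimate can be more informative. The sketch below is one possible way to do this, not part of the original workflow: it wraps the selector and the classifier in a Pipeline so that feature selection is refit inside each fold. The step names "select" and "svm" and the choice of 5 folds are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Selection and classification are chained, so each fold selects
# its features from that fold's training data only.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),
    ("svm", SVC(kernel="linear", C=1, random_state=42)),
])

cv_scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean accuracy  :", cv_scores.mean())
```

Keeping the selector inside the pipeline avoids leaking information from the validation folds into the choice of features.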

Full code:

Python3

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the data and split it into training and test sets
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(['target'], axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Univariate feature selection: keep the two best features
selector = SelectKBest(f_classif, k=2)
selector.fit(X_train, y_train)

print('Number of input features:', selector.n_features_in_)
print('Input features Names  :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Train and evaluate the SVM classifier
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train_selected, y_train)
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print("\n Accuracy:", accuracy)

Output:

Number of input features: 4
Input features Names  : ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
 'petal width (cm)']
Input features scores : [ 84.80836804  41.29284269 925.55642345 680.77560309]
Input features pvalues: [1.72477507e-23 2.69962606e-14 1.93619072e-72 3.57639330e-65]
Output features Names : ['petal length (cm)' 'petal width (cm)']

 Accuracy: 1.0

Example 2:

In this example, we apply the same workflow to a regression problem using the diabetes dataset from scikit-learn. The SelectKBest class from the sklearn.feature_selection module is used with f_regression as the scoring function, which computes an F-value for each feature from its linear correlation with the target. The fit method fits the selector to the training data, the scores_ and pvalues_ attributes give the score and p-value of each feature, and get_feature_names_out returns the names of the selected features.

The value of k determines the number of features that will be selected. In Example 1, k=2, so the top 2 features were selected based on their F-values. Here, k=3, so the top 3 features are selected. The selected features are then used to train a support vector regressor (SVR), which is evaluated by its mean squared error on the test set.

Python3

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.metrics import mean_squared_error

# Load the data and split it into training and test sets
data = load_diabetes(as_frame=True)
df = data.frame

X = df.drop(['target'], axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Univariate feature selection: keep the three best features
selector = SelectKBest(f_regression, k=3)
selector.fit(X_train, y_train)

print('Number of input features:', selector.n_features_in_)
print('Input features Names  :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Train and evaluate the support vector regressor
reg = SVR(kernel='rbf')
reg.fit(X_train_selected, y_train)
y_pred = reg.predict(X_test_selected)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error :", mse)

Output:

Number of input features: 10
Input features Names  : ['age' 'sex' 'bmi' 'bp' 's1' 's2' 's3' 's4' 's5' 's6']
Input features scores : [1.40986700e+01 1.77755064e-02 2.02386965e+02 8.65580384e+01
 1.45561098e+01 8.63143031e+00 6.07087750e+01 7.74171182e+01 1.53967806e+02 6.31023038e+01]
Input features pvalues: [2.02982942e-04 8.94012908e-01 1.39673719e-36 1.49839640e-18
 1.60730187e-04 3.52250747e-03 7.56195523e-14 6.36582277e-17 1.45463546e-29 2.69104622e-14]
Output features Names : ['bmi' 'bp' 's5']
Mean Squared Error : 3668.63356096246
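In Example 2 the value k=3 is fixed by hand. One possible way to choose k from the data instead is to treat it as a hyperparameter and search over it with cross-validation. The sketch below is an illustration under stated assumptions, not part of the original example: the step names "select" and "svr", the grid of k from 1 to 10, and the 5 folds are all choices made here.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("select", SelectKBest(f_regression)),
    ("svr", SVR(kernel="rbf")),
])

# Try every k from 1 to 10 and score each by cross-validated MSE.
search = GridSearchCV(
    pipe,
    param_grid={"select__k": list(range(1, 11))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

print("Best k:", search.best_params_["select__k"])
print("Cross-validated MSE:", -search.best_score_)
print("Test MSE:", -search.score(X_test, y_test))
```

The scorer returns the negated MSE, so the sign is flipped before printing. The same grid could be extended with SVR parameters such as C if tuning the regressor is also of interest.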

In conclusion, univariate feature selection is a useful technique for reducing the complexity of SVM models.
