3.9. Feature selection
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.
3.9.1. Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. scikit-learn exposes feature selection routines as objects that implement the transform method. The k best features can be selected based on:
- sklearn.feature_selection.univariate_selection.SelectKBest(score_func, k=10)
Filter : Select the k lowest p-values.
or by setting a percentile of features to keep using
- sklearn.feature_selection.univariate_selection.SelectPercentile(score_func, percentile=10)
Filter : Select the best percentile of the p-values.
or using common univariate statistical tests for each feature:
- sklearn.feature_selection.univariate_selection.SelectFpr(score_func, alpha=0.05)
Filter : Select the p-values below alpha based on an FPR test (false positive rate), controlling the total amount of false detections.
- sklearn.feature_selection.univariate_selection.SelectFdr(score_func, alpha=0.05)
Filter : Select the p-values corresponding to an estimated false discovery rate of alpha. This uses the Benjamini-Hochberg procedure.
- sklearn.feature_selection.univariate_selection.SelectFwe(score_func, alpha=0.05)
Filter : Select the p-values corresponding to a family-wise error rate of alpha, i.e. a corrected p-value below alpha.
These objects take as input a scoring function that returns univariate p-values.
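For instance, a minimal sketch of keeping the two best features of the iris dataset with SelectKBest and the chi2 scoring function described below (assuming both can be imported directly from sklearn.feature_selection):
>>> from sklearn import datasets
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)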
3.9.1.1. Feature scoring functions
Warning
Beware not to use a regression scoring function with a classification problem.
3.9.1.1.1. For classification
- sklearn.feature_selection.univariate_selection.chi2(X, y)
Compute the χ² (chi-squared) statistic for each class/feature combination.
This transformer can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from either boolean or multinomially distributed data (e.g., term counts in document classification) relative to the classes.
Recall that the χ² statistic measures dependence between stochastic variables, so a transformer based on this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.
Parameters : X : {array-like, sparse matrix}, shape = [n_samples, n_features_in]
Sample vectors.
y : array-like, shape = n_samples
Target vector (class labels).
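As a rough illustration on a made-up count matrix, the function can also be called directly; like the other scoring functions it is assumed here to return the per-feature statistics together with the associated p-values:
>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> X = np.array([[1, 0, 3], [0, 2, 1], [2, 1, 0], [0, 3, 2]])
>>> y = np.array([0, 1, 0, 1])
>>> chi2_scores, p_values = chi2(X, y)
>>> chi2_scores.shape
(3,)
>>> p_values.shape
(3,)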
- sklearn.feature_selection.univariate_selection.f_classif(X, y)
Compute the ANOVA F-value for the provided sample.
Parameters : X : array of shape (n_samples, n_features)
Sample vectors.
y : array of shape (n_samples)
Target vector (class labels).
Returns : F : array of shape (n_features),
the set of F values
pval : array of shape (n_features),
the set of p-values
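A similar sketch for f_classif on the iris dataset; under the assumption above, both returned arrays hold one value per feature:
>>> from sklearn import datasets
>>> from sklearn.feature_selection import f_classif
>>> iris = datasets.load_iris()
>>> F, pval = f_classif(iris.data, iris.target)
>>> F.shape
(4,)
>>> pval.shape
(4,)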
Feature selection with sparse data
If you use sparse data (i.e. data represented as sparse matrices), only chi2 will deal with the data without making it dense.
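A hedged sketch of this, feeding a scipy.sparse matrix of term-like counts through SelectKBest with chi2; the selected output is expected to stay sparse:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X = csr_matrix(np.array([[1, 0, 3], [0, 2, 1], [2, 1, 0], [0, 3, 2]]))
>>> y = np.array([0, 1, 0, 1])
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(4, 2)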
3.9.1.1.2. For regression
- sklearn.feature_selection.univariate_selection.f_regression(X, y, center=True)
Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 3 steps:
1. the regressor of interest and the data are orthogonalized with respect to constant regressors
2. the cross correlation between data and regressors is computed
3. it is converted to an F score and then to a p-value
Parameters : X : array of shape (n_samples, n_features)
the set of regressors that will be tested sequentially
y : array of shape (n_samples)
the target vector (the data to be explained)
center : bool, default=True
If True, X and y are centered
Returns : F : array of shape (n_features),
the set of F values
pval : array of shape (n_features),
the set of p-values
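A minimal sketch on synthetic data, where only the first of three random regressors actually explains the target; the F values and p-values come back one per feature:
>>> import numpy as np
>>> from sklearn.feature_selection import f_regression
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(50, 3)
>>> y = X[:, 0] + 0.1 * rng.rand(50)
>>> F, pval = f_regression(X, y)
>>> F.shape
(3,)
>>> pval.shape
(3,)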
3.9.2. Recursive feature elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
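A rough sketch of this with a linear SVM on the digits data, assuming an RFE class can be imported from sklearn.feature_selection with n_features_to_select and step parameters and a ranking_ attribute after fitting:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.feature_selection import RFE
>>> digits = datasets.load_digits()
>>> svc = SVC(kernel="linear", C=1)
>>> rfe = RFE(estimator=svc, n_features_to_select=10, step=1)
>>> rfe = rfe.fit(digits.data, digits.target)
>>> rfe.ranking_.shape
(64,)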
Examples:
- Recursive feature elimination: A recursive feature elimination example showing the relevance of pixels in a digit classification task.
- Recursive feature elimination with cross-validation: A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.
3.9.3. L1-based feature selection
Linear models penalized with the L1 norm have sparse solutions. When the goal is to reduce the dimensionality of the data to use with another classifier, the transform method of LogisticRegression and LinearSVC can be used:
>>> from sklearn import datasets
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = LinearSVC(C=1, penalty="l1", dual=False).fit_transform(X, y)
>>> X_new.shape
(150, 2)
The parameter C controls the sparsity: the smaller C, the fewer features selected.
Examples:
- Classification of text documents using sparse features: Comparison of different algorithms for document classification including L1-based feature selection.