3.5. Feature selection

3.5.1. Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. The scikits.learn package exposes feature selection routines as objects that implement the transform method. The k best features can be selected using:

scikits.learn.feature_selection.univariate_selection.SelectKBest(score_func, k=10)

Filter : Select the k lowest p-values

or by setting a percentile of features to keep using

scikits.learn.feature_selection.univariate_selection.SelectPercentile(score_func, percentile=10)

Filter : Select the best percentile of the p-values

or using common statistical quantities:

scikits.learn.feature_selection.univariate_selection.SelectFpr(score_func, alpha=0.05)

Filter : Select the p-values below alpha

scikits.learn.feature_selection.univariate_selection.SelectFdr(score_func, alpha=0.05)

Filter : Select the p-values corresponding to an estimated false discovery rate of alpha. This uses the Benjamini-Hochberg procedure.

scikits.learn.feature_selection.univariate_selection.SelectFwe(score_func, alpha=0.05)

Filter : Select the p-values that stay below alpha after correction for multiple comparisons (family-wise error rate)

These objects take as input a scoring function that returns univariate p-values.
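Each of the filters above turns a vector of per-feature p-values into a keep/discard mask. The following plain-Python sketch illustrates those rules; it is not the library's implementation. The function names are hypothetical, the percentile rule is simplified to a rounded-down feature count, and the threshold `alpha / n_features` (a Bonferroni correction) used for the family-wise rule is an assumption about what "corrected p-value" means here.

```python
def k_lowest(pvalues, k):
    """Mask keeping the k features with the lowest p-values (SelectKBest-style)."""
    # Rank feature indices from the smallest to the largest p-value.
    order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
    keep = set(order[:k])
    return [j in keep for j in range(len(pvalues))]


def best_percentile(pvalues, percentile):
    """Mask keeping the best `percentile` percent of the features (simplified)."""
    # Assumption: the number of kept features is rounded down.
    return k_lowest(pvalues, int(len(pvalues) * percentile / 100))


def below_alpha(pvalues, alpha=0.05):
    """Mask keeping features whose p-value is below alpha (SelectFpr-style)."""
    return [p < alpha for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Mask controlling the false discovery rate at alpha (SelectFdr-style)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    # Step-up rule: find the largest rank r with p_(r) <= alpha * r / m ...
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= alpha * rank / m:
            cutoff = rank
    # ... and keep every feature ranked at or before it.
    keep = set(order[:cutoff])
    return [j in keep for j in range(m)]


def bonferroni(pvalues, alpha=0.05):
    """Mask keeping p-values below alpha / n_features (assumed SelectFwe rule)."""
    threshold = alpha / len(pvalues)
    return [p < threshold for p in pvalues]


def apply_mask(X, mask):
    """What a transform step does: keep only the selected columns of X."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]


# Made-up p-values for five features, for illustration only.
pvalues = [0.001, 0.012, 0.039, 0.041, 0.20]
print(k_lowest(pvalues, 2))          # [True, True, False, False, False]
print(below_alpha(pvalues))          # [True, True, True, True, False]
print(benjamini_hochberg(pvalues))   # [True, True, False, False, False]
print(bonferroni(pvalues))           # [True, False, False, False, False]
```

With these p-values the three alpha-based rules keep four, two, and one feature respectively, which shows how much more conservative the family-wise correction is than the uncorrected threshold.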

3.5.1.1. Feature scoring functions

Warning

Be careful not to use a regression scoring function with a classification problem.

3.5.1.1.1. For classification

scikits.learn.feature_selection.univariate_selection.f_classif(X, y)

Compute the ANOVA F-value for the provided sample

Parameters :

X : array of shape (n_samples, n_features)

the set of regressors that will be tested sequentially

y : array of shape (n_samples,)

the target vector (one class label per sample)

Returns :

F : array of shape (n_features,)

the set of F values

pval : array of shape (n_features,)

the set of p-values
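To make the F-value concrete, here is a minimal plain-Python sketch of a one-way ANOVA F statistic computed separately for each feature, with samples grouped by class label. The name `anova_f_values` is hypothetical and the code is an illustration of the statistic, not the library's implementation. The p-value is omitted: it would be the upper tail of an F distribution with (n_classes - 1, n_samples - n_classes) degrees of freedom.

```python
def anova_f_values(X, y):
    """One-way ANOVA F statistic for each column of X, grouped by label y."""
    n_samples = len(X)
    classes = sorted(set(y))
    df_between = len(classes) - 1
    df_within = n_samples - len(classes)

    f_values = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        grand_mean = sum(column) / n_samples
        ss_between = 0.0  # spread of the class means around the grand mean
        ss_within = 0.0   # spread of the samples around their own class mean
        for c in classes:
            group = [v for v, label in zip(column, y) if label == c]
            group_mean = sum(group) / len(group)
            ss_between += len(group) * (group_mean - grand_mean) ** 2
            ss_within += sum((v - group_mean) ** 2 for v in group)
        # A feature that is constant within every class would divide by zero here.
        f_values.append((ss_between / df_between) / (ss_within / df_within))
    return f_values


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
print(anova_f_values(X, y))  # [54.0, 0.0]
```

A large F value means the class means differ far more than the within-class scatter would explain, so the feature is a good candidate to keep.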

3.5.1.1.2. For regression

scikits.learn.feature_selection.univariate_selection.f_regression(X, y, center=True)

Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 3 steps:

1. the regressor of interest and the data are orthogonalized with respect to constant regressors
2. the cross correlation between data and regressors is computed
3. it is converted to an F score then to a p-value

Parameters :

X : array of shape (n_samples, n_features)

the set of regressors that will be tested sequentially

y : array of shape (n_samples,)

the target vector (the data to be explained by each regressor)

center : bool, default True

If True, X and y are centered

Returns :

F : array of shape (n_features,)

the set of F values

pval : array of shape (n_features,)

the set of p-values
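The three steps described for f_regression can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the name `f_regression_sketch` is hypothetical, and the degrees of freedom (`n_samples - 2` when centering, `n_samples - 1` otherwise) are an assumption. The final conversion to a p-value is omitted; it would be the upper tail of an F distribution with (1, dof) degrees of freedom.

```python
import math


def f_regression_sketch(X, y, center=True):
    """F score of each column of X against y, one regressor at a time."""
    n_samples = len(y)

    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    # Step 1: remove the constant (intercept) component from the data.
    target = centered(y) if center else list(y)
    target_norm = math.sqrt(sum(v * v for v in target))

    # Assumption: one degree of freedom is used by the regressor and,
    # when centering, one more by the intercept.
    dof = n_samples - (2 if center else 1)

    f_values = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        # Step 1 (continued): same treatment for the regressor of interest.
        x = centered(column) if center else column
        x_norm = math.sqrt(sum(v * v for v in x))
        # Step 2: normalized cross correlation between regressor and data.
        corr = sum(a * b for a, b in zip(x, target)) / (x_norm * target_norm)
        # Step 3: convert the correlation to an F score.
        # A perfectly correlated or constant column would divide by zero here.
        f_values.append(corr ** 2 / (1.0 - corr ** 2) * dof)
    return f_values


# Toy data: column 0 is correlated with y, column 1 is not.
X = [[0.0, 1.0], [0.0, -1.0], [2.0, -1.0], [2.0, 1.0]]
y = [1.0, 3.0, 3.0, 5.0]
print(f_regression_sketch(X, y))  # approximately [2.0, 0.0]
```

Because each column is scored on its own, the result ignores interactions between regressors, which is what makes the procedure quick.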