This documentation is for scikit-learn version 0.11-git.

5. Putting it all together

5.1. Pipelining

We have seen that some estimators can transform data, and some estimators can predict variables. We can create combined estimators:

[Figure: pca_digits_spectrum.png, the PCA explained variance spectrum on the digits data]
>>> import numpy as np
>>> import pylab as pl
>>> from sklearn import linear_model, decomposition, datasets
>>> from sklearn.pipeline import Pipeline

>>> # Chain a PCA transformer with a logistic regression classifier
>>> logistic = linear_model.LogisticRegression()
>>> pca = decomposition.PCA()
>>> pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> pca.fit(X_digits, y_digits)
PCA(copy=True, n_components=None, whiten=False)
>>> pl.plot(pca.explained_variance_) 
[<matplotlib.lines.Line2D object at ...>]
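
To make the chaining concrete, here is a minimal pure-Python sketch of what a combined estimator does: fitting runs each transformer in turn and trains the final estimator on the transformed data, and predicting replays the same transforms before calling the final estimator. The classes `Center`, `NearestMean` and `TinyPipeline` are hypothetical toys written for this illustration; they are not scikit-learn's implementation.

```python
# Conceptual sketch only: toy classes, not scikit-learn's implementation.

class Center:
    """Toy transformer: subtracts the per-feature mean learned in fit."""

    def fit(self, X, y=None):
        n = len(X)
        self.mean_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.mean_)] for row in X]


class NearestMean:
    """Toy classifier: predicts the class whose mean is closest."""

    def fit(self, X, y):
        self.means_ = {}
        for label in sorted(set(y)):
            rows = [row for row, t in zip(X, y) if t == label]
            self.means_[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.means_, key=lambda c: dist2(row, self.means_[c]))
                for row in X]


class TinyPipeline:
    """Chains (name, step) pairs; every step but the last must transform."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit each transformer and pass its output to the next step.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The last step is the estimator that learns from the transformed data.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Replay the already-fitted transforms, then predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
y_train = [0, 0, 1, 1]
toy_pipe = TinyPipeline([("center", Center()), ("clf", NearestMean())])
toy_pipe.fit(X_train, y_train)
predictions = toy_pipe.predict([[0.1, 0.1], [1.0, 0.9]])
print(predictions)
```

The point of the design is that the caller sees a single object with `fit` and `predict`, so the whole chain can be passed to tools that expect one estimator.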

Parameters of the steps in a pipeline can be set using ‘__’-separated names of the form <step name>__<parameter name>:

>>> pipe.set_params(pca__n_components=30)
Pipeline(steps=[('pca', PCA(copy=True, n_components=30, whiten=False)), ('logistic', LogisticRegression(C=1.0, dual=False, fit_intercept=True, intercept_scaling=1,
          penalty='l2', tol=0.0001))])
>>> pca.n_components
30
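
The naming convention can be illustrated with a short sketch of how such keys might be split into a step name and a parameter name. The helper `split_nested_params` is hypothetical and exists only for this illustration; the library's own routing logic may differ in detail.

```python
def split_nested_params(params):
    """Group 'step__param' keys by step name (hypothetical helper)."""
    grouped = {}
    for key, value in params.items():
        # Split at the first double underscore: 'pca__n_components'
        # becomes step 'pca' and parameter 'n_components'.
        step_name, sep, param_name = key.partition("__")
        if not sep:
            raise ValueError("expected '<step>__<param>', got %r" % key)
        grouped.setdefault(step_name, {})[param_name] = value
    return grouped


routed = split_nested_params({"pca__n_components": 30, "logistic__C": 1.0})
print(routed)
```

Each inner dictionary holds the parameters destined for one named step, which is why the step names given when building the pipeline matter.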

>>> # Grid-search the number of PCA components and the regularization parameter C
>>> from sklearn.grid_search import GridSearchCV
>>> n_components = [10, 15, 20, 30, 40, 50, 64]
>>> Cs = np.logspace(-4, 4, 16)
>>> estimator = GridSearchCV(pipe,
...                          dict(pca__n_components=n_components,
...                               logistic__C=Cs),
...                          n_jobs=-1)
>>> estimator.fit(X_digits, y_digits) 
GridSearchCV(cv=None,...
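
To get a feel for the size of this search, the sketch below enumerates the candidate settings in pure Python. It assumes the grid is the Cartesian product of the two lists, and it rebuilds the values of `np.logspace(-4, 4, 16)` by hand so that the snippet has no dependencies; the name `grid` is introduced here for illustration only.

```python
from itertools import product

n_components = [10, 15, 20, 30, 40, 50, 64]
# Pure-Python stand-in for np.logspace(-4, 4, 16):
# 16 exponents evenly spaced between -4 and 4, used as powers of 10.
Cs = [10 ** (-4 + 8 * i / 15) for i in range(16)]

# One dictionary per candidate, keyed by the '__'-separated names.
grid = [{"pca__n_components": n, "logistic__C": C}
        for n, C in product(n_components, Cs)]

print(len(grid))
print(grid[0])
print(grid[-1])
```

With 7 values for `n_components` and 16 for `C`, there are 7 × 16 = 112 candidate pipelines, each of which is refitted for every cross-validation fold, which is why running the search in parallel with `n_jobs=-1` is worthwhile.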

5.2. Face recognition with eigenfaces

The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild” dataset, also known as LFW.

[Figures: Prediction, Eigenfaces]

Expected results for the top 5 most represented people in the dataset:

                   precision    recall  f1-score   support

Gerhard_Schroeder       0.91      0.75      0.82        28
  Donald_Rumsfeld       0.84      0.82      0.83        33
       Tony_Blair       0.65      0.82      0.73        34
     Colin_Powell       0.78      0.88      0.83        58
    George_W_Bush       0.93      0.86      0.90       129

      avg / total       0.86      0.84      0.85       282
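
As a sanity check on how to read the last row, the sketch below recomputes it from the per-class rows, assuming that "avg / total" reports the support-weighted mean of each score and the sum of the support column. Because the inputs are already rounded to two decimals, the recomputed averages can differ from the printed ones in the last digit.

```python
# (precision, recall, f1-score, support), copied from the report above
report = {
    "Gerhard_Schroeder": (0.91, 0.75, 0.82, 28),
    "Donald_Rumsfeld": (0.84, 0.82, 0.83, 33),
    "Tony_Blair": (0.65, 0.82, 0.73, 34),
    "Colin_Powell": (0.78, 0.88, 0.83, 58),
    "George_W_Bush": (0.93, 0.86, 0.90, 129),
}

total_support = sum(row[3] for row in report.values())


def weighted(column):
    """Support-weighted mean of one score column (assumed averaging rule)."""
    return sum(row[column] * row[3] for row in report.values()) / total_support


avg_precision = weighted(0)
avg_recall = weighted(1)
avg_f1 = weighted(2)

print(total_support)
print(round(avg_precision, 3), round(avg_recall, 3), round(avg_f1, 3))
```

Under this reading, classes with many test samples, such as George_W_Bush, dominate the averages, so the summary row says little about the rarer classes.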

5.3. Open problem: stock market structure

Can we predict the variation in stock prices for Google?