This documentation is for scikit-learn version 0.11-git.

5. Putting it all together

5.1. Pipelining

We have seen that some estimators can transform data and that others can predict variables. A Pipeline chains them into a single combined estimator:

[Figure: PCA spectrum of the digits data (tutorial/statistical_inference/pca_digits_spectrum.png)]

>>> import pylab as pl
>>> from sklearn import linear_model, decomposition, datasets
>>> from sklearn.pipeline import Pipeline

>>> # Chain a PCA transformer and a logistic regression classifier
>>> logistic = linear_model.LogisticRegression()
>>> pca = decomposition.PCA()
>>> pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target

>>> # Fit the PCA on its own and plot its spectrum (PCA ignores y_digits)
>>> pca.fit(X_digits, y_digits)
PCA(copy=True, n_components=None, whiten=False)
>>> pl.plot(pca.explained_variance_)
[<matplotlib.lines.Line2D object at ...>]
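
The spectrum plotted above can also be read numerically. The following is a minimal sketch, assuming a recent scikit-learn release with the sklearn package name. The 95% target and the names cumulative and n_needed are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_digits = datasets.load_digits().data

pca = PCA().fit(X_digits)

# Cumulative fraction of the total variance kept by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative fraction reaches the target.
# The 0.95 target is a hypothetical choice, not a value from the tutorial.
target = 0.95
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, "components keep at least", target, "of the variance")
```

A count obtained this way is one plausible starting point for the n_components values tried later in the grid search.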

Parameters of the steps in a pipeline can be set using ‘__’-separated names of the form step__parameter:

>>> pipe.set_params(pca__n_components=30)
Pipeline(steps=[('pca', PCA(copy=True, n_components=30, whiten=False)), ('logistic', LogisticRegression(C=1.0, dual=False, fit_intercept=True, intercept_scaling=1,
          penalty='l2', tol=0.0001))])
>>> pca.n_components
30
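
To make the naming rule concrete, here is a small sketch, assuming a recent scikit-learn release. It lists the nested parameter names a pipeline exposes and then sets two of them at once. The step names "pca" and "logistic" follow the tutorial; the value C=10.0 is an arbitrary illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("pca", PCA()), ("logistic", LogisticRegression())])

# Every nested parameter is exposed as "<step name>__<parameter name>".
nested_names = sorted(name for name in pipe.get_params() if "__" in name)

# Setting a nested parameter updates the underlying step object in place.
# C=10.0 is an arbitrary value chosen for illustration.
pipe.set_params(pca__n_components=30, logistic__C=10.0)
pca_step = pipe.named_steps["pca"]
logistic_step = pipe.named_steps["logistic"]
print(pca_step.n_components, logistic_step.C)  # prints: 30 10.0
```

These same nested names are the keys a grid search expects in its parameter grid.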

>>> import numpy as np
>>> from sklearn.grid_search import GridSearchCV
>>> # Search jointly over the number of PCA components and the logistic C
>>> n_components = [10, 15, 20, 30, 40, 50, 64]
>>> Cs = np.logspace(-4, 4, 16)
>>> estimator = GridSearchCV(pipe,
...                          dict(pca__n_components=n_components,
...                               logistic__C=Cs),
...                          n_jobs=-1)
>>> estimator.fit(X_digits, y_digits)
GridSearchCV(cv=None,...
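
Once the search has been fitted, the selected parameters can be read back from it. The sketch below assumes a recent scikit-learn release, where GridSearchCV is imported from sklearn.model_selection rather than sklearn.grid_search. The reduced grid, cv=3 and max_iter=1000 are assumptions made only to keep the run short and quiet; they are not the tutorial's settings.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[("pca", PCA()),
                       ("logistic", LogisticRegression(max_iter=1000))])

# A deliberately small grid (assumption: chosen only to keep the run short).
param_grid = {
    "pca__n_components": [10, 30],
    "logistic__C": np.logspace(-2, 2, 3),
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_digits, y_digits)

# Best parameter combination and its mean cross-validated accuracy.
best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

By default the search refits the pipeline on the full data with the best parameters, so search.predict can be used directly afterwards.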

5.2. Face recognition with eigenfaces

The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild” dataset, also known as LFW:

http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)


[Figures: Prediction, Eigenfaces]

Expected results for the top 5 most represented people in the dataset:

                   precision    recall  f1-score   support

Gerhard_Schroeder       0.91      0.75      0.82        28
  Donald_Rumsfeld       0.84      0.82      0.83        33
       Tony_Blair       0.65      0.82      0.73        34
     Colin_Powell       0.78      0.88      0.83        58
    George_W_Bush       0.93      0.86      0.90       129

      avg / total       0.86      0.84      0.85       282
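
A report of this shape is what sklearn.metrics.classification_report prints. The sketch below shows one way such a report could be produced, assuming a recent scikit-learn release and assuming a PCA projection followed by an SVM classifier; the tutorial text does not spell out the model, so the pipeline and its parameter values are illustrative guesses. To stay runnable without the 233MB download it uses the small digits dataset as a stand-in for LFW, so its numbers are unrelated to the table above.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): the small digits set, so no download is needed.
X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Project onto the leading principal components ("eigenfaces" when the
# inputs are face images), then classify in that reduced space.
# n_components, C and gamma are illustrative values, not tuned settings.
model = Pipeline(steps=[
    ("pca", PCA(n_components=30, whiten=True)),
    ("svc", SVC(kernel="rbf", C=10.0, gamma="scale")),
])
model.fit(X_train, y_train)

# Per-class precision, recall, f1-score and support on the held-out split.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```

With face data loaded instead, for example through datasets.fetch_lfw_people, the rows of the PCA components_ array can be reshaped to the image dimensions and displayed as eigenfaces.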

5.3. Open problem: stock market structure

Can we predict the variation in stock prices for Google?