This documentation is for scikit-learn version 0.11-git.

2.4.4. Exercises

To do the exercises, copy the content of the ‘skeletons’ folder as a new folder named ‘workspace’:

% cp -r skeletons workspace

You can then edit the content of the workspace without fear of losing the original exercise instructions.

Then start an ipython shell and run the work-in-progress script with:

[1] %run workspace/exercise_XX_script.py arg1 arg2 arg3

If an exception is raised, use %debug to start a post-mortem ipdb session.

Refine the implementation and iterate until the exercise is solved.

For each exercise, the skeleton file provides all the necessary import statements, boilerplate code to load the data, and sample code to evaluate the predictive accuracy of the model.

2.4.4.1. Exercise 0: Digits recognition

  • Train a Support Vector Machine classifier.
  • Evaluate the performance using cross validation.
  • Change the value of C and gamma to see the impact on the generalization performance.

ipython command line:

%run workspace/exercise_00_digits.py
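
As a starting point, the three steps of this exercise might look like the sketch below. It assumes a recent scikit-learn release, where cross_val_score lives in sklearn.model_selection (in the version this page documents it was in sklearn.cross_validation). The C and gamma values tried here are arbitrary illustrations, not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target


def mean_cv_accuracy(C, gamma, n_folds=5):
    """Mean accuracy of an RBF-kernel SVM over n_folds cross-validation folds."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_score(clf, X, y, cv=n_folds).mean()


# Arbitrary values, chosen only to show that the score depends on them.
results = {}
for C in (1.0, 10.0):
    for gamma in (0.001, 0.01):
        results[(C, gamma)] = mean_cv_accuracy(C, gamma)
        print("C=%g gamma=%g accuracy=%.3f" % (C, gamma, results[(C, gamma)]))
```

Comparing the printed scores gives a first impression of how sensitive the generalization performance is to each parameter.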

2.4.4.2. Exercise 1: Language identification

  • Write a text classification pipeline using character 3-grams, with data from Wikipedia articles as the training set.
  • Evaluate the performance on a held-out test set.

ipython command line:

%run workspace/exercise_01_language_train_model.py data/languages/paragraphs/
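
One possible shape for such a pipeline is sketched below. It is not the skeleton's solution: the four training sentences are a toy stand-in for the Wikipedia paragraphs, the "en" and "fr" labels are made up for the illustration, and Perceptron is just one linear classifier that could be used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Toy stand-in for the Wikipedia paragraphs; the real exercise loads them
# from the folder given on the command line.
train_docs = [
    "the quick brown fox jumps over the lazy dog",
    "this is a short sentence written in english",
    "le renard brun saute par-dessus le chien paresseux",
    "je ne sais pas pourquoi il pleut toujours ici",
]
train_labels = ["en", "en", "fr", "fr"]

# Held-out documents that the model never sees during training.
test_docs = ["the dog sleeps in the house", "le chien dort dans la maison"]
test_labels = ["en", "fr"]

pipeline = Pipeline([
    # analyzer="char" with ngram_range=(3, 3) keeps only character 3-grams.
    ("vect", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
    ("clf", Perceptron(random_state=0)),
])
pipeline.fit(train_docs, train_labels)

predicted = pipeline.predict(test_docs)
accuracy = pipeline.score(test_docs, test_labels)
print("predicted:", list(predicted))
print("held-out accuracy: %.2f" % accuracy)
```

With so few documents the accuracy figure means little; the point is only the structure of vectorizer, classifier and held-out evaluation.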

2.4.4.3. Exercise 2: Sentiment Analysis on movie reviews

  • Write a text classification pipeline to classify movie reviews as either positive or negative.
  • Find a good set of parameters using grid search.
  • Evaluate the performance on a held-out test set.
  • Display the most discriminative features for each class.

ipython command line:

%run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/
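
The sketch below shows how the four steps could fit together. It assumes a recent scikit-learn release (GridSearchCV in sklearn.model_selection), uses a handful of invented sentences in place of the movie review corpus, and picks LinearSVC and a small parameter grid purely for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the movie review corpus.
train_docs = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "i loved this brilliant movie",
    "brilliant, moving and great fun",
    "a boring film with a terrible script",
    "terrible acting and a dull story",
    "i hated this awful movie",
    "awful, boring and dull",
]
train_labels = ["pos"] * 4 + ["neg"] * 4
test_docs = ["a great and brilliant story", "a dull and awful film"]
test_labels = ["pos", "neg"]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(train_docs, train_labels)
print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(test_docs, test_labels))

# For a binary linear model, large positive coefficients point to
# classes_[1] and large negative ones to classes_[0].
best = search.best_estimator_
vocabulary = best.named_steps["vect"].vocabulary_
feature_names = np.array(sorted(vocabulary, key=vocabulary.get))
coef = best.named_steps["clf"].coef_.ravel()
order = np.argsort(coef)
classes = best.named_steps["clf"].classes_
top_features = {
    str(classes[0]): [str(w) for w in feature_names[order[:3]]],
    str(classes[1]): [str(w) for w in feature_names[order[-3:]]],
}
print(top_features)
```

Reading the feature names back from vocabulary_ avoids depending on helper methods whose names have changed between releases.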

2.4.4.4. Exercise 3: Unsupervised topic extraction

  • Train a Non-negative Matrix Factorization model on the movie review dataset to extract the 10 main topics.
  • Display the most important words for each topic.

ipython command line:

%run workspace/exercise_03_topic.py data/movie_reviews/txt_sentoken/
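
A minimal sketch of the topic extraction step follows. The four documents are invented stand-ins for the movie reviews, so only 2 topics are extracted here instead of the 10 the exercise asks for; the TF-IDF weighting and the English stop word list are assumptions, not requirements of the exercise.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the movie review corpus.
docs = [
    "the spaceship crew explores a distant alien planet",
    "an alien invasion threatens the planet and its crew",
    "the detective investigates a murder in the city",
    "a murder mystery keeps the city detective busy",
]

n_topics = 2      # the exercise asks for 10 on the full dataset
n_top_words = 3

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

nmf = NMF(n_components=n_topics, random_state=0)
nmf.fit(tfidf)

# Column i of the TF-IDF matrix corresponds to the word with index i.
vocabulary = vectorizer.vocabulary_
feature_names = np.array(sorted(vocabulary, key=vocabulary.get))

topics = []
for topic_idx, weights in enumerate(nmf.components_):
    # Words with the largest weights in a component describe that topic.
    top = feature_names[np.argsort(weights)[::-1][:n_top_words]]
    topics.append([str(word) for word in top])
    print("Topic #%d: %s" % (topic_idx, " ".join(topics[-1])))
```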

2.4.4.5. Exercise 4: CLI text classification utility

Using the results of the previous exercises and the cPickle module of the standard library, write a command line utility that detects the language of some text provided on stdin and estimates its polarity (positive or negative) if the text is written in English.

Bonus point if the utility is able to give a confidence level for its predictions.
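
The outline below is one way such a utility could be organized. It is a sketch under several assumptions: it targets Python 3, where the pickle module replaces Python 2's cPickle; the two models are assumed to have been pickled beforehand and to accept a list of raw strings; the "en" label and all function names are hypothetical; and the confidence is only reported for models that expose predict_proba.

```python
import pickle
import sys

ENGLISH_LABEL = "en"  # hypothetical label; use whatever your language model emits


def load_model(path):
    """Unpickle a fitted model; only do this with files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_with_confidence(model, text):
    """Return (label, confidence); confidence is None without predict_proba."""
    label = model.predict([text])[0]
    confidence = None
    if hasattr(model, "predict_proba"):
        confidence = float(max(model.predict_proba([text])[0]))
    return label, confidence


def analyze(text, language_model, sentiment_model):
    """Detect the language, and the polarity if the text is English."""
    language, language_conf = predict_with_confidence(language_model, text)
    result = {"language": language, "language_confidence": language_conf}
    if language == ENGLISH_LABEL:
        polarity, polarity_conf = predict_with_confidence(sentiment_model, text)
        result["polarity"] = polarity
        result["polarity_confidence"] = polarity_conf
    return result


def main(argv):
    if len(argv) != 3:
        sys.stderr.write("usage: %s LANGUAGE_MODEL SENTIMENT_MODEL < text\n" % argv[0])
        return 2
    language_model = load_model(argv[1])
    sentiment_model = load_model(argv[2])
    result = analyze(sys.stdin.read(), language_model, sentiment_model)
    for key, value in sorted(result.items()):
        print("%s: %s" % (key, value))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Keeping analyze separate from main makes the logic testable without touching stdin or the filesystem.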

Note

While Python's cPickle works, it is recommended to use the optimized pickler in sklearn.externals.joblib for large models.
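
For illustration, saving and reloading an object with joblib might look like the sketch below. It assumes the standalone joblib package is installed, since the copy bundled inside scikit-learn was removed in later releases; the dictionary holding a large NumPy array is only a stand-in for a fitted model.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a fitted model that holds a large array of coefficients.
model_like = {"coef": np.arange(100000, dtype=np.float64)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model_like, path)    # write the object to disk
    restored = joblib.load(path)     # read it back into memory

print(restored["coef"].shape)
```

As with pickle, only load files that come from a source you trust.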