.. _ta_tutorial_exercises:

Exercises
=========

To do the exercises, copy the content of the 'skeletons' folder as
a new folder named 'workspace'::

  % cp -r skeletons workspace

You can then edit the content of the workspace without fear of losing
the original exercise instructions.

Then start an IPython shell and run the work-in-progress script with::

  [1] %run workspace/exercise_XX_script.py arg1 arg2 arg3

If an exception is raised, use ``%debug`` to start a post-mortem
ipdb session.

Refine the implementation and iterate until the exercise is solved.

**For each exercise, the skeleton file provides all the necessary import
statements, boilerplate code to load the data and sample code to evaluate
the predictive accuracy of the model.**


.. _digits_exercise:

Exercise 0: Digits recognition
------------------------------

- Train a Support Vector Machine classifier.

- Evaluate the performance using cross-validation.

- Change the values of ``C`` and ``gamma`` to see their impact on the
  generalization performance.

ipython command line::

  %run workspace/exercise_00_digits.py
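
As a starting point, the steps above can be sketched as follows. This is
a minimal sketch, not the reference solution: it assumes a recent
scikit-learn release where ``sklearn.model_selection`` is available, and
the ``(C, gamma)`` pairs are illustrative values chosen here, not tuned
ones.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Load the bundled 8x8 handwritten digits dataset.
digits = load_digits()
X, y = digits.data, digits.target

# Illustrative (C, gamma) pairs; these values are assumptions, not tuned.
candidates = [(1.0, 0.001), (10.0, 0.001), (1.0, 0.1)]

mean_scores = {}
for C, gamma in candidates:
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    # 5-fold cross-validation; each fold yields one accuracy score.
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[(C, gamma)] = scores.mean()
    print("C=%g gamma=%g accuracy=%.3f (+/- %.3f)"
          % (C, gamma, scores.mean(), scores.std()))
```

Comparing the printed scores across the pairs gives a first impression of
how sensitive the classifier is to each parameter.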


.. _language_id_exercise:

Exercise 1: Language identification
-----------------------------------

- Write a text classification pipeline using character 3-grams, with
  data from Wikipedia articles as the training set.

- Evaluate the performance on a held-out test set.

ipython command line::

  %run workspace/exercise_01_language_train_model.py data/languages/paragraphs/
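
The shape of such a pipeline might look like the sketch below. It is only
an illustration under stated assumptions: the tiny in-line corpus is a
hypothetical stand-in for the paragraphs folder, the choice of
``Perceptron`` as the classifier is one option among several, and the
variable names are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Wikipedia paragraphs.
train_docs = [
    "the quick brown fox jumps over the lazy dog",
    "this is a short sentence written in one language",
    "where is the nearest train station",
    "le renard brun saute par-dessus le chien paresseux",
    "ceci est une phrase courte dans une autre langue",
    "la gare la plus proche se trouve ici",
]
train_labels = ["en", "en", "en", "fr", "fr", "fr"]

# Hypothetical held-out documents, never seen during fitting.
test_docs = ["the dog is in the station", "le chien est dans la gare"]
test_labels = ["en", "fr"]

pipeline = Pipeline([
    # analyzer="char" with ngram_range=(3, 3) extracts character 3-grams.
    ("vect", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
    ("clf", Perceptron(random_state=0)),
])
pipeline.fit(train_docs, train_labels)

predicted = pipeline.predict(test_docs)
print("held-out accuracy: %.2f" % accuracy_score(test_labels, predicted))
```

With the real dataset, the held-out documents would come from a split of
the loaded paragraphs rather than from hand-written lists.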


.. _sentiment_analysis_exercise:

Exercise 2: Sentiment Analysis on movie reviews
-----------------------------------------------

- Write a text classification pipeline to classify movie reviews as either
  positive or negative.

- Find a good set of parameters using grid search.

- Evaluate the performance on a held-out test set.

- Display the most discriminative features for each class.

ipython command line::

  %run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/
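
The grid search and feature inspection steps could be approached along
the lines below. This is a hedged sketch: the eight in-line reviews are a
hypothetical stand-in for the movie review dataset, the parameter grid is
an arbitrary illustration, and it assumes a recent scikit-learn release
that provides ``sklearn.model_selection.GridSearchCV``. The held-out
evaluation is left out for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews; 1 = positive, 0 = negative.
docs = [
    "a great and moving film with a wonderful cast",
    "wonderful story, great acting, I loved it",
    "a brilliant and enjoyable movie",
    "I loved the great soundtrack and the brilliant script",
    "a boring and terrible film with an awful cast",
    "awful story, terrible acting, I hated it",
    "a dull and boring movie",
    "I hated the awful soundtrack and the dull script",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Illustrative grid; step names prefix the parameter names.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipeline, param_grid, cv=2)
grid.fit(docs, labels)
print("best parameters:", grid.best_params_)

# Inspect the refitted best model: large positive coefficients push
# towards class 1, large negative ones towards class 0.
best = grid.best_estimator_
vocab = best.named_steps["vect"].vocabulary_
terms = sorted(vocab, key=vocab.get)  # terms ordered by column index
coef = best.named_steps["clf"].coef_.ravel()
order = np.argsort(coef)
most_negative = [terms[i] for i in order[:3]]
most_positive = [terms[i] for i in order[-3:][::-1]]
print("negative:", most_negative)
print("positive:", most_positive)
```

Reading the coefficients this way only works for linear models; other
classifiers would need a different inspection strategy.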


.. _topic_extraction_exercise:

Exercise 3: Unsupervised topic extraction
-----------------------------------------

- Train a Non-negative Matrix Factorization (NMF) model on the movie
  review dataset to extract the 10 main topics.

- Display the most important words for each topic.

ipython command line::

  %run workspace/exercise_03_topic.py data/movie_reviews/txt_sentoken/
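
One possible outline for these two steps is sketched below. It is an
illustration only: the six short documents are a hypothetical stand-in
for the movie reviews, and the number of topics is reduced to 2 because
the toy corpus is far too small for 10.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two rough themes.
docs = [
    "the spaceship crew explores a distant alien planet",
    "alien invasion threatens the planet and the spaceship fleet",
    "a spaceship pilot fights an alien army",
    "the detective investigates a murder in the city",
    "a murder mystery with a clever detective and a dark city",
    "the city police and the detective chase the murder suspect",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# n_components is the number of topics; 2 here, 10 in the exercise.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)  # document-topic weights

# Recover the terms in column order from the fitted vocabulary.
vocab = vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)

topics = []
for topic_idx, weights in enumerate(nmf.components_):
    # Highest-weighted terms of this topic, in decreasing order.
    top_terms = [terms[i] for i in np.argsort(weights)[::-1][:4]]
    topics.append(top_terms)
    print("Topic #%d: %s" % (topic_idx, " ".join(top_terms)))
```

Each row of ``nmf.components_`` holds one topic's weight for every term,
which is why sorting a row yields that topic's most important words.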


Exercise 4: CLI text classification utility
-------------------------------------------

Using the results of the previous exercises and the ``cPickle``
module of the standard library (``pickle`` in Python 3), write a
command line utility that detects the language of some text provided
on ``stdin`` and estimates the polarity (positive or negative) if the
text is written in English.

Bonus point if the utility is able to give a confidence level for its
predictions.

.. note:: While Python's ``cPickle`` works, it is recommended to use the
  optimized pickler from ``sklearn.externals.joblib`` for large models.
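
The overall flow of such a utility might be sketched as follows. This is
a rough sketch under several assumptions: it targets Python 3, where the
module is named ``pickle``; the two toy models are hypothetical stand-ins
for the ones trained in the previous exercises; ``LogisticRegression`` is
chosen here only because it exposes ``predict_proba`` for the confidence
level; and the helper names ``train_toy_model``,
``predict_with_confidence``, ``describe`` and ``main`` are made up for
this example.

```python
import pickle
import sys

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_toy_model(docs, labels, **vectorizer_params):
    """Fit a small text classifier (stand-in for the real models)."""
    model = make_pipeline(TfidfVectorizer(**vectorizer_params),
                          LogisticRegression())
    return model.fit(docs, labels)


def predict_with_confidence(model, text):
    """Return the most probable class and its estimated probability."""
    proba = model.predict_proba([text])[0]
    best = proba.argmax()
    return model.classes_[best], proba[best]


def describe(text, language_model, polarity_model):
    """Build the report lines for one piece of text."""
    lang, conf = predict_with_confidence(language_model, text)
    lines = ["language: %s (confidence %.2f)" % (lang, conf)]
    if lang == "en":  # polarity is only estimated for English text
        polarity, conf = predict_with_confidence(polarity_model, text)
        lines.append("polarity: %s (confidence %.2f)" % (polarity, conf))
    return lines


def main(stream=sys.stdin):
    """Read all text from the stream and print the report."""
    for line in describe(stream.read(), language_model, polarity_model):
        print(line)


# Hypothetical toy models; a real utility would load pickled files
# produced by the training scripts instead.
language_model = train_toy_model(
    ["the dog is in the station", "this is a short sentence",
     "le chien est dans la gare", "ceci est une phrase courte"],
    ["en", "en", "fr", "fr"],
    analyzer="char", ngram_range=(3, 3))
polarity_model = train_toy_model(
    ["a great and wonderful film", "I loved this brilliant movie",
     "a boring and terrible film", "I hated this awful movie"],
    ["positive", "positive", "negative", "negative"])

# Round-trip through pickle to mimic saving and reloading the models.
language_model = pickle.loads(pickle.dumps(language_model))
polarity_model = pickle.loads(pickle.dumps(polarity_model))
```

In a script, ``main()`` would be called under an
``if __name__ == "__main__":`` guard so that text piped on ``stdin`` is
classified, and the in-memory round trip would be replaced by
``pickle.dump`` in the training scripts and ``pickle.load`` in the
utility.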