2.4.1. Tutorial setup
The following assumes you have extracted the source distribution of this tutorial somewhere on your local disk. Alternatively, you can use git to clone the scikit-learn repository directly from GitHub onto your local disk:
% git clone https://github.com/scikit-learn/scikit-learn.git
% cd scikit-learn/doc/tutorial/text_analytics
In the following we will name this folder $TUTORIAL_HOME. It should contain the following folders:
- data - folder to put the datasets used during the tutorial
- skeletons - sample incomplete scripts for the exercises
- solutions - solutions of the exercises
You can already copy the skeletons into a new folder named workspace, where you will edit your own files for the exercises while keeping the original skeletons intact:
% cp -r skeletons workspace
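The same step can be sketched in Python, for instance on systems where cp is not available. This is only an illustration: the function name prepare_workspace is hypothetical, and the folder names are the ones listed above.

```python
import shutil
from pathlib import Path

# Folder names taken from the list above.
EXPECTED_FOLDERS = ("data", "skeletons", "solutions")


def prepare_workspace(tutorial_home):
    """Check the tutorial layout, then copy skeletons/ to workspace/."""
    home = Path(tutorial_home)
    missing = [name for name in EXPECTED_FOLDERS if not (home / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing folders in {home}: {missing}")

    workspace = home / "workspace"
    if workspace.exists():
        # Leave an existing workspace alone so edited files are not lost.
        return workspace
    shutil.copytree(home / "skeletons", workspace)
    return workspace
```

Calling prepare_workspace with the path of $TUTORIAL_HOME returns the path of the workspace folder. Calling it a second time leaves an existing workspace untouched.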
2.4.1.1. Install scikit-learn build dependencies
Please refer to the scikit-learn install page for per-system instructions.
You must have numpy, scipy, matplotlib and ipython installed.
Under Debian or Ubuntu Linux you should use:
% sudo apt-get install build-essential python-dev python-numpy \
    python-numpy-dev python-scipy libatlas-dev g++ python-matplotlib \
    ipython
Under MacOSX you should probably use a scientific python distribution such as Scipy Superpack.
Under Windows, Python(x,y) is probably your best bet to get a working numpy / scipy environment up and running.
Alternatively, under Windows and MacOSX you can use the EPD (Enthought Python Distribution), which is a (non-open source) packaging of the scientific python stack.
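Whichever installation route you choose, a quick way to confirm that the required packages are visible to your interpreter is to look them up without importing them. The sketch below is an illustration only: the helper name missing_modules is hypothetical, and it assumes the standard import names (the ipython package is imported as IPython).

```python
from importlib.util import find_spec

# Import names of the packages listed above; note that the ipython
# package is imported as "IPython".
REQUIRED_MODULES = ("numpy", "scipy", "matplotlib", "IPython")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from missing_modules() means all four packages were found. It does not guarantee that they import cleanly or that their versions are compatible.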
2.4.1.2. Build scikit-learn from source
Here are the instructions to install the current master from source on a POSIX system (e.g. Linux and MacOSX):
% git clone https://github.com/scikit-learn/scikit-learn.git
% cd scikit-learn
You can then build it locally and install this working directory as an “editable” python package:
% python setup.py build_ext -i
% pip install -e .
Alternatively you can install the library globally (or in a virtualenv):
% python setup.py build
% sudo python setup.py install
You should also be able to launch the tests from anywhere in the system (if nose is installed) with the following:
% nosetests sklearn
The output should end with OK as in:
----------------------------------------------------------------------
Ran 589 tests in 36.876s
OK (SKIP=2)
If this is not the case, please send a mail to the scikit-learn mailing list including the error messages along with the version numbers of all the aforementioned dependencies and your operating system.
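The version information requested above can be gathered with a few lines of Python. This is a minimal sketch, not an official reporting tool: the function name environment_report is hypothetical, and it assumes each package exposes a __version__ attribute, falling back to "unknown" when it does not.

```python
import importlib
import platform
import sys


def environment_report(modules=("numpy", "scipy", "matplotlib", "IPython", "sklearn")):
    """Build a plain-text summary of the OS, Python and package versions."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            lines.append(f"{name}: not installed")
            continue
        # Assumption: the package exposes __version__; otherwise report "unknown".
        lines.append(f"{name}: {getattr(module, '__version__', 'unknown')}")
    return "\n".join(lines)
```

Running print(environment_report()) gives text that can be pasted into the mail next to the error messages.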
In the rest of the tutorial, the path to the scikit-learn source folder will be named $SKL_HOME.
As usual, building from source under Windows is slightly more complicated. Check out the build instructions on the scikit-learn website.
2.4.1.3. Download the datasets
Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data sub-folder and run the fetch_data.py script from there (after having read it first).
For instance:
% cd $TUTORIAL_HOME/data/languages
% less fetch_data.py
% python fetch_data.py
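If there are several data sub-folders, the same steps can be repeated for each of them from Python. The sketch below is only an illustration under stated assumptions: the helper names find_fetch_scripts and run_fetch_scripts are hypothetical, each script is assumed to sit directly inside its sub-folder, and you are still expected to read every script before running it.

```python
import subprocess
import sys
from pathlib import Path


def find_fetch_scripts(data_dir):
    """Return the fetch_data.py scripts found one level below data_dir."""
    return sorted(Path(data_dir).glob("*/fetch_data.py"))


def run_fetch_scripts(data_dir):
    """Run each script from inside its own folder, as described above."""
    for script in find_fetch_scripts(data_dir):
        print(f"running {script}")
        # check=True stops at the first script that exits with an error.
        subprocess.run([sys.executable, script.name], cwd=script.parent, check=True)
```

Calling find_fetch_scripts first lists the scripts so they can be reviewed. run_fetch_scripts then executes them one by one with the same interpreter that runs the sketch.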