This documentation is for scikit-learn version 0.11-git.


2.4.5. How to select the right algorithm for the task

To conclude this session, here are some practical hints for selecting the right algorithm when facing a real problem.

  • If the data is high-dimensional and sparse (as with text data), linear classifiers with a bit of regularization will work well most of the time.
  • If the data is dense and low- to medium-dimensional, try reducing the dimensionality further, for instance with PCA, and try both linear and non-linear models (e.g. SVC with an RBF kernel).
  • SVC with a Gaussian RBF kernel and KMeans clustering can benefit a lot from data normalization, either with feature-wise scaling using Scaler or with whitening using RandomizedPCA with whiten=True. Try various values for n_components with grid search to be sure not to truncate the data too hard.
  • There is no free lunch: the best algorithm is data-dependent. If you try many different models, reserve a held-out evaluation set that is not used during the model selection process.
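
The last three hints can be combined in one short sketch. It is an illustration only, and it assumes a recent scikit-learn release rather than 0.11: there, Scaler is available as StandardScaler, the randomized PCA is reached through PCA(svd_solver="randomized"), and GridSearchCV lives in sklearn.model_selection. The digits dataset and the grid values are arbitrary choices made for the example, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Dense, medium-dimensional data (64 features); chosen only for illustration.
X, y = load_digits(return_X_y=True)

# Reserve a held-out evaluation set that model selection never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature-wise scaling, then whitening PCA, then an RBF-kernel SVC.
# Putting the steps in a Pipeline means the scaler and the PCA are
# refit on each cross-validation training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(whiten=True, svd_solver="randomized", random_state=0)),
    ("svc", SVC(kernel="rbf")),
])

# Try several values of n_components so the data is not truncated too hard.
param_grid = {
    "pca__n_components": [8, 16, 32],
    "svc__C": [1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

# Score once on the held-out set, after model selection is finished.
test_score = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out accuracy:", test_score)
```

The held-out score is computed only once, at the very end; reusing it to choose between further models would turn it into part of the model selection process.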

A comprehensive practical guide / FAQ / Howto is in the works. Stay tuned!