2.4.5. How to select the right algorithm for the task
To conclude this session, here are some hints for selecting the right algorithm when facing a practical problem.
- If the data is high dimensional and sparse (text data), most of the time linear classifiers with a bit of regularization will work well.
- If the data is dense and low to medium dimensional, try to further reduce the dimensionality, for instance with PCA, and try both linear and nonlinear models (e.g. SVC with RBF kernel).
- SVC with a Gaussian RBF kernel and KMeans clustering can benefit a lot from data normalization, either by feature-wise scaling with Scaler or by whitening with RandomizedPCA with whiten=True. Try various values of n_components with grid search to be sure not to truncate the data too aggressively.
- There is no free lunch: the best algorithm is data-dependent. If you try many different models, reserve a held-out evaluation set that is not used during the model selection process.
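The hints for dense data can be combined into one short experiment. The following is a minimal sketch, not a recommended recipe. It assumes a recent scikit-learn release, where `StandardScaler` and `PCA(svd_solver="randomized")` take the place of the `Scaler` and `RandomizedPCA` names used above. The digits dataset, the grid values and the split size are illustrative choices only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Dense, medium-dimensional example data (64 features); an illustrative choice.
X, y = load_digits(return_X_y=True)

# Reserve a held-out set that is never used during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature-wise scaling, then whitening PCA, then an RBF-kernel SVC.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(whiten=True, svd_solver="randomized", random_state=0)),
    ("svc", SVC(kernel="rbf")),
])

# Search over n_components so the data is not truncated too aggressively.
# The candidate values below are assumptions made for this sketch.
param_grid = {
    "pca__n_components": [8, 16, 32],
    "svc__C": [1, 10],
}

# Cross-validation runs on the training portion only.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

# Score the selected model once on the held-out set.
test_score = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out accuracy:", test_score)
```

Placing the scaler and PCA inside the pipeline means they are refit on each cross-validation fold, so no information from the validation folds or from the held-out set leaks into the preprocessing.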
A comprehensive practical guide / FAQ / Howto is in progress. Stay tuned!