class: center, middle

# Introduction to scikit-learn

## Predictive modeling in Python

Olivier Grisel

.affiliations[
  ![Inria](images/inria-logo.png)
  ![scikit-learn](images/scikit-learn-logo.png)
]

Slides: [ogrisel.github.io/decks/2017_intro_sklearn](http://ogrisel.github.io/decks/2017_intro_sklearn)

---
# Agenda

.middlebelowheader[

### Machine Learning refresher

### Scikit-learn

### Where do predictive models fit?

]

---
# Predictive Modeling 101

.middlebelowheader[

### Predict the outcome of repeated events

### Extract the structure of historical records

### Statistical tools to summarize the training data into an executable model

### Alternative to hard-coded rules written by experts

]

---
background-image: url(images/real-estate-1.png)
background-size: contain

---
background-image: url(images/real-estate-2.png)
background-size: contain

---
background-image: url(images/real-estate-3.png)
background-size: contain

---
background-image: url(images/real-estate-4.png)
background-size: contain

---
background-image: url(images/01-predictive-modeling-flow.png)
background-size: contain

---
background-image: url(images/02-predictive-modeling-flow.png)
background-size: contain

---
background-image: url(images/03-small-data-xl.png)
background-size: contain

---
background-image: url(images/04-small-medium-python.png)
background-size: contain

---
background-image: url(images/05-small-medium-python-pandas.png)
background-size: contain

---
background-image: url(images/06-small-medium-pandas-sklearn.png)
background-size: contain

---
background-image: url(images/07-big-hive.png)
background-size: contain

---
background-image: url(images/08-big-hive-python.png)
background-size: contain

---
background-image: url(images/09-big-redshift-python.png)
background-size: contain

---
background-image: url(images/10-big-spark-spark.png)
background-size: contain

---
background-image: url(images/predictive-modeling-examples.png)
background-size: contain

---
class: center, middle, bgheader
background-color: rgb(30, 100, 0)

# Scikit-learn

---
.center[
  ![scikit-learn](images/scikit-learn-logo.png)
]

.middlebelowheader[

### Library of Machine Learning algorithms

### Open Source project

### Python / NumPy / SciPy / Cython

### Simple **fit** / **predict** / **transform** API

### Model Assessment, Selection, Ensembles

]

---
background-image: url(images/sklearn-flow-1.png)
background-size: contain

---
background-image: url(images/sklearn-flow-2.png)
background-size: contain

---
background-image: url(images/sklearn-flow-3.png)
background-size: contain

---
# Support Vector Machine

.middlebelowheader.medium[
```python
from sklearn.svm import SVC

model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)
```
]

---
# Linear Classifier

.middlebelowheader.medium[
```python
from sklearn.linear_model import LogisticRegression

# The default solver in recent scikit-learn releases (lbfgs) does not
# support the L1 penalty, so a compatible solver is named explicitly.
model = LogisticRegression(C=1, penalty='l1', solver='liblinear')
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)
```
]

---
# Random Forest

.middlebelowheader.medium[
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)
```
]

---
background-image: url(images/classifier_comparison.png)
background-size: contain

---
background-image: url(images/scikit-learn.org.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(images/forest-9026372290_ffed331779_k.jpg)
background-size: cover

# Where do predictive models fit?

---
background-image: url(images/11-big-data-archictecture.png)
background-size: contain

---
background-image: url(images/12-big-data-tools-mix-1.png)
background-size: contain

---
background-image: url(images/13-big-data-tools-mix-2.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(images/racks-8533890844_02fa24474d_o.jpg)
background-size: cover

# Scaling predictive modeling

---
# The need for scaling out in ML

.middlebelowheader[

### I/O intensive feature engineering and model scoring

#### Loading, filtering, joining, aggregating: SQL-land

#### Click log ⇒ Session features ⇒ User activity features

### CPU intensive model fitting

#### Hyper-parameter search and cross-validation

#### Gradient Boosting, Random Forests, Large Neural Networks

]

---
# Limitations of PySpark

.middlebelowheader[

### Python driver -> Scala / JVM -> Python worker

#### Latency induced by the networked architecture

#### Complex tracebacks / errors for non-Scala developers

### No pure Python local mode for PySpark

#### Impossible to use a profiler or ipdb on inner calls

]

---
# dask

.middlebelowheader[

- Collection API similar to NumPy array and Pandas DataFrame objects

- Custom workloads via the task scheduling API

- Pure Python and low overhead

- **Scales up**: runs on clusters with 1000s of cores

- **Scales down**: runs on a laptop in a single process

- Experimental integration with sklearn: [dask-ml](http://dask-ml.readthedocs.io)

]

---
class: singleimg

# dask bags & compute graphs

```python
>>> import json
>>> import dask.bag as db
>>> # filter (not map) keeps only the push events before counting them
>>> b = (db.read_text('s3://githubarchive-data/2015-01-01-*.json.gz')
...        .map(json.loads)
...        .filter(lambda d: d['type'] == 'PushEvent')
...        .count())
```

![dask embarrassing](images/dask-bag-embarassing.png)

---
class: singleimg

# dask arrays & compute graphs

```python
>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)
```

![dask svd](images/dask-svd.png)

---
# Demo

.middlebelowheader[

https://github.com/ogrisel/docker-distributed

Have a look at the notebooks in the `examples` folder.

.center[
https://youtu.be/6mKNSEQ0FIQ

- [Local video](videos/distributed_demo_with_dask.ogv)
]

]

---
# Dask + Distributed limitations

.middlebelowheader[

### Younger project (under very active development)

### `dask.dataframe` ⊂ `pandas.DataFrame` or PySpark

#### Simple design: e.g. no predicate push-down in dataframe

#### dask-scheduler is a single point of failure

#### Not meant for multi-tenancy (yet?)

#### Scikit-learn integration: experimental

]

---
class: center, middle, bgheader
background-color: rgb(30, 100, 0)

# Conclusion

---
class: middle, center, singleimg

![Changing stuff and seeing what happens](images/changing-stuff-models.png)

---
# Secrets of the success of Python (& R) in Data Science

.middlebelowheader[

### Iterative exploration with built-in plotting tools

### Low latency of single host in-memory computing

### Easy to install, easy to teach: no sysadmin required

### Rich ecosystem of libraries

]

---
# Conclusion

.middlebelowheader[

### Scikit-learn is a versatile ML toolkit

### with NumPy and pandas for feature engineering

### with Jupyter and matplotlib for interactive data exploration

### PyData is moving towards big data and large-scale compute (e.g. dask and xarray)

]

---
class: middle

# Thank you for your attention!

- https://scikit-learn.org

- Slides: [ogrisel.github.io/decks/2017_intro_sklearn](http://ogrisel.github.io/decks/2017_intro_sklearn)

- @ogrisel on Twitter

---
class: center, middle, bgheader
background-image: url(images/floor-2249694239_c27a3ea043_o.jpg)
background-size: cover

# Scaling feature engineering with MPP

---
class: bunchoflogos

# Massively Parallel Processing

.middlebelowheader[
![AWS Redshift](images/redshift-logo.png)
![Google BigQuery](images/bigquery-logo.png)
![Apache Impala](images/impala-logo.png)
![Citus](images/citusdata-logo.png)
![Presto](images/prestodb-logo.png)
]

---
# Problem with the use of SQL in MPP

.middlebelowheader[

## Great for ad-hoc queries but no plotting

## SQL strings in Python code are sad :(

### Not easy to write tests or run CI

## Standard ORMs are not a solution

]

---
# MPP in Python

.middlebelowheader[

### `blaze` & `ibis` provide a dataframe-like Python API

### Under the hood they can generate SQL for MPP engines

### `blaze` also targets non-SQL backends (pandas, MongoDB, PySpark...)

]

---
class: middle, mmedium

```python
>>> import blaze as bz
>>> iris = bz.Data('postgresql://localhost::iris')
>>> iris
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa

>>> bz.by(iris.species, smallest=iris.petal_length.min(),
...       largest=iris.petal_length.max())
           species  largest  smallest
0      Iris-setosa      1.9       1.0
1  Iris-versicolor      5.1       3.0
2   Iris-virginica      6.9       4.5
```

---
Background image credits

- https://www.flickr.com/photos/jemimus/8533890844/
- https://www.flickr.com/photos/antcaz/2249694239/
- https://www.flickr.com/photos/benjamine-s/14004414605
- https://www.flickr.com/photos/a-herzog/9026372290
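
---
# Appendix: the fit / predict convention

The classifier slides all follow the same two calls: `fit` on training data, then `predict` on new data. As a rough sketch of that convention only, here is a toy estimator in plain Python. `MajorityClassifier` and the sample data are hypothetical, made up for illustration; they are not part of scikit-learn.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent label."""

    def fit(self, X, y):
        # Learn state from the training data. The trailing underscore marks
        # an attribute estimated during fit, a naming habit borrowed from
        # scikit-learn estimators.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows model.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per input row; the features are ignored here.
        return [self.majority_ for _ in X]


# Made-up data, only to exercise the two calls.
X_train = [[0.1], [0.4], [0.35], [0.8]]
y_train = ["spam", "ham", "ham", "ham"]
X_test = [[0.2], [0.9]]

model = MajorityClassifier().fit(X_train, y_train)
y_predicted = model.predict(X_test)
print(y_predicted)  # ['ham', 'ham']
```

Because every estimator exposes the same two methods, swapping one model for another leaves the surrounding code unchanged.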