class: center, middle

# Introduction to scikit-learn
## Predictive modeling in Python

Olivier Grisel

.affiliations[
  ![Inria](images/inria-logo.png)
  ![scikit-learn](images/scikit-learn-logo.png)
]

Slides: [ogrisel.github.io/decks/2017_intro_sklearn](http://ogrisel.github.io/decks/2017_intro_sklearn)

---
# Agenda

.middlebelowheader[

### Machine Learning refresher

### Scikit-learn

### Where do predictive models fit?

]

---
# Predictive Modeling 101

.middlebelowheader[

### Predict the outcome of repeated events

### Extract the structure of historical records

### Statistical tools to summarize the training data into an executable model

### Alternative to hard-coded rules written by experts

]
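
---
# Sketch: expert rule vs. learned model

To make the contrast above concrete, here is a minimal sketch. The housing records are synthetic, and `expert_rule` and all coefficients are hypothetical values chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "historical records" (assumption): surface in m2, rooms -> price.
rng = np.random.RandomState(0)
surface = rng.uniform(20, 150, size=200)
rooms = rng.randint(1, 6, size=200)
X = np.column_stack([surface, rooms])
y = 3000 * surface + 10000 * rooms + rng.normal(scale=5000, size=200)


def expert_rule(surface, rooms):
    # Hypothetical hard-coded rule an expert could write by hand.
    return 2500 * surface + 20000 * rooms


# The statistical alternative: summarize the records into a fitted model.
model = LinearRegression().fit(X, y)

rule_price = expert_rule(80.0, 3)
model_price = model.predict(np.array([[80.0, 3]]))[0]
print("expert rule:", rule_price, "learned model:", round(model_price))
```

The fitted model recovers the price per square metre from the data instead of relying on a hand-tuned constant.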

---
background-image: url(images/real-estate-1.png)
background-size: contain

---
background-image: url(images/real-estate-2.png)
background-size: contain

---
background-image: url(images/real-estate-3.png)
background-size: contain

---
background-image: url(images/real-estate-4.png)
background-size: contain

---
background-image: url(images/01-predictive-modeling-flow.png)
background-size: contain

---
background-image: url(images/02-predictive-modeling-flow.png)
background-size: contain

---
background-image: url(images/03-small-data-xl.png)
background-size: contain

---
background-image: url(images/04-small-medium-python.png)
background-size: contain

---
background-image: url(images/05-small-medium-python-pandas.png)
background-size: contain

---
background-image: url(images/06-small-medium-pandas-sklearn.png)
background-size: contain

---
background-image: url(images/07-big-hive.png)
background-size: contain

---
background-image: url(images/08-big-hive-python.png)
background-size: contain

---
background-image: url(images/09-big-redshift-python.png)
background-size: contain

---
background-image: url(images/10-big-spark-spark.png)
background-size: contain

---
background-image: url(images/predictive-modeling-examples.png)
background-size: contain

---
class: center, middle, bgheader
background-color: rgb(30, 100, 0)

# Scikit-learn

---
.center[
![scikit-learn](images/scikit-learn-logo.png)
]

.middlebelowheader[

### Library of Machine Learning algorithms

### Open Source project

### Python / NumPy / SciPy / Cython

### Simple **fit** / **predict** / **transform** API

### Model Assessment, Selection, Ensembles

]
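
---
# Sketch: fit / transform / predict

A minimal sketch of the three API verbs and of model assessment. The iris dataset, the `StandardScaler` + `SVC` combination and the 5-fold setting are assumptions picked for brevity, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit: learn the per-feature mean and standard deviation on training data.
scaler = StandardScaler().fit(X_train)
# transform: apply the learned statistics to any dataset.
X_train_scaled = scaler.transform(X_train)

# fit + predict on a classifier.
clf = SVC().fit(X_train_scaled, y_train)
y_predicted = clf.predict(scaler.transform(X_test))

# Model assessment: chain both steps and cross-validate the whole pipeline.
pipeline = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```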

---
background-image: url(images/sklearn-flow-1.png)
background-size: contain

---
background-image: url(images/sklearn-flow-2.png)
background-size: contain

---
background-image: url(images/sklearn-flow-3.png)
background-size: contain

---
# Support Vector Machine

.middlebelowheader.medium[
```python
from sklearn.svm import SVC

model = SVC(kernel="rbf", C=1.0, gamma=1e-4)

model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)
```
]

---
# Linear Classifier

.middlebelowheader.medium[
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1, penalty='l1', solver='liblinear')

model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)
```
]

---
# Random Forest

.middlebelowheader.medium[
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)

model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)
```
]
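
---
# Sketch: same API, three models

The three previous slides assume that `X_train`, `y_train`, `X_test` and `y_test` already exist. A self-contained sketch, assuming a synthetic binary problem from `make_classification` and illustrative hyper-parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (assumption): 500 samples, 20 features, 2 classes.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "svm": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "logistic": LogisticRegression(C=1.0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Every estimator exposes the same fit / predict interface.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

print(scores)
```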

---
background-image: url(images/classifier_comparison.png)
background-size: contain

---
background-image: url(images/scikit-learn.org.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(images/forest-9026372290_ffed331779_k.jpg)
background-size: cover

# Where do predictive models fit?

---
background-image: url(images/11-big-data-archictecture.png)
background-size: contain

---
background-image: url(images/12-big-data-tools-mix-1.png)
background-size: contain

---
background-image: url(images/13-big-data-tools-mix-2.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(images/racks-8533890844_02fa24474d_o.jpg)
background-size: cover

# Scaling predictive modeling

---
# The need for scaling out in ML

.middlebelowheader[

### I/O intensive feature engineering and model scoring

#### Loading, filtering, joining, aggregating: SQL-land

#### Click log ⇒ Session features ⇒ User activity features

### CPU intensive model fitting

#### Hyper-parameter search and cross-validation

#### Gradient Boosting, Random Forests, Large Neural Networks

]
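
---
# Sketch: CPU-bound model selection

Hyper-parameter search multiplies the number of model fits, which is why it is CPU intensive. A minimal sketch, assuming the digits dataset and a small hypothetical grid; `n_jobs=-1` asks scikit-learn to use all local cores.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Hypothetical grid: 3 x 3 = 9 candidate settings.
param_grid = {"C": [0.1, 1, 10], "gamma": [1e-4, 1e-3, 1e-2]}

# Each candidate is fitted once per cross-validation fold, in parallel.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)

n_fits = len(search.cv_results_["params"]) * search.n_splits_
print(n_fits, search.best_params_, search.best_score_)
```

The fits are independent of each other, so this workload parallelizes well across cores or machines.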

---
# Limitations of PySpark

.middlebelowheader[

### Python driver -> Scala / JVM -> Python worker

#### Latency induced by the networked architecture

#### Complex tracebacks / errors for non-Scala developers

### No pure Python local mode for PySpark

#### Impossible to use profiler or ipdb for inner calls

]

---
# dask

.middlebelowheader[

- Collection API similar to NumPy array and Pandas DataFrame objects

- Custom workloads via the task scheduling API

- Pure Python and low overhead

- **Scales up**: runs on clusters with 1000s of cores

- **Scales down**: runs on a laptop in a single process

- Experimental integration with sklearn: [dask-ml](http://dask-ml.readthedocs.io)

]
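
---
# Sketch: dask collections and custom tasks

A minimal sketch of the first two bullets of the previous slide, run on a single machine. The array shape and the `square` / `total` helpers are hypothetical; `scheduler="synchronous"` keeps execution in the current thread.

```python
import dask
import dask.array as da

# Collection API: NumPy-like, but lazy and split into chunks.
x = da.ones((4000, 1000), chunks=(1000, 1000))
column_means = x.mean(axis=0)      # only builds a task graph
result = column_means.compute()    # executes the graph


# Custom workload: wrap plain Python functions as tasks.
@dask.delayed
def square(value):
    return value * value


@dask.delayed
def total(values):
    return sum(values)


graph = total([square(i) for i in range(10)])
answer = graph.compute(scheduler="synchronous")
print(result.shape, answer)
```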

---
class: singleimg
# dask bags & compute graphs

```python
>>> import json
>>> import dask.bag as db
>>> b = (db.read_text('s3://githubarchive-data/2015-01-01-*.json.gz')
...        .map(json.loads)                             # parse each line
...        .filter(lambda d: d['type'] == 'PushEvent')  # keep push events
...        .count())                                    # lazy until .compute()
```

![dask embarrassing](images/dask-bag-embarassing.png)

---
class: singleimg
# dask arrays & compute graphs

```python
>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)
```

![dask svd](images/dask-svd.png)

---
# Demo

.middlebelowheader[

https://github.com/ogrisel/docker-distributed

Have a look at the notebooks in the `examples` folder.

.center[

https://youtu.be/6mKNSEQ0FIQ - [Local video](videos/distributed_demo_with_dask.ogv)
]
]

---
# Dask + Distributed limitations

.middlebelowheader[

### Younger project (under very active development)

### `dask.dataframe` ⊂ `pandas.DataFrame` or PySpark

#### Simple design: e.g. no predicate push-down in dataframe

#### dask-scheduler is a single point of failure

#### Not meant for multi-tenancy (yet?)

#### Scikit-learn integration: experimental
]

---
class: center, middle, bgheader
background-color: rgb(30, 100, 0)

# Conclusion

---
class: middle, center, singleimg

![Changing stuff and see what happens](images/changing-stuff-models.png)

---
# Secrets of the success of Python (& R) in Data Science

.middlebelowheader[

### Iterative exploration with built-in plotting tools

### Low latency of single host in-memory computing

### Easy to install, easy to teach: no sysadmin required

### Rich ecosystem of libraries

]

---
# Conclusion

.middlebelowheader[

### Scikit-learn is a versatile ML toolkit

### with NumPy and pandas for feature engineering

### with Jupyter and matplotlib for interactive data exploration

### PyData is moving towards big data / distributed compute (e.g. dask and xarray)

]

---
class: middle

# Thank you for your attention!

- https://scikit-learn.org

- Slides: [ogrisel.github.io/decks/2017_intro_sklearn](http://ogrisel.github.io/decks/2017_intro_sklearn)

- @ogrisel on Twitter

---
class: center, middle, bgheader
background-image: url(images/floor-2249694239_c27a3ea043_o.jpg)
background-size: cover

# Scaling feature engineering with MPP

---
class: bunchoflogos
# Massively Parallel Processing

.middlebelowheader[

![AWS Redshift](images/redshift-logo.png)
![Google BigQuery](images/bigquery-logo.png)

![Apache Impala](images/impala-logo.png)
![Citus](images/citusdata-logo.png)
![Presto](images/prestodb-logo.png)

]

---
# Problem with the use of SQL in MPP

.middlebelowheader[

## Great for ad-hoc queries but no plotting

## SQL strings in Python code is sad :(

### Not easy to write tests / set up CI

## Standard ORM not a solution

]

---
# MPP in Python

.middlebelowheader[

### `blaze` & `ibis` provide a dataframe-like Python API

### Under the hood it can generate SQL for MPP engines

### `blaze` also targets non-SQL backends (pandas, MongoDB, PySpark...)

]
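
---
# Sketch: Python expression ⇒ SQL

A toy illustration of the idea of generating SQL from a Python API. The `Table` class and its `group_min_max` method are hypothetical and are not the `blaze` or `ibis` API; SQLite stands in for an MPP engine and the four rows are made up.

```python
import sqlite3


class Table:
    """Hypothetical stand-in for a dataframe-like expression object."""

    def __init__(self, name):
        self.name = name

    def group_min_max(self, by, column):
        # Build SQL text instead of computing anything locally.
        # Identifiers are interpolated as-is: fine for a toy, unsafe
        # for untrusted input.
        return (
            f"SELECT {by}, MIN({column}) AS smallest, "
            f"MAX({column}) AS largest "
            f"FROM {self.name} GROUP BY {by} ORDER BY {by}"
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (species TEXT, petal_length REAL)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [("setosa", 1.0), ("setosa", 1.9),
     ("virginica", 4.5), ("virginica", 6.9)],
)

query = Table("iris").group_min_max("species", "petal_length")
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

The Python side can be unit tested by asserting on the generated query, while the database engine does the aggregation.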

---
class: middle, mmedium

```python
>>> import blaze as bz
>>> iris = bz.Data('postgresql://localhost::iris')
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa

>>> bz.by(iris.species, smallest=iris.petal_length.min(),
...                      largest=iris.petal_length.max())
           species  largest  smallest
0      Iris-setosa      1.9       1.0
1  Iris-versicolor      5.1       3.0
2   Iris-virginica      6.9       4.5
```

---
Background image credits

- https://www.flickr.com/photos/jemimus/8533890844/

- https://www.flickr.com/photos/antcaz/2249694239/

- https://www.flickr.com/photos/benjamine-s/14004414605

- https://www.flickr.com/photos/a-herzog/9026372290