============================================================
Unsupervised learning: seeking representations of the data
============================================================

Clustering: grouping observations together
============================================

.. topic:: The problem solved in clustering

    Given the iris dataset, if we knew that there were 3 types of iris, but
    did not have access to a taxonomist to label them, we could try a
    **clustering task**: split the observations into well-separated groups
    called *clusters*.

..
    >>> # Set the PRNG
    >>> import numpy as np
    >>> np.random.seed(1)
    >>> # Import pylab
    >>> import pylab as pl

K-means clustering
-------------------

Note that there exist many different clustering criteria and associated
algorithms. The simplest clustering algorithm is k-means.

.. image:: k_means_iris_3.png
   :scale: 70
   :align: right

::

    >>> from scikits.learn import cluster, datasets
    >>> iris = datasets.load_iris()
    >>> X_iris = iris.data
    >>> y_iris = iris.target

    >>> k_means = cluster.KMeans(k=3)
    >>> k_means.fit(X_iris) # doctest: +ELLIPSIS
    KMeans(copy_x=True, init='k-means++', k=3, max_iter=300,...
    >>> print k_means.labels_[::10]
    [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
    >>> print y_iris[::10]
    [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

.. |k_means_iris_bad_init| image:: k_means_iris_bad_init.png
   :scale: 63

.. |k_means_iris_8| image:: k_means_iris_8.png
   :scale: 63

.. |cluster_iris_truth| image:: cluster_iris_truth.png
   :scale: 63

.. warning::

    There is absolutely no guarantee of recovering a ground truth. First,
    choosing the right number of clusters is hard. Second, the algorithm is
    sensitive to initialization and can fall into local minima, although
    `scikits.learn` employs several tricks to mitigate this issue.

    .. list-table::
        :class: centered

        * - |k_means_iris_bad_init|
          - |k_means_iris_8|
          - |cluster_iris_truth|

        * - **Bad initialization**
          - **8 clusters**
          - **Ground truth**

    **Don't over-interpret clustering results**

.. |lena| image:: lena.png
   :scale: 30

.. |lena_regular| image:: lena_regular.png
   :scale: 30

.. |lena_compressed| image:: lena_compressed.png
   :scale: 30

.. |lena_histogram| image:: lena_histogram.png
   :scale: 40

.. topic:: **Application example: vector quantization**

    Clustering in general, and KMeans in particular, can be seen as a way of
    choosing a small number of exemplars to compress the information, a
    problem sometimes known as vector quantization. For instance, this can
    be used to posterize an image::

        >>> import scipy as sp
        >>> lena = sp.lena()
        >>> X = lena.reshape((-1, 1)) # We need an (n_sample, n_feature) array
        >>> k_means = cluster.KMeans(k=5, n_init=1)
        >>> k_means.fit(X) # doctest: +ELLIPSIS
        KMeans(copy_x=True, init='k-means++', k=5, ...
        >>> values = k_means.cluster_centers_.squeeze()
        >>> labels = k_means.labels_
        >>> lena_compressed = np.choose(labels, values)
        >>> lena_compressed.shape = lena.shape

    .. list-table::
        :class: centered

        * - |lena|
          - |lena_compressed|
          - |lena_regular|
          - |lena_histogram|

        * - Raw image
          - K-means quantization
          - Equal bins
          - Image histogram

Hierarchical clustering: Ward
------------------------------

For estimating a large number of clusters, top-down approaches are both
statistically ill-posed and slow. Hierarchical clustering is a bottom-up
approach that successively merges observations together, and is particularly
useful when the clusters of interest are made of only a few observations.
*Ward* clustering minimizes a criterion similar to k-means in a bottom-up
approach. When the number of clusters is large, it is much more
computationally efficient than k-means.

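
As a minimal sketch (not part of the original doctests), Ward can be run on
the iris data much like k-means above; this assumes that `cluster.Ward` also
accepts its default of no connectivity graph, in which case it builds the
full merge tree before cutting it into `n_clusters` groups::

    >>> import numpy as np
    >>> from scikits.learn import cluster, datasets
    >>> X_iris = datasets.load_iris().data
    >>> ward = cluster.Ward(n_clusters=3)  # assumption: no connectivity given
    >>> labels = ward.fit(X_iris).labels_  # fit returns the estimator itself
    >>> np.unique(labels).size             # one label value per cluster
    3
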

Connectivity-constrained clustering
.....................................

With Ward clustering, it is possible to specify which samples can be
clustered together by giving a connectivity graph. Graphs in the scikit are
represented by their adjacency matrix, often as a sparse matrix. This can be
useful, for instance, to retrieve connected regions when clustering an image:

.. image:: lena_ward.png
   :scale: 70
   :align: right

::

    >>> # Downsample the image by a factor of 4
    >>> lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
    >>> X = np.reshape(lena, (-1, 1))

    >>> # the structure of the data: pixels connected to their neighbors
    >>> from scikits.learn.feature_extraction.image import grid_to_graph
    >>> connectivity = grid_to_graph(*lena.shape)

    >>> ward = cluster.Ward(n_clusters=30, connectivity=connectivity)
    >>> ward.fit(X) # doctest: +ELLIPSIS
    Ward(connectivity=...
    >>> labels = np.reshape(ward.labels_, lena.shape)

.. To generate the image

    >>> pl.imsave('lena_ward.png', labels)

Feature agglomeration
......................

We have seen that sparsity could be used to mitigate the curse of
dimensionality, *i.e.* the insufficiency of observations compared to the
number of features. Another approach is to merge similar features together:
**feature agglomeration**. This approach can be implemented by clustering in
the feature direction, in other words clustering the transposed data.

.. image:: digits_agglo.png
   :align: right
   :scale: 57

::

    >>> digits = datasets.load_digits()
    >>> images = digits.images
    >>> X = np.reshape(images, (len(images), -1))
    >>> connectivity = grid_to_graph(*images[0].shape)

    >>> agglo = cluster.WardAgglomeration(connectivity=connectivity,
    ...                                   n_clusters=32)
    >>> agglo.fit(X) # doctest: +ELLIPSIS
    WardAgglomeration(connectivity=...
    >>> X_reduced = agglo.transform(X)

    >>> X_approx = agglo.inverse_transform(X_reduced)
    >>> images_approx = np.reshape(X_approx, images.shape)

.. topic:: `transform` and `inverse_transform` methods

    Some estimators expose a `transform` method, for instance to reduce the
    dimensionality of the dataset.

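
    For instance, continuing the agglomeration example above (a small check
    added here, assuming the standard digits dataset shipped with the
    scikit: 1797 images of 8x8 pixels), `transform` and `inverse_transform`
    only change the number of features::

        >>> X.shape          # 64 pixel features per digit image
        (1797, 64)
        >>> X_reduced.shape  # one feature per cluster of pixels
        (1797, 32)
        >>> X_approx.shape   # back in the original feature space
        (1797, 64)
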

Decompositions: from a signal to components and loadings
===========================================================

.. topic:: **Components and loadings**

    If X is our multivariate data, the problem that we are trying to solve
    is to rewrite it on a different observational basis: we want to learn
    loadings L and a set of components C such that *X = L C*. Different
    criteria exist to choose the components.

Principal component analysis: PCA
-----------------------------------

Principal component analysis (PCA) selects the successive components that
explain the maximum variance in the signal.

.. |pca_3d_axis| image:: pca_3d_axis.jpg
   :scale: 70

.. |pca_3d_aligned| image:: pca_3d_aligned.jpg
   :scale: 70

.. rst-class:: centered

    |pca_3d_axis| |pca_3d_aligned|

The point cloud spanned by the observations above is very flat in one
direction: one of the three univariate features can almost be exactly
computed using the other two. PCA finds the directions in which the data is
not *flat*.

When used to *transform* data, PCA can reduce the dimensionality of the data
by projecting on a principal subspace.

..
    np.random.seed(0)

::

    >>> # Create a signal with only 2 useful dimensions
    >>> x1 = np.random.normal(size=100)
    >>> x2 = np.random.normal(size=100)
    >>> x3 = x1 + x2
    >>> X = np.c_[x1, x2, x3]

    >>> from scikits.learn import decomposition
    >>> pca = decomposition.PCA()
    >>> pca.fit(X)
    PCA(copy=True, n_components=None, whiten=False)
    >>> print pca.explained_variance_ # doctest: +ELLIPSIS
    [ 2.46782...e+00 8.054...e-01 9.7...e-33]

    >>> # As we can see, only the 2 first components are useful
    >>> pca.n_components = 2
    >>> X_reduced = pca.fit_transform(X)
    >>> X_reduced.shape
    (100, 2)

.. Eigenfaces here?

Independent Component Analysis: ICA
-------------------------------------

ICA selects components so that the distribution of their loadings carries a
maximum amount of independent information. It is able to recover
**non-Gaussian** independent signals:

.. image:: plot_ica_blind_source_separation_1.png
   :scale: 70
   :align: center

..
    np.random.seed(0)

::

    >>> # Generate sample data
    >>> time = np.linspace(0, 10, 2000)
    >>> s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal
    >>> s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal
    >>> S = np.c_[s1, s2]
    >>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise
    >>> S /= S.std(axis=0)  # Standardize data

    >>> # Mix data
    >>> A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
    >>> X = np.dot(S, A.T)  # Generate observations

    >>> # Compute ICA
    >>> ica = decomposition.FastICA()
    >>> S_ = ica.fit(X).transform(X)  # Get the estimated sources
    >>> A_ = ica.get_mixing_matrix()  # Get estimated mixing matrix
    >>> np.allclose(X, np.dot(S_, A_.T))
    True
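
As a small addition to the example above, the recovered sources and mixing
matrix can be inspected directly (the shapes follow from the 2000 time
samples and the 2 mixed signals), and the estimates can be plotted with the
pylab session imported at the top of this section; the plotting calls are
skipped when run as doctests::

    >>> S_.shape           # one column per estimated source
    (2000, 2)
    >>> A_.shape           # estimated 2x2 mixing matrix
    (2, 2)
    >>> pl.plot(time, S_)  # doctest: +SKIP
    >>> pl.show()          # doctest: +SKIP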