6.2. Feature extraction

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and images.

6.2.1. Text feature extraction

6.2.1.1. The Bag of Words representation

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

  • tokenizing strings and giving an integer id for each possible token, for instance by using whitespace and punctuation as token separators.
  • counting the occurrences of tokens in each document.
  • normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:

  • each individual token occurrence frequency (normalized or not) is treated as a feature.
  • the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

When combined with TF-IDF normalization, the bag of words encoding is also known as the Vector Space Model.

6.2.1.2. Sparsity

As most documents will typically use only a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size on the order of 100,000 unique words in total, while each individual document will use 100 to 1,000 unique words.

In order to be able to store such a matrix in memory, and also to speed up algebraic matrix / vector operations, implementations will typically use a sparse representation such as those available in the scipy.sparse package.
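
As a small illustration, the sparsity of a vectorized corpus can be measured directly on the scipy.sparse matrix returned by the vectorizers. The following sketch uses a made-up two-document corpus, so the exact numbers are only illustrative:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ['sparse matrices save memory', 'most entries are zeros']
>>> X = CountVectorizer().fit_transform(docs)
>>> float(X.nnz) / (X.shape[0] * X.shape[1])  # fraction of non-zero entries
0.5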

6.2.1.3. Common Vectorizer usage

CountVectorizer implements both tokenization and occurrence counting in a single class:

>>> from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters; however, the default values are quite reasonable (please see the reference documentation for the details):

>>> vectorizer = CountVectorizer()
>>> vectorizer
CountVectorizer(analyzer='word', binary=False, charset='utf-8',
        charset_error='strict', dtype=<type 'long'>, input='content',
        lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
        preprocessor=None, stop_words=None, strip_accents=None,
        token_pattern=u'\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                                       
<4x9 sparse matrix of type '<type 'numpy.int64'>'
    with 19 stored elements in COOrdinate format>

The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.")
[u'this', u'is', u'text', u'document', u'to', u'analyze']

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

>>> vectorizer.get_feature_names()
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']

>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:

>>> vectorizer.vocabulary_.get('document')
1

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

>>> vectorizer.transform(['Something completely new.']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])

Note that in the previous corpus, the first and the last documents have exactly the same words, hence they are encoded as equal vectors. In particular, we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (the words themselves):

>>> bigram_vectorizer = CountVectorizer(min_n=1, max_n=2,
...                                     token_pattern=ur'\b\w+\b')
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!')
[u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']

The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]])

In particular the interrogative form “Is this” is only present in the last document:

>>> feature_index = bigram_vectorizer.vocabulary_.get(u'is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1])

6.2.1.4. TF-IDF normalization

In a large text corpus, some words will be very frequent (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency. This is originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.

This normalization is implemented by the TfidfTransformer class:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> transformer
TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

Again please see the reference documentation for the details on all the parameters.

Let’s take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The two other features are present in less than 50% of the documents, hence probably more representative of the content of the documents:

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf                                  
<6x3 sparse matrix of type '<type 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

>>> tfidf.toarray()                        
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])

Each row is normalized to have unit Euclidean norm. The weights of each feature computed by the fit method call are stored in a model attribute:

>>> transformer.idf_                       
array([ 1. ...,  2.25...,  1.84...])
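
These values are consistent with a smoothed inverse document frequency of the form idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. This formula is an assumption inferred from the default smooth_idf=True; the small check below merely reproduces the numbers shown above:

>>> import numpy as np
>>> n_samples = 6.
>>> df = np.array([6., 1., 2.])  # document frequencies of the three terms in `counts`
>>> np.log((1 + n_samples) / (1 + df)) + 1  # assumed smoothed idf formula
array([ 1. ...,  2.25...,  1.84...])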

As tf–idf is very often used for text features, there is also another class called Vectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:

>>> from sklearn.feature_extraction.text import Vectorizer
>>> vectorizer = Vectorizer()
>>> vectorizer.fit_transform(corpus)
...                                       
<4x9 sparse matrix of type '<type 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse Row format>

While the tf–idf normalization is often very useful, there might be cases where binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
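
For instance, on the toy corpus used above, binary occurrence markers can be obtained as follows (a minimal sketch; note how the repeated word “second” in the second document is now simply marked as present):

>>> binary_vectorizer = CountVectorizer(binary=True)
>>> binary_vectorizer.fit_transform(corpus).toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])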

As usual, the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier.
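
A minimal sketch of such a pipeline follows; the choice of LinearSVC as the classifier and the parameter grid are illustrative assumptions, not a recommendation:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.svm import LinearSVC

>>> pipeline = Pipeline([
...     ('vect', CountVectorizer()),
...     ('tfidf', TfidfTransformer()),
...     ('clf', LinearSVC()),
... ])
>>> parameters = {
...     'vect__max_n': (1, 2),     # unigrams only, or unigrams + bigrams
...     'clf__C': (0.1, 1., 10.),  # regularization strength of the classifier
... }
>>> grid_search = GridSearchCV(pipeline, parameters)

Calling grid_search.fit on a list of documents and their target labels would then select the parameter combination with the best cross-validated score.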

6.2.1.5. Applications and examples

The bag of words representation is quite simplistic but surprisingly useful in practice.

In particular, in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers.

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means.

Finally, it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF).

6.2.1.6. Limitations of the Bag of Words representation

While some local positioning information can be preserved by extracting n-grams instead of individual words, Bag of Words and Bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.

In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as “Structured output” problems, which are currently outside the scope of scikit-learn.

6.2.1.7. Customizing the vectorizer classes

It is possible to customize the behavior by passing callables as parameters of the vectorizer:

>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!")
[u'some...', u'punctuation!']

In particular we name:

  • preprocessor: a callable that takes a string as input and returns another string (removing HTML tags or converting to lower case, for instance).
  • tokenizer: a callable that takes a string as input and outputs a sequence of feature occurrences (a.k.a. the tokens).
  • analyzer: a callable that wraps calls to the preprocessor and tokenizer and further performs some filtering or n-gram extraction on the tokens.

To make the preprocessor, tokenizer and analyzer aware of the model parameters it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead.
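
For instance, a derived vectorizer can extend the default preprocessing while keeping the rest of the behavior. The sketch below is purely illustrative: the class name and the crude HTML tag stripping step are assumptions for the example, not part of scikit-learn:

>>> import re
>>> class HTMLStrippingVectorizer(CountVectorizer):
...     def build_preprocessor(self):
...         # reuse the default preprocessor (lower casing, accent stripping, ...)
...         preprocess = super(HTMLStrippingVectorizer, self).build_preprocessor()
...         # chain a naive tag removal step in front of it (illustrative only)
...         return lambda doc: preprocess(re.sub(r'<[^>]+>', ' ', doc))
...
>>> HTMLStrippingVectorizer().build_analyzer()(u"<p>Some <b>HTML</b> text</p>")
[u'some', u'html', u'text']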

Customizing the vectorizer can be very useful to handle Asian languages that do not use an explicit word separator such as whitespace.

6.2.2. Image feature extraction

6.2.2.1. Patch extraction

The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use reconstruct_from_patches_2d. For example, let us generate a 4x4 pixel picture with 3 color channels (e.g. in RGB format):

>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])

Let us now try to reconstruct the original image from the patches by averaging on overlapping areas:

>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

The PatchExtractor class works in the same way as extract_patches_2d, except that it supports multiple images as input. It is implemented as an estimator, so it can be used in pipelines. See:

>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)