6.14.5. scikits.learn.feature_extraction.text.CountVectorizer

CountVectorizer(analyzer=WordNGramAnalyzer(max_n=1, min_n=1, charset='utf-8',
stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom',...'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']),
preprocessor=RomanPreprocessor()), vocabulary={}, max_df=1.0, max_features=None, dtype=<type 'long'>)

Convert a collection of raw documents to a matrix of token counts.

This implementation produces a dense representation of the counts using a NumPy array.

If you do not provide an a priori dictionary and do not use an analyzer that performs some kind of feature selection, the number of features (the vocabulary size found by analysing the data) might be very large, and the count vectors might not fit in memory.

In this case, it is recommended to use either the sparse.CountVectorizer variant of this class or a HashingVectorizer, which reduces the dimensionality to an arbitrary number by using random projection.
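
As a rough illustration of why a dense count matrix can outgrow memory, the sketch below estimates the size of a dense [n_samples, n_features] array. The helper name `dense_matrix_bytes` and the corpus figures are hypothetical, and an item size of 8 bytes per count is an assumption rather than something stated by this class.

```python
def dense_matrix_bytes(n_samples, n_features, itemsize=8):
    """Estimate the bytes needed by a dense [n_samples, n_features] array.

    itemsize=8 is an assumed size for one stored count.
    """
    return n_samples * n_features * itemsize


# Hypothetical corpus: 20,000 documents and a vocabulary of 100,000 tokens.
size = dense_matrix_bytes(20000, 100000)
print("%.1f GB" % (size / 1e9))  # prints "16.0 GB"
```

Most entries of such a matrix are zero, which is why a sparse variant or a reduced feature space is the usual way out.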

Parameters :

analyzer: WordNGramAnalyzer or CharNGramAnalyzer, optional :

vocabulary: dict, optional :

A dictionary whose keys are tokens and whose values are column indices in the matrix. Use it to fix the vocabulary in advance.

dtype: type, optional :

Type of the matrix returned by fit_transform() or transform().
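
The effect of a fixed vocabulary can be sketched in plain Python, independently of this class. The function `count_with_vocabulary` is a hypothetical helper, the regular-expression tokenizer stands in for a real analyzer, and skipping tokens that are absent from the vocabulary is an assumption made for the illustration.

```python
import re


def count_with_vocabulary(documents, vocabulary):
    """Return dense count rows using a fixed token -> column index mapping.

    Assumes the indices in `vocabulary` are exactly 0 .. len(vocabulary) - 1.
    """
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        # Simplified tokenizer: lowercase, then runs of word characters.
        for token in re.findall(r"\w+", doc.lower()):
            index = vocabulary.get(token)
            if index is not None:  # assumption: unknown tokens are skipped
                row[index] += 1
        rows.append(row)
    return rows


vocabulary = {"cat": 0, "dog": 1, "fish": 2}
counts = count_with_vocabulary(["The cat saw the dog", "Dog, dog, fish!"], vocabulary)
print(counts)  # [[1, 1, 0], [0, 2, 1]]
```

Because the mapping is fixed, the number and order of columns stay the same no matter which documents are counted.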

Methods

fit
fit_transform
transform

CountVectorizer.fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters :

raw_documents: iterable :

An iterable that yields str, unicode, or file objects.

Returns :

self :
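
Learning a vocabulary amounts to assigning an index to each distinct token. The following is a minimal sketch of that idea, not the library's implementation: `learn_vocabulary` is a hypothetical name, the tokenizer is a simplified stand-in for an analyzer, and numbering tokens in order of first appearance is an assumption.

```python
import re


def learn_vocabulary(raw_documents):
    """Map each distinct token to an index, in order of first appearance."""
    vocabulary = {}
    for doc in raw_documents:
        # Simplified tokenizer: lowercase, then runs of word characters.
        for token in re.findall(r"\w+", doc.lower()):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


vocab = learn_vocabulary(["red fish", "blue fish"])
print(vocab)  # {'red': 0, 'fish': 1, 'blue': 2}
```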

CountVectorizer.fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return the count vectors.

This is more efficient than calling fit followed by transform.

Parameters :

raw_documents: iterable :

An iterable that yields str, unicode, or file objects.

Returns :

vectors: array, [n_samples, n_features] :
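
One plausible reason a combined call is cheaper is that each document only needs to be tokenized once. The sketch below illustrates that idea under stated assumptions: `fit_transform_sketch` is a hypothetical function, the tokenizer is simplified, and plain lists stand in for the returned array.

```python
import re
from collections import Counter


def fit_transform_sketch(raw_documents):
    """Learn a vocabulary and build dense count rows, tokenizing each document once."""
    vocabulary = {}
    per_doc_counts = []
    for doc in raw_documents:
        # Tokenize once and keep the per-document counts for the second step.
        counts = Counter(re.findall(r"\w+", doc.lower()))
        for token in counts:
            vocabulary.setdefault(token, len(vocabulary))
        per_doc_counts.append(counts)

    # The vocabulary size is only known now, so rows are filled afterwards.
    n_features = len(vocabulary)
    vectors = []
    for counts in per_doc_counts:
        row = [0] * n_features
        for token, n in counts.items():
            row[vocabulary[token]] = n
        vectors.append(row)
    return vocabulary, vectors


vocab, vectors = fit_transform_sketch(["red fish", "blue fish fish"])
print(vectors[1][vocab["fish"]])  # 2
```

Calling a separate fit and then transform on the same data would repeat the tokenization step for every document.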

CountVectorizer.transform(raw_documents)

Extract token counts out of raw text documents.

Parameters :

raw_documents: iterable :

An iterable that yields str, unicode, or file objects.

Returns :

vectors: array, [n_samples, n_features] :
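
Since raw_documents may mix strings and file objects, an input-normalization step of the following kind is one way to handle both. This is a sketch only: `read_document` is a hypothetical helper, and detecting file objects through a `read` attribute is an assumption, not a description of how this class treats its input.

```python
import io


def read_document(doc):
    """Return the text of a document given as a string or a file-like object."""
    if hasattr(doc, "read"):  # assumption: anything with .read() is a file object
        return doc.read()
    return doc


docs = ["plain string", io.StringIO("text from a file object")]
texts = [read_document(d) for d in docs]
print(texts)  # ['plain string', 'text from a file object']
```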