6.14.8.2. scikits.learn.feature_extraction.text.sparse.CountVectorizer

CountVectorizer(analyzer=WordNGramAnalyzer(max_n=1, min_n=1, charset='utf-8',
stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom',...'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']),
preprocessor=RomanPreprocessor()), vocabulary={}, max_df=1.0, max_features=None, dtype=<type 'long'>)

Convert a collection of raw documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.
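
As a rough illustration of the coordinate (COO) layout mentioned above, the sketch below stores only the nonzero counts as (data, row, col) triplets. It is plain Python 3 and not the library's code; the helper name dense_to_coo and the sample counts are hypothetical.

```python
def dense_to_coo(dense_rows):
    """Return (data, rows, cols) triplets for the nonzero entries.

    Hypothetical helper for illustration only; not part of scikits.learn.
    """
    data, rows, cols = [], [], []
    for i, row in enumerate(dense_rows):
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the count itself
                rows.append(i)       # document (sample) index
                cols.append(j)       # token (feature) index
    return data, rows, cols


# Two documents, four vocabulary entries; most entries are zero.
counts = [
    [2, 0, 0, 1],
    [0, 0, 3, 0],
]
data, rows, cols = dense_to_coo(counts)
```

Triplets of this shape are what scipy.sparse.coo_matrix accepts in its (data, (row, col)) constructor form, so only three short lists are needed instead of the full matrix.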

Parameters :

analyzer: WordNGramAnalyzer or CharNGramAnalyzer, optional :

vocabulary: dict, optional :

A dictionary mapping tokens (keys) to their indices in the matrix (values). Supply it to fix the vocabulary in advance.

dtype: type, optional :

Type of the matrix returned by fit_transform() or transform().
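
To make the vocabulary and dtype parameters concrete, here is a minimal Python 3 sketch of counting against a fixed token-to-index dictionary. The function count_tokens is hypothetical, and skipping tokens that are absent from the vocabulary is an assumption of this sketch, not documented behavior.

```python
def count_tokens(tokens, vocabulary, dtype=int):
    """Count tokens into a row whose positions are given by `vocabulary`.

    Hypothetical helper; tokens missing from the vocabulary are skipped
    (an assumption made for this sketch).
    """
    row = [dtype(0)] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            row[index] += dtype(1)
    return row


# A vocabulary fixed in advance: token -> index in the matrix.
vocabulary = {"cat": 0, "dog": 1, "fish": 2}
tokens = "the cat saw the other cat and a dog".split()

row = count_tokens(tokens, vocabulary)
float_row = count_tokens(tokens, vocabulary, dtype=float)
```

Because the indices are decided up front, rows produced from different document collections line up position by position.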

Methods

fit
fit_transform
transform

CountVectorizer.fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters :

raw_documents: iterable :

An iterable that yields str, unicode, or file objects.

Returns :

self :
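
The idea of learning a vocabulary can be sketched as follows. This is a simplified Python 3 illustration, not the library's implementation: it assumes lowercasing and whitespace tokenization in place of the configured analyzer, and assigns indices in order of first appearance. The name learn_vocabulary is hypothetical.

```python
def learn_vocabulary(raw_documents):
    """Map each distinct token to the next free index.

    Hypothetical sketch: lowercasing plus str.split() stands in for the
    real analyzer (preprocessing, stop words, n-grams).
    """
    vocabulary = {}
    for document in raw_documents:
        for token in document.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


docs = ["The cat sat", "the dog sat down"]
vocabulary = learn_vocabulary(docs)
```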

CountVectorizer.fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return the count vectors.

This is more efficient than calling fit followed by transform.

Parameters :

raw_documents: iterable :

An iterable that yields str, unicode, or file objects.

Returns :

vectors: array, [n_samples, n_features] :
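
One plausible reason a combined call is cheaper is that each document only needs to be tokenized once. The Python 3 sketch below builds the vocabulary and the COO triplets in a single pass; it is an assumption-laden illustration (whitespace tokenization, hypothetical name fit_transform_sketch), not the library's code.

```python
from collections import Counter


def fit_transform_sketch(raw_documents):
    """Build the vocabulary and the count triplets in one pass.

    Hypothetical sketch; str.split() stands in for the real analyzer.
    """
    vocabulary = {}
    data, rows, cols = [], [], []
    n_samples = 0
    for i, document in enumerate(raw_documents):
        n_samples = i + 1
        # Tokenize once, then reuse the counts for both tasks.
        for token, count in Counter(document.lower().split()).items():
            j = vocabulary.setdefault(token, len(vocabulary))
            data.append(count)
            rows.append(i)
            cols.append(j)
    shape = (n_samples, len(vocabulary))
    return vocabulary, (data, rows, cols), shape


# A generator can only be consumed once, which a single pass tolerates.
docs = (d for d in ["the cat saw the dog", "the dog sat down"])
vocabulary, (data, rows, cols), shape = fit_transform_sketch(docs)
```

A separate fit followed by transform would have to iterate over the documents twice, which a one-shot iterable such as the generator above does not allow.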

CountVectorizer.transform(raw_documents)

Extract token counts out of raw text documents.

Parameters :

raw_documents: iterable :

An iterable that yields str, unicode, or file objects.

Returns :

vectors: array, [n_samples, n_features] :
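
Since raw_documents may yield either strings or file objects, the input has to be normalized to text before tokenizing. The following Python 3 sketch shows one way this could be done; read_document is a hypothetical helper, and checking for a read method is an assumption of this sketch rather than the library's documented approach.

```python
import io


def read_document(document):
    """Return the text of a document given as a string or a file object.

    Hypothetical helper: anything with a .read() method is treated as a file.
    """
    if hasattr(document, "read"):
        return document.read()
    return document


raw_documents = [
    "plain string document",
    io.StringIO("document held in a file object"),
]
texts = [read_document(d) for d in raw_documents]
```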