
scikits.learn.feature_extraction.text.CountVectorizer

CountVectorizer(analyzer=WordNGramAnalyzer(stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', ..., 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']), max_n=1, token_pattern='\b\w\w+\b', charset='utf-8', min_n=1, preprocessor=RomanPreprocessor()), vocabulary={}, max_df=1.0, max_features=None, dtype=<type 'long'>)

Convert a collection of raw documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.

If you do not provide an a priori dictionary and you do not use an analyzer that performs some kind of feature selection, then the number of features (the vocabulary size found by analyzing the data) might be very large and the count vectors might not fit in memory.

In that case it is recommended to use either the sparse.CountVectorizer variant of this class or a HashingVectorizer, which will reduce the dimensionality to an arbitrary number by using random projection.
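
A minimal usage sketch, assuming the import path shown above and a small in-memory corpus (the documents themselves are illustrative):

>>> from scikits.learn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     "the cat sat on the mat",
...     "the dog sat on the log",
... ]
>>> vectorizer = CountVectorizer()
>>> counts = vectorizer.fit_transform(corpus)  # count matrix of shape [n_samples, n_features]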

Parameters

analyzer : WordNGramAnalyzer or CharNGramAnalyzer, optional

Analyzer object in charge of preprocessing and tokenizing the raw documents.

vocabulary : dict, optional

A dictionary where keys are tokens and values are indices in the matrix.

This is useful in order to fix the vocabulary in advance (see the sketch after this parameter list).

max_df : float in range [0.0, 1.0], optional, 1.0 by default

When building the vocabulary, ignore terms that have a term frequency strictly higher than the given threshold (corpus-specific stop words).

This parameter is ignored if vocabulary is not None.

max_features : int, optional, None by default

If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.

dtype : type, optional

Type of the matrix returned by fit_transform() or transform().
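
A sketch combining the parameters above; the token-to-index mapping and the threshold values are illustrative, not defaults:

>>> from scikits.learn.feature_extraction.text import CountVectorizer

>>> # fix the vocabulary in advance: keys are tokens, values are column indices
>>> fixed = CountVectorizer(vocabulary={'cat': 0, 'dog': 1, 'mat': 2})

>>> # or learn the vocabulary from the data, keeping only the 100 most frequent
>>> # terms and ignoring terms whose term frequency exceeds 0.95
>>> # (corpus-specific stop words); max_df and max_features are ignored
>>> # when a vocabulary is given
>>> capped = CountVectorizer(max_features=100, max_df=0.95)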

Methods

CountVectorizer.fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns

self
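
Since fit returns the vectorizer itself, the call can be chained with transform; a short sketch on a toy corpus:

>>> from scikits.learn.feature_extraction.text import CountVectorizer
>>> docs = ["a few raw documents", "to learn the vocabulary from"]
>>> counts = CountVectorizer().fit(docs).transform(docs)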

CountVectorizer.fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return the count vectors.

This is more efficient than calling fit followed by transform.

Parameters

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns

vectors : array, [n_samples, n_features]

CountVectorizer.transform(raw_documents)

Extract token counts out of raw text documents using the vocabulary learned by fit (or the one provided in the constructor).

Parameters

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns

vectors : array, [n_samples, n_features]
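
A sketch of vectorizing previously unseen documents with an already fitted vectorizer; tokens that were not seen during fit have no column in the learned vocabulary and are therefore not counted:

>>> from scikits.learn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer().fit(["the cat sat", "the dog sat"])
>>> new_counts = vectorizer.transform(["the cat and the fox"])  # 'fox' is ignored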