class sklearn.feature_extraction.text.Vectorizer(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

CountVectorizer: Tokenize the documents, count the occurrences of each token, and return the counts as a sparse matrix.
TfidfTransformer: Apply Term Frequency-Inverse Document Frequency normalization to a sparse matrix of occurrence counts.
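
The two stages (counting, then TF-IDF weighting) can be illustrated without the library. The following is a plain-Python sketch, not scikit-learn's implementation. The helper names `count_tokens` and `tfidf_rows` are hypothetical, documents are assumed to be already tokenized, dense lists stand in for the sparse matrix, and the smoothed weighting idf = ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization is an assumption about what `use_idf=True`, `smooth_idf=True` and `norm='l2'` select; this page does not state the formula.

```python
import math
from collections import Counter


def count_tokens(tokenized_docs):
    """Stage 1 (counting): build a vocabulary and per-document term counts."""
    vocabulary = {}
    for tokens in tokenized_docs:
        for token in tokens:
            # Assign the next free column index to each unseen token.
            vocabulary.setdefault(token, len(vocabulary))
    counts = []
    for tokens in tokenized_docs:
        row = [0] * len(vocabulary)
        for token, n in Counter(tokens).items():
            row[vocabulary[token]] = n
        counts.append(row)
    return vocabulary, counts


def tfidf_rows(counts):
    """Stage 2 (weighting): assumed smoothed idf, then L2-normalize each row."""
    n_docs = len(counts)
    n_features = len(counts[0]) if counts else 0
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_features)]
    # Assumed smoothed idf; the exact formula is not given on this page.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for row in counts:
        scaled = [tf * w for tf, w in zip(row, idf)]
        length = math.sqrt(sum(v * v for v in scaled))
        weighted.append([v / length if length else 0.0 for v in scaled])
    return weighted


docs = [["red", "apple"], ["green", "apple"], ["red", "red", "car"]]
vocabulary, counts = count_tokens(docs)
vectors = tfidf_rows(counts)
```

In this toy corpus "apple" appears in two documents and "green" in one, so within the second document "green" receives the larger weight.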


build_analyzer() Return a callable that handles preprocessing and tokenization
build_preprocessor() Return a function to preprocess the text before tokenization
build_tokenizer() Return a function that splits a string into a sequence of tokens
decode(doc) Decode the input into a string of unicode symbols
fit(raw_documents) Learn a conversion law from documents to array data
fit_transform(raw_documents[, y]) Learn the representation and return the vectors.
get_feature_names() Array mapping from feature integer indices to feature names
get_params([deep]) Get parameters for the estimator
get_stop_words() Build or fetch the effective stop words list
inverse_transform(X) Return terms per document with nonzero entries in X.
set_params(**params) Set the parameters of the estimator.
transform(raw_documents[, copy]) Transform raw text documents to tf–idf vectors
__init__(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

build_analyzer()

Return a callable that handles preprocessing and tokenization


build_preprocessor()

Return a function to preprocess the text before tokenization


build_tokenizer()

Return a function that splits a string into a sequence of tokens
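
As an illustration of what such a tokenizer function might look like, the sketch below applies the default `token_pattern` from the class signature with `re.findall`. The name `build_tokenizer_sketch` is hypothetical, and the use of `findall` is an assumption about the behaviour, not a copy of the library code.

```python
import re


def build_tokenizer_sketch(token_pattern=r"\b\w\w+\b"):
    """Return a function that splits a string into a list of tokens."""
    compiled = re.compile(token_pattern)
    # findall returns every non-overlapping match, in order of appearance.
    return compiled.findall


tokenize = build_tokenizer_sketch()
tokens = tokenize("It is a sunny day")
```

With this pattern a token needs at least two word characters, so the single-letter word "a" is dropped. Lowercasing is not part of this step; in the class it is controlled by the `lowercase` parameter.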


decode(doc)

Decode the input into a string of unicode symbols

The decoding strategy depends on the vectorizer parameters.
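
A minimal sketch of how the `input`, `charset` and `charset_error` parameters could drive decoding is shown below. The function `decode_sketch` and its `input_type` argument (standing in for `input`) are hypothetical, and the `'filename'` and `'file'` values are assumptions, since this page only shows the default `'content'`.

```python
def decode_sketch(doc, input_type="content", charset="utf-8",
                  charset_error="strict"):
    """Return `doc` as text, loosely following the vectorizer parameters."""
    if input_type == "filename":
        # Assumed mode: `doc` is a path whose contents are read as bytes.
        with open(doc, "rb") as handle:
            doc = handle.read()
    elif input_type == "file":
        # Assumed mode: `doc` is an open file-like object.
        doc = doc.read()
    if isinstance(doc, bytes):
        # charset_error is passed through as the codec error handler.
        doc = doc.decode(charset, charset_error)
    return doc


text = decode_sketch(b"caf\xc3\xa9")
lenient = decode_sketch(b"ab\xff", charset_error="ignore")
```

With `charset_error='strict'` an invalid byte sequence raises `UnicodeDecodeError`, while `'ignore'` silently drops the offending bytes.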


fit(raw_documents)

Learn a conversion law from documents to array data

fit_transform(raw_documents, y=None)

Learn the representation and return the vectors.

Parameters :

raw_documents : iterable

an iterable which yields either str, unicode or file objects

Returns :

vectors : array, [n_samples, n_features]


get_feature_names()

Array mapping from feature integer indices to feature names


get_params(deep=True)

Get parameters for the estimator

Parameters :

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.


get_stop_words()

Build or fetch the effective stop words list


inverse_transform(X)

Return terms per document with nonzero entries in X.

Parameters :

X : {array, sparse matrix}, shape = [n_samples, n_features]

Returns :

X_inv : list of arrays, len = n_samples

List of arrays of terms.
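
The mapping from a matrix back to terms can be sketched in a few lines. The example below is illustrative only: `inverse_transform_sketch` is a hypothetical helper, plain lists stand in for the arrays and sparse matrices the method accepts, and `feature_names` plays the role of the index-to-name mapping.

```python
def inverse_transform_sketch(X, feature_names):
    """For each row of X, list the terms whose entries are nonzero."""
    return [
        [feature_names[j] for j, value in enumerate(row) if value != 0]
        for row in X
    ]


feature_names = ["apple", "car", "green", "red"]
X = [
    [0.6, 0.0, 0.0, 0.8],  # document 0 contains "apple" and "red"
    [0.0, 0.0, 1.0, 0.0],  # document 1 contains only "green"
]
terms = inverse_transform_sketch(X, feature_names)
```

The result has one entry per row of `X`, matching the documented `len = n_samples`; the weights themselves are discarded and only term membership is kept.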


set_params(**params)

Set the parameters of the estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns :

self

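
The `<component>__<parameter>` naming convention can be illustrated with a toy object. The class `ParamsSketch` below is hypothetical and is not the library's base class; it only shows how a double-underscore key might be split and forwarded to a nested component, and how `deep=True` might expose nested parameters.

```python
class ParamsSketch:
    """Hypothetical minimal object with get_params/set_params."""

    def __init__(self, **params):
        self._params = dict(params)

    def get_params(self, deep=True):
        out = dict(self._params)
        if deep:
            # Also expose parameters of nested components as name__subname.
            for name, value in self._params.items():
                if isinstance(value, ParamsSketch):
                    for sub_name, sub_value in value.get_params(True).items():
                        out[name + "__" + sub_name] = sub_value
        return out

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if name not in self._params:
                raise ValueError("Invalid parameter %r" % name)
            if sep:
                # Nested form: forward <parameter> to the named component.
                self._params[name].set_params(**{sub_key: value})
            else:
                self._params[name] = value
        return self


vectorizer = ParamsSketch(lowercase=True, norm="l2")
pipeline = ParamsSketch(vect=vectorizer, verbose=False)
result = pipeline.set_params(vect__norm="l1", verbose=True)
```

Returning `self` from `set_params` is what allows calls to be chained, as in `pipeline.set_params(...).get_params()`.
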
transform(raw_documents, copy=True)

Transform raw text documents to tf–idf vectors

Parameters :

raw_documents : iterable

an iterable which yields either str, unicode or file objects

Returns :

vectors : sparse matrix, [n_samples, n_features]