This documentation is for scikit-learn version 0.11-git.
8.7.2.3. sklearn.feature_extraction.text.Vectorizer

class sklearn.feature_extraction.text.Vectorizer(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

See also

CountVectorizer
Tokenize the documents and count the occurrences of each token, returning the counts as a sparse matrix
TfidfTransformer
Apply Term Frequency Inverse Document Frequency normalization to a sparse matrix of occurrence counts.
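
A minimal usage sketch (the three-document corpus is illustrative; in later scikit-learn releases this class is exposed as TfidfVectorizer):

>>> from sklearn.feature_extraction.text import Vectorizer
>>> corpus = ['the cat sat', 'the dog sat', 'the dog barked']
>>> vectorizer = Vectorizer()
>>> X = vectorizer.fit_transform(corpus)   # sparse tf-idf matrix
>>> X.shape
(3, 5)

The same representation can be obtained by chaining the two underlying estimators:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> counts = CountVectorizer().fit_transform(corpus)
>>> X2 = TfidfTransformer().fit_transform(counts)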

Methods

build_analyzer() Return a callable that handles preprocessing and tokenization
build_preprocessor() Return a function to preprocess the text before tokenization
build_tokenizer() Return a function that splits a string into a sequence of tokens
decode(doc) Decode the input into a string of unicode symbols
fit(raw_documents) Learn the vocabulary and idf from the raw documents
fit_transform(raw_documents[, y]) Learn the representation and return the vectors.
get_feature_names() Array mapping from feature integer indices to feature name
get_params([deep]) Get parameters for the estimator
get_stop_words() Build or fetch the effective stop words list
inverse_transform(X) Return terms per document with nonzero entries in X.
set_params(**params) Set the parameters of the estimator.
transform(raw_documents[, copy]) Transform raw text documents to tf-idf vectors
__init__(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
build_analyzer()

Return a callable that handles preprocessing and tokenization
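
For example (output illustrative, assuming the default word analyzer and token pattern, which lowercases the text and drops single-character tokens):

>>> analyze = Vectorizer().build_analyzer()
>>> analyze('The dog barked.')
[u'the', u'dog', u'barked']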

build_preprocessor()

Return a function to preprocess the text before tokenization

build_tokenizer()

Return a function that splits a string into a sequence of tokens

decode(doc)

Decode the input into a string of unicode symbols

The decoding strategy depends on the vectorizer parameters.

fit(raw_documents)

Learn the vocabulary and idf from the raw documents

fit_transform(raw_documents, y=None)

Learn the representation and return the vectors.

Parameters :

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns :

vectors : array, [n_samples, n_features]
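
As a sketch of the returned representation, each row of the matrix is a tf-idf vector, L2-normalized by default (norm='l2'), reusing the vectorizer and corpus from the example above:

>>> X = vectorizer.fit_transform(corpus)
>>> row = X[0].toarray()
>>> round(float((row ** 2).sum()), 6)   # unit L2 norm per row
1.0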

get_feature_names()

Array mapping from feature integer indices to feature name
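
A short sketch, continuing the example above: the returned list is indexed by column, so it can be used to label the features of the matrix:

>>> names = vectorizer.get_feature_names()
>>> len(names) == X.shape[1]
True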

get_params(deep=True)

Get parameters for the estimator

Parameters :

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

get_stop_words()

Build or fetch the effective stop words list

inverse_transform(X)

Return terms per document with nonzero entries in X.

Parameters :

X : {array, sparse matrix}, shape = [n_samples, n_features]

Returns :

X_inv : list of arrays, len = n_samples

List of arrays of terms.
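
A brief sketch, again reusing the fitted vectorizer from above (the term order within each array is not guaranteed):

>>> inv = vectorizer.inverse_transform(X)
>>> len(inv)   # one array of terms per document
3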

set_params(**params)

Set the parameters of the estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns :

self
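
For instance, in a pipeline the vectorizer’s own parameters can be reached with the double-underscore syntax (the pipeline below is an illustrative sketch, not part of this class):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> pipe = Pipeline([('vect', Vectorizer()), ('clf', LinearSVC())])
>>> pipe = pipe.set_params(vect__max_df=0.5, clf__C=10.0)
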
transform(raw_documents, copy=True)

Transform raw text documents to tf-idf vectors

Parameters :

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns :

vectors : sparse matrix, [n_samples, n_features]
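
A final sketch: once fitted, the vectorizer maps unseen documents into the same feature space, ignoring terms that are not in the learned vocabulary (continuing the example above):

>>> new_X = vectorizer.transform(['the cat barked loudly'])
>>> new_X.shape   # same 5 columns as the fitted vocabulary
(1, 5)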