8.7.2.3. sklearn.feature_extraction.text.Vectorizer¶
- class sklearn.feature_extraction.text.Vectorizer(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)¶
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
See also
- CountVectorizer
- Tokenize the documents, count the occurrences of each token, and return the counts as a sparse matrix
- TfidfTransformer
- Apply Term Frequency Inverse Document Frequency normalization to a sparse matrix of occurrence counts.
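The stated equivalence can be sketched as follows. This is a minimal sketch assuming a modern scikit-learn release, where this class has been renamed TfidfVectorizer (several parameter names on this page, such as charset and min_n/max_n, also changed in later versions):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Two-step route: token counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: the combined vectorizer
one_step = TfidfVectorizer().fit_transform(docs)

# Both yield the same sparse (n_samples, n_features) tf-idf matrix
assert abs(two_step - one_step).sum() < 1e-12
```

Because the combined vectorizer applies the same default weighting (l2 norm, smoothed idf) as TfidfTransformer, the two routes produce identical matrices.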
Methods
build_analyzer(): Return a callable that handles preprocessing and tokenization
build_preprocessor(): Return a function to preprocess the text before tokenization
build_tokenizer(): Return a function that splits a string into a sequence of tokens
decode(doc): Decode the input into a string of unicode symbols
fit(raw_documents): Learn the vocabulary and idf weights from the raw documents
fit_transform(raw_documents[, y]): Learn the representation and return the vectors
get_feature_names(): Array mapping from feature integer indices to feature name
get_params([deep]): Get parameters for the estimator
get_stop_words(): Build or fetch the effective stop words list
inverse_transform(X): Return terms per document with nonzero entries in X
set_params(**params): Set the parameters of the estimator
transform(raw_documents[, copy]): Transform raw text documents to tf-idf vectors
- __init__(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)¶
- build_analyzer()¶
Return a callable that handles preprocessing and tokenization
- build_preprocessor()¶
Return a function to preprocess the text before tokenization
- build_tokenizer()¶
Return a function that splits a string into a sequence of tokens
- decode(doc)¶
Decode the input into a string of unicode symbols
The decoding strategy depends on the vectorizer parameters.
- fit(raw_documents)¶
Learn the vocabulary and idf weights from the raw documents
- fit_transform(raw_documents, y=None)¶
Learn the representation and return the vectors.
Parameters: raw_documents : iterable
    An iterable which yields either str, unicode or file objects
Returns: vectors : sparse matrix, [n_samples, n_features]
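For illustration, a sketch of a single fit_transform call (assuming the modern TfidfVectorizer name for this class):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "banana cherry"]
vec = TfidfVectorizer()

# Learns the vocabulary and idf weights, then vectorizes in one pass
X = vec.fit_transform(docs)

# X has shape (2, 3): two documents, three distinct terms
```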
- get_feature_names()¶
Array mapping from feature integer indices to feature name
- get_params(deep=True)¶
Get parameters for the estimator
Parameters: deep : boolean, optional
    If True, will return the parameters for this estimator and contained subobjects that are estimators.
- get_stop_words()¶
Build or fetch the effective stop words list
- inverse_transform(X)¶
Return terms per document with nonzero entries in X.
Parameters: X : {array, sparse matrix}, shape = [n_samples, n_features]
Returns: X_inv : list of arrays, len = n_samples
    List of arrays of terms.
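A small sketch of the round trip (again assuming the modern TfidfVectorizer name for this class):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "banana cherry"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One array of terms per document: the terms whose tf-idf weight is nonzero
terms = vec.inverse_transform(X)
# sorted(terms[0]) -> ['apple', 'banana']
```

Note that inverse_transform recovers which terms occur in each document, not their original order or multiplicity.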
- set_params(**params)¶
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self
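A sketch of the nested <component>__<parameter> form, assuming a Pipeline wrapping this vectorizer (the modern TfidfVectorizer name is used, and SGDClassifier is just an illustrative second step):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

pipe = Pipeline([("vec", TfidfVectorizer()), ("clf", SGDClassifier())])

# Nested parameters are addressed as <component>__<parameter>
pipe.set_params(vec__lowercase=False, clf__alpha=1e-3)

assert pipe.get_params()["vec__lowercase"] is False
```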
- transform(raw_documents, copy=True)¶
Transform raw text documents to tf-idf vectors
Parameters: raw_documents : iterable
    An iterable which yields either str, unicode or file objects
Returns: vectors : sparse matrix, [n_samples, n_features]
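A sketch of transforming unseen documents with an already-fitted vectorizer (assuming the modern TfidfVectorizer name for this class):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(["apple banana", "banana cherry"])  # fixes the vocabulary

# transform reuses the fitted vocabulary; unseen terms ('durian') are ignored
X_new = vec.transform(["banana durian"])
# X_new has shape (1, 3); only the 'banana' column is nonzero
```

This is the key contrast with fit_transform: transform never extends the vocabulary, so the feature space stays aligned with the training documents.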