9.16.2.6. sklearn.feature_extraction.text.Vectorizer
- Vectorizer(analyzer=WordNGramAnalyzer(charset='utf-8', max_n=1, min_n=1,
      preprocessor=RomanPreprocessor(),
      stop_words=set(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom',...'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']),
      token_pattern='\b\w\w+\b'), max_df=1.0, max_features=None, norm='l2', use_idf=True, smooth_idf=True)
Convert a collection of raw documents to a matrix of tf–idf features.
Equivalent to CountVectorizer followed by TfidfTransformer.
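The two-step equivalence can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names `count_vectorize` and `tfidf_transform` are hypothetical, lowercasing stands in for the preprocessor, stop-word filtering is left out, and the smoothed idf formula (adding one to both the document count and the document frequency) is an assumption about what `smooth_idf=True` means.

```python
import math
import re
from collections import Counter

# Default token_pattern from the signature: words of two or more characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_vectorize(docs):
    """Step 1 (CountVectorizer-like): token counts over a learned vocabulary."""
    # Lowercasing is an assumption standing in for the preprocessor.
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    counts = [[0] * len(vocabulary) for _ in docs]
    for i, toks in enumerate(tokenized):
        for term, n in Counter(toks).items():
            counts[i][index[term]] = n
    return vocabulary, counts


def tfidf_transform(counts):
    """Step 2 (TfidfTransformer-like): idf reweighting, then l2 normalisation."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Assumed smoothing: add one to the document count and to each frequency.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return rows


docs = ["the cat sat", "the dog sat down", "cats and dogs"]
vocabulary, counts = count_vectorize(docs)
vectors = tfidf_transform(counts)
```

In this sketch each row of `vectors` has unit l2 norm, and a term that occurs in fewer documents ("cat") receives a larger weight than one that occurs in more ("the").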
Methods
fit, fit_transform, inverse_transform, set_params, transform
- Vectorizer.fit(raw_documents)
Learn a mapping from raw documents to array data.
- Vectorizer.fit_transform(raw_documents)
Learn the representation and return the vectors.
Parameters : raw_documents : iterable
an iterable which yields either str, unicode or file objects
Returns : vectors : array, [n_samples, n_features]
- Vectorizer.inverse_transform(X)
Return terms per document with nonzero entries in X.
Parameters : X : {array, sparse matrix}, shape = [n_samples, n_features]
Returns : X_inv : list of arrays, len = n_samples
List of arrays of terms.
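As a rough illustration of what "terms per document with nonzero entries" means, the sketch below uses a standalone function over plain lists. The function and its explicit `vocabulary` argument are hypothetical; the real method takes only X and relies on the vocabulary learned during fitting.

```python
def inverse_transform(X, vocabulary):
    """Return, for each row of X, the terms whose entries are nonzero."""
    return [
        [vocabulary[j] for j, value in enumerate(row) if value != 0]
        for row in X
    ]


# Hypothetical learned vocabulary: column j of X corresponds to vocabulary[j].
vocabulary = ["cat", "dog", "sat"]
X = [
    [0.7, 0.0, 0.7],  # document 0 contains "cat" and "sat"
    [0.0, 1.0, 0.0],  # document 1 contains only "dog"
]
X_inv = inverse_transform(X, vocabulary)
```

The weights themselves are discarded: only the identity of the nonzero columns is reported, so the original word order and counts cannot be recovered.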
- Vectorizer.set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns : self
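The <component>__<parameter> convention can be sketched with a small stand-in class hierarchy. All class names here (`SketchEstimator`, `SketchAnalyzer`, `SketchVectorizer`) are hypothetical and only mimic the behaviour described above; they are not the library's classes.

```python
class SketchEstimator:
    """Hypothetical base class with a set_params-style method."""

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, sep, sub_key = key.partition("__")
            if not hasattr(self, name):
                raise ValueError("Invalid parameter %r" % name)
            if sep:
                # Nested form: delegate the rest of the key to the component.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                # Simple form: set the attribute directly.
                setattr(self, name, value)
        return self  # returning self allows call chaining


class SketchAnalyzer(SketchEstimator):
    def __init__(self, min_n=1, max_n=1):
        self.min_n = min_n
        self.max_n = max_n


class SketchVectorizer(SketchEstimator):
    def __init__(self, analyzer=None, norm="l2", use_idf=True):
        self.analyzer = analyzer if analyzer is not None else SketchAnalyzer()
        self.norm = norm
        self.use_idf = use_idf


vec = SketchVectorizer()
result = vec.set_params(analyzer__max_n=2, norm="l1")
```

After the call, `vec.analyzer.max_n` is 2 and `vec.norm` is "l1"; the nested key reached the analyzer component without replacing it.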
- Vectorizer.transform(raw_documents, copy=True)
Transform raw text documents to tf–idf vectors.
Parameters : raw_documents : iterable
an iterable which yields either str, unicode or file objects
Returns : vectors : sparse matrix, [n_samples, n_features]