This documentation is for scikit-learn version 0.11-git.

8.7.2.2. sklearn.feature_extraction.text.WordNGramAnalyzer

class sklearn.feature_extraction.text.WordNGramAnalyzer(charset='utf-8', min_n=1, max_n=1, preprocessor=RomanPreprocessor(), stop_words='english', token_pattern=u'\b\w\w+\b')

Simple analyzer: transform a text document into a sequence of word tokens.

This simple implementation does:
  • lower case conversion
  • unicode accents removal
  • token extraction using unicode regexp word boundaries, keeping tokens of at least 2 symbols (by default)
  • output of token n-grams (unigrams only by default)

The stop words argument may be “english” for a built-in list of English stop words or a collection of strings. Note that stop word filtering is performed after preprocessing, which may include accent stripping.
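
A minimal usage sketch (not part of the original reference; the token values in the comments are illustrative of the lowercasing, stop word removal and n-gram expansion described above):

from sklearn.feature_extraction.text import WordNGramAnalyzer

# Word unigrams and bigrams, with the built-in English stop word list.
analyzer = WordNGramAnalyzer(min_n=1, max_n=2, stop_words='english')

tokens = analyzer.analyze(u"The Quick Brown Fox")
# 'the' is a stop word and is dropped after lowercasing; the result
# should contain the remaining unigrams and bigrams, e.g.
# 'quick', 'brown', 'fox', 'quick brown', 'brown fox'.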

Methods

analyze(text_document)    From documents to tokens.
set_params(**params)      Set the parameters of the estimator.

__init__(charset='utf-8', min_n=1, max_n=1, preprocessor=RomanPreprocessor(), stop_words='english', token_pattern=u'\b\w\w+\b')

analyze(text_document)

From documents to tokens.

set_params(**params)

Set the parameters of the estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self
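
A brief sketch of set_params on this analyzer (the parameter values are hypothetical; the <component>__<parameter> form only applies when the analyzer is nested inside another estimator):

from sklearn.feature_extraction.text import WordNGramAnalyzer

analyzer = WordNGramAnalyzer()

# Update constructor parameters in place; set_params returns the analyzer itself.
analyzer = analyzer.set_params(max_n=2, stop_words=None)

# If this analyzer were a component of a nested object (e.g. exposed by an
# enclosing estimator under a parameter named 'analyzer'), the same update
# would be written on the enclosing estimator as set_params(analyzer__max_n=2).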