8.7.2.3. sklearn.feature_extraction.text.CharNGramAnalyzer

class sklearn.feature_extraction.text.CharNGramAnalyzer(charset='utf-8', preprocessor=RomanPreprocessor(), min_n=3, max_n=6)

Compute character n-gram features of a text document.

This analyzer is language agnostic: it works well even for languages where word segmentation is less trivial than in English, such as Chinese or German.

Because of this, it can be considered a basic morphological analyzer.
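To make the idea concrete, here is a minimal sketch of character n-gram extraction in plain Python. It is not the library's implementation: the helper name `char_ngrams` is hypothetical, charset decoding and the preprocessor step are left out, and the ordering of the output (shorter n-grams first) is an assumption. Only the `min_n` and `max_n` names are taken from the signature above.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """Return every character n-gram of `text` with min_n <= n <= max_n.

    Hypothetical helper for illustration only; the real analyzer also
    decodes and preprocesses the document before extracting n-grams.
    """
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text, one character at a time.
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams


tokens = char_ngrams("donau", min_n=3, max_n=4)
print(tokens)  # ['don', 'ona', 'nau', 'dona', 'onau']
```

Because the window slides over raw characters, no word boundaries are needed, which is why the approach carries over to languages without whitespace-delimited words. A text shorter than `min_n` simply yields no n-grams.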

Methods

`analyze`(text_document)	Convert a text document into tokens.
`set_params`(**params)	Set the parameters of the estimator.

__init__(charset='utf-8', preprocessor=RomanPreprocessor(), min_n=3, max_n=6)

analyze(text_document)

Convert a text document into tokens.

set_params(**params)

Set the parameters of the estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it’s possible to update each component of a nested object.

Returns :	self
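The `<component>__<parameter>` convention can be illustrated with a small self-contained sketch. The `Component` class below is hypothetical and only mimics the naming scheme; it is not scikit-learn's code, and it assumes every parameter is stored as a plain attribute.

```python
class Component:
    """Hypothetical object whose parameters are plain attributes."""

    def __init__(self, **params):
        for name, value in params.items():
            setattr(self, name, value)

    def set_params(self, **params):
        for key, value in params.items():
            # Split on the first double underscore only:
            # "analyzer__max_n" -> ("analyzer", "__", "max_n")
            name, sep, sub_key = key.partition("__")
            if sep:
                # Nested form: forward the remainder to the sub-component.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                # Simple form: set the parameter on this object.
                setattr(self, name, value)
        return self  # returning self allows call chaining


analyzer = Component(min_n=3, max_n=6)
pipeline = Component(analyzer=analyzer, verbose=False)

# One call updates a parameter of the nested component and one of the outer object.
result = pipeline.set_params(analyzer__max_n=4, verbose=True)
print(analyzer.max_n, pipeline.verbose, result is pipeline)  # 4 True True
```

Splitting on the first `__` only means deeper nesting such as `a__b__c` is handled by forwarding `b__c` to component `a`, which then repeats the same step.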
