scikits.learn.feature_extraction.text.CharNGramAnalyzer

class scikits.learn.feature_extraction.text.CharNGramAnalyzer(charset='utf-8', preprocessor=RomanPreprocessor(), min_n=3, max_n=6)

Compute character n-gram features of a text document

This analyzer is language agnostic: it works well even for languages where word segmentation is less trivial than in English, such as Chinese and German.

Because of this, it can be considered a basic morphological analyzer.
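
To make the idea concrete, here is a minimal sketch of character n-gram extraction in plain Python. It is not the library's implementation: the function name char_ngrams and the lowercasing/whitespace normalization are assumptions made for illustration (the real class delegates preprocessing to its preprocessor argument). Only the min_n and max_n parameters mirror the signature above.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """Return every character n-gram of `text` with min_n <= n <= max_n.

    Illustrative helper only; not part of scikits.learn.
    """
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())

    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string, one character at a time.
        for start in range(len(normalized) - n + 1):
            ngrams.append(normalized[start:start + n])
    return ngrams


# No word segmentation is needed, so compound words are handled naturally.
tokens = char_ngrams("Dampfschiff", min_n=3, max_n=4)
```

Because the window slides over raw characters, shared substrings such as word stems and affixes show up as shared features across documents, which is what gives the approach its rough morphological flavor.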

Methods

__init__(charset='utf-8', preprocessor=RomanPreprocessor(), min_n=3, max_n=6)
analyze(text_document)

From documents to tokens