scikits.learn.feature_extraction.text.TfidfTransformer

class scikits.learn.feature_extraction.text.TfidfTransformer(use_tf=True, use_idf=True)

Transform a count matrix to a TF or TF-IDF representation

TF means term-frequency while TF-IDF means term-frequency times inverse document-frequency:

http://en.wikipedia.org/wiki/TF-IDF

The goal of using TF-IDF instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

TF-IDF can be seen as a smooth alternative to stop-word filtering.
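The reweighting described above can be illustrated with a small pure-Python sketch. It assumes the textbook definitions (term frequency as count divided by document length, IDF as log(n_docs / document frequency)); the exact formula used by this class is not stated on this page and may differ. The toy `counts` matrix is made up for illustration.

```python
import math

# Rows are documents, columns are tokens; entries are raw counts.
# Column 0 plays the role of a very frequent token that appears everywhere.
counts = [
    [3, 1, 0],
    [2, 0, 1],
    [4, 0, 0],
]

n_docs = len(counts)
n_tokens = len(counts[0])

# Document frequency: in how many documents does each token occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_tokens)]

# Assumed textbook IDF (not necessarily the library's formula).
idf = [math.log(n_docs / d) for d in df]

# Term frequency: counts normalized by the length of each document.
tf = [[c / sum(row) for c in row] for row in counts]

# TF-IDF: each term frequency scaled by the global weight of its token.
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]
```

Under these assumptions the token in column 0, which occurs in every document, receives an IDF of zero and vanishes from the representation, while the rarer tokens keep a positive weight. This is the sense in which TF-IDF behaves like a soft form of stop-word filtering.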

Parameters :

use_tf: boolean :

enable term-frequency normalization

use_idf: boolean :

enable inverse-document-frequency reweighting

Methods

__init__(use_tf=True, use_idf=True)

fit(X, y=None)

Learn the IDF vector (global term weights)

Parameters :

X: sparse matrix, [n_samples, n_features] :

a matrix of term/token counts

transform(X, copy=True)

Transform a count matrix to a TF or TF-IDF representation

Parameters :

X: sparse matrix, [n_samples, n_features] :

a matrix of term/token counts

Returns :

vectors: sparse matrix, [n_samples, n_features] :

the TF or TF-IDF representation of X
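The fit/transform split can be sketched with a hypothetical stand-in class. `TinyTfidf` below is not the library class: it works on plain lists of lists rather than sparse matrices, omits the `copy` argument, and uses an assumed smoothed IDF formula chosen only so that the example never divides by zero.

```python
import math


class TinyTfidf:
    """Hypothetical illustration of the fit/transform protocol."""

    def __init__(self, use_tf=True, use_idf=True):
        self.use_tf = use_tf
        self.use_idf = use_idf
        self.idf_ = None

    def fit(self, X, y=None):
        # Learn one global weight per column from the training counts.
        n_samples = len(X)
        n_features = len(X[0])
        df = [sum(1 for row in X if row[j] > 0) for j in range(n_features)]
        # Assumed smoothing: add one to numerator and denominator.
        self.idf_ = [math.log((n_samples + 1) / (d + 1)) + 1.0 for d in df]
        return self

    def transform(self, X):
        out = []
        for row in X:
            values = list(row)
            total = sum(values)
            if self.use_tf and total > 0:
                # Normalize counts by document length.
                values = [c / total for c in values]
            if self.use_idf:
                # Reweight each column by its learned global weight.
                values = [v * w for v, w in zip(values, self.idf_)]
            out.append(values)
        return out


train = [[3, 1, 0], [2, 0, 1], [4, 0, 0]]
model = TinyTfidf().fit(train)

# The IDF weights learned on `train` are reused for unseen documents.
new_vectors = model.transform([[1, 1, 0]])
```

In this sketch the new document contains the two tokens equally often, yet the token that was rare in the training data ends up with the larger weight. With both flags set to False, `transform` returns the counts unchanged.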
