This documentation is for scikit-learn version 0.11-git.

8.4.1.7. sklearn.datasets.load_files

sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, charset=None, charset_error='strict', random_state=0)

Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder structure such as the following:

container_folder/
    category_1_folder/
        file_1.txt file_2.txt ... file_42.txt
    category_2_folder/
        file_43.txt file_44.txt ...

The folder names are used as supervised signal label names. The individual file names are not important.

This function does not try to extract features into a numpy array or scipy sparse matrix. In addition, if load_content is False it does not try to load the files into memory.

To use utf-8 text files in a scikit-learn classification or clustering algorithm you will first need to use the sklearn.feature_extraction.text module to build a feature extraction transformer that suits your problem.

Similar feature extractors should be built for other kinds of unstructured data input such as images, audio, video, ...
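For instance, a minimal sketch of the text workflow described above might look as follows (the container path and category folder names are hypothetical, and CountVectorizer from sklearn.feature_extraction.text is just one possible feature extractor):

    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical layout: /data/reviews/pos/*.txt and /data/reviews/neg/*.txt
    bunch = load_files('/data/reviews', charset='utf-8')

    # Turn the raw utf-8 documents into a sparse term-count matrix
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(bunch.data)
    y = bunch.target  # integer labels derived from the subfolder names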

Parameters :

container_path : string or unicode

Path to the main folder holding one subfolder per category

description : string or unicode, optional (default=None)

A paragraph describing the characteristics of the dataset: its source, reference, etc.

categories : A collection of strings or None, optional (default=None)

If None (default), load all the categories. If not None, list of category names to load (other categories ignored).

load_content : boolean, optional (default=True)

Whether or not to load the content of the different files. If True, a ‘data’ attribute containing the text information is present in the data structure returned. If not, a ‘filenames’ attribute gives the path to the files (see the sketch after this parameter list).

charset : string or None (default is None)

If None, do not try to decode the content of the files (e.g. for images or other non-text content). If not None, charset to use to decode text files if load_content is True.

charset_error : {‘strict’, ‘ignore’, ‘replace’}

Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given charset. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.

shuffle : bool, optional (default=True)

Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.

random_state : int, RandomState instance or None, optional (default=0)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
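As referenced above, here is a short sketch combining several of these parameters, following the signature documented on this page (the path and category names are hypothetical):

    from sklearn.datasets import load_files

    # Load only two categories, decode files as utf-8, replace undecodable
    # bytes instead of raising, and shuffle with a fixed seed.
    bunch = load_files('/data/reviews',
                       categories=['pos', 'neg'],
                       charset='utf-8',
                       charset_error='replace',
                       shuffle=True,
                       random_state=42)

    # With load_content=False only the file paths are collected, not the text.
    paths_only = load_files('/data/reviews', load_content=False)
    print(paths_only.filenames[:3])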

Returns :

data : Bunch

Dictionary-like object, the interesting attributes are: either ‘data’, the raw text data to learn, or ‘filenames’, the files holding it; ‘target’, the classification labels (integer index); ‘target_names’, the meaning of the labels; and ‘DESCR’, the full description of the dataset.
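For illustration, a small sketch of reading these attributes from the returned Bunch (same hypothetical container folder as above):

    from sklearn.datasets import load_files

    bunch = load_files('/data/reviews', charset='utf-8',
                       description='Hypothetical movie review corpus')

    print(bunch.target_names)   # e.g. ['neg', 'pos'], taken from folder names
    print(bunch.data[0][:80])   # beginning of the first (shuffled) document
    print(bunch.target[0])      # integer label of that document
    print(bunch.filenames[0])   # path of the file it was read from
    print(bunch.DESCR)          # the description string passed above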