9.19.1.1. sklearn.datasets.load_files

sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, random_state=None)

Load text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder structure such as the following:

container_folder/
    category_1_folder/
        file_1.txt file_2.txt ... file_42.txt
    category_2_folder/
        file_43.txt file_44.txt ...

The folder names are used as supervised signal label names. The individual file names are not important.

This function does not try to extract features into a numpy array or scipy sparse matrix. In addition, if load_content is False, it does not try to load the files into memory.
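As an illustration of the layout described above, the following minimal sketch builds a small container folder in a temporary directory and loads it. The category folder names, file names, and file contents are made up for the example, and shuffle=False is passed only to keep the order predictable.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical layout: two category folders holding three small text files.
layout = {
    "category_1_folder": {"file_1.txt": "first sample", "file_2.txt": "second sample"},
    "category_2_folder": {"file_43.txt": "third sample"},
}

with tempfile.TemporaryDirectory() as container_folder:
    for category, files in layout.items():
        os.makedirs(os.path.join(container_folder, category))
        for name, text in files.items():
            path = os.path.join(container_folder, category, name)
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(text)

    # Files are read while the temporary folder still exists.
    dataset = load_files(container_folder, shuffle=False)

# Folder names become the label names; target holds one integer index per file.
target_names = list(dataset.target_names)
targets = dataset.target.tolist()
n_samples = len(dataset.data)

print(target_names)
print(targets)
print(n_samples)
```

The returned object exposes the raw file contents in its data attribute; no feature extraction has happened at this point.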

To use UTF-8 text files in a scikit-learn classification or clustering algorithm, you will first need to use the sklearn.feature_extraction.text module to build a feature extraction transformer that suits your problem.
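A minimal sketch of such a feature extraction step is shown here, assuming a plain bag-of-words representation is adequate. The raw_documents list is a made-up stand-in for the data attribute of a loaded dataset, and CountVectorizer is only one of the extractors that could be chosen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the data attribute returned by load_files: raw UTF-8 bytes.
raw_documents = [
    b"the cat sat",
    b"the dog barked",
    b"a cat and a dog",
]

# CountVectorizer decodes bytes as UTF-8 by default and builds a sparse
# document-term matrix with one row per document. Its default tokenizer
# keeps only tokens of two or more characters, so "a" is dropped.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw_documents)

vocabulary = sorted(vectorizer.vocabulary_)
print(X.shape)
print(vocabulary)
```

The resulting sparse matrix X can then be passed to an estimator together with the target array of the loaded dataset.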

Similar feature extractors should be built for other kinds of unstructured data input such as images, audio, video, ...

Parameters :

container_path : string or unicode

Path to the main folder holding one subfolder per category

description : string or unicode, optional (default=None)

A paragraph describing the characteristics of the dataset: its source, reference, etc.

categories : A collection of strings or None, optional (default=None)

If None (default), load all the categories. If not None, list of category names to load (other categories ignored).

load_content : boolean, optional (default=True)

Whether to load the content of the different files. If True, a ‘data’ attribute containing the text information is present in the data structure returned. If False, a ‘filenames’ attribute gives the paths to the files.

shuffle : bool, optional (default=True)

Whether to shuffle the data. This might be important for models that assume the samples are independent and identically distributed (i.i.d.), such as those trained with stochastic gradient descent.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.
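The parameters above can be exercised with a short sketch such as the following. The build_container helper, the category names (alpha, beta, gamma), and the file contents are hypothetical and exist only to give the function something to load.

```python
import os
import tempfile

from sklearn.datasets import load_files


def build_container(root):
    """Create a hypothetical three-category layout with two files per category."""
    for category in ("alpha", "beta", "gamma"):
        os.makedirs(os.path.join(root, category))
        for i in range(2):
            path = os.path.join(root, category, "doc_%d.txt" % i)
            with open(path, "w", encoding="utf-8") as fh:
                fh.write("%s text %d" % (category, i))


with tempfile.TemporaryDirectory() as root:
    build_container(root)

    # categories: only the listed subfolders are loaded, the rest are ignored.
    subset = load_files(root, categories=["alpha", "gamma"], shuffle=False)

    # load_content=False: file contents are not read; paths and labels remain.
    paths_only = load_files(root, load_content=False, shuffle=False)

    # shuffle=True with a fixed integer seed: the same order on every call.
    first = load_files(root, shuffle=True, random_state=42)
    second = load_files(root, shuffle=True, random_state=42)

subset_names = list(subset.target_names)
has_data = "data" in paths_only
n_paths = len(paths_only.filenames)
same_order = first.filenames.tolist() == second.filenames.tolist()

print(subset_names)
print(has_data, n_paths)
print(same_order)
```

Fixing random_state to an integer is a convenient way to get a reproducible sample order while still breaking up the per-category grouping of the files.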