8.4.1.7. sklearn.datasets.load_files
- sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, random_state=0)
Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level folder structure such as the following:
- container_folder/
    - category_1_folder/
        - file_1.txt file_2.txt ... file_42.txt
    - category_2_folder/
        - file_43.txt file_44.txt ...
The folder names are used as supervised signal label names. The individual file names are not important.
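The mapping from folders to labels can be illustrated with a small standard-library sketch. It is not the library's implementation: `scan_container` is a hypothetical helper, the `ham`/`spam` folders are made-up data, and numbering the categories in sorted folder-name order is an assumption made here for determinism.

```python
import tempfile
from pathlib import Path


def scan_container(container_path):
    """Hypothetical helper: pair every file with the index of its category folder."""
    root = Path(container_path)
    # Category names are the names of the immediate subfolders (sorted: an assumption).
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    filenames, target = [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                filenames.append(str(path))
                target.append(label)
    return filenames, target, target_names


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny two-level layout: container/<category>/<file>.
    for category, texts in {"ham": ["hello"], "spam": ["buy now", "free offer"]}.items():
        folder = Path(tmp) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"doc_{i}.txt").write_text(text, encoding="utf-8")

    filenames, target, target_names = scan_container(tmp)

print(target_names)  # ['ham', 'spam']
print(target)        # [0, 1, 1]
```

The file names (`doc_0.txt`, ...) play no role in the labels; only the parent folder does.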
This function does not try to extract features into a numpy array or scipy sparse matrix. In addition, if load_content is false it does not try to load the files in memory.
To use utf-8 text files in a scikit-learn classification or clustering algorithm, you will first need to use the sklearn.feature_extraction.text module to build a feature extraction transformer that suits your problem.
Similar feature extractors should be built for other kinds of unstructured data input, such as images, audio, video, etc.
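As a rough sketch of that two-step workflow, the example below loads a throwaway corpus and turns it into a sparse term-count matrix. It assumes a scikit-learn release in which the text extractors live in `sklearn.feature_extraction.text` and `CountVectorizer` is available; the folder names and documents are invented for illustration. Depending on the release, `data` may hold raw bytes rather than decoded strings, so the sketch decodes defensively.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

with tempfile.TemporaryDirectory() as tmp:
    # Made-up corpus: one subfolder per category, one file per document.
    docs = {"ham": ["see you at lunch"], "spam": ["free prize now", "win a free prize"]}
    for category, texts in docs.items():
        folder = Path(tmp) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"doc_{i}.txt").write_text(text, encoding="utf-8")

    # load_files only reads the raw file contents; no features are extracted yet.
    dataset = load_files(tmp, shuffle=False)

# Decode to text if the loader returned bytes (behaviour depends on the release).
raw = [d.decode("utf-8") if isinstance(d, bytes) else d for d in dataset.data]

# Feature extraction is a separate, explicit step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw)

print(X.shape[0], "documents,", X.shape[1], "distinct terms")
print(dataset.target)
```

`X` and `dataset.target` can then be passed to a classifier's `fit` method; a different extractor would be swapped in for non-text data.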
Parameters :
container_path : string or unicode
Path to the main folder holding one subfolder per category.
description : string or unicode, optional (default=None)
A paragraph describing the characteristics of the dataset: its source, reference, etc.
categories : A collection of strings or None, optional (default=None)
If None (default), load all the categories. If not None, list of category names to load (other categories ignored).
load_content : boolean, optional (default=True)
Whether to load the content of the files. If True, a ‘data’ attribute containing the text information is present in the returned data structure. If False, a ‘filenames’ attribute gives the paths to the files.
shuffle : bool, optional (default=True)
Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.
random_state : int, RandomState instance or None, optional (default=0)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Returns :
data : Bunch
Dictionary-like object. The interesting attributes are: either ‘data’, the raw text data to learn, or ‘filenames’, the files holding it; ‘target’, the classification labels (integer index); ‘target_names’, the meaning of the labels; and ‘DESCR’, the full description of the dataset.
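The parameters and returned attributes described above can be exercised as in the following sketch. The `alpha`/`beta`/`gamma` folders and the "toy corpus" description are made up, and all optional arguments are passed by keyword on the assumption that some releases require this.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as tmp:
    # Made-up container with three categories, one file each.
    for category in ("alpha", "beta", "gamma"):
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "a.txt").write_text(f"text about {category}", encoding="utf-8")

    # Restrict to two categories and only list the files, without reading them.
    listing = load_files(
        tmp, categories=["alpha", "gamma"], load_content=False, shuffle=False
    )

    # Load every category into memory, shuffled with a fixed seed for reproducibility.
    full = load_files(tmp, description="toy corpus", shuffle=True, random_state=42)

print(list(listing.target_names))  # the categories that were kept
print("data" in listing)           # no 'data' attribute when load_content=False
print(len(full.data), full.DESCR)  # loaded documents and the description
```

Passing `load_content=False` is useful when the corpus is too large to hold in memory and the files will be read lazily from `filenames` later.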