This documentation is for scikit-learn version 0.11-git.

8.4.1.2. sklearn.datasets.fetch_20newsgroups

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, download_if_missing=True)

Load the filenames of the 20 newsgroups dataset.

Parameters :

subset: ‘train’, ‘test’ or ‘all’, optional, default: ‘train’ :

Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.

data_home: optional, default: None :

Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

categories: None or collection of string or unicode :

If None (default), load all the categories. Otherwise, a list of the category names to load; all other categories are ignored.

shuffle: bool, optional :

Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.

random_state: numpy random number generator or seed integer :

Used to shuffle the dataset.

download_if_missing: optional, True by default :

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
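
The parameters above can be combined as in the following minimal sketch. The helper name load_newsgroups_offline is hypothetical and not part of scikit-learn, and the category names in the comments are only illustrative. The sketch assumes that the IOError described for download_if_missing is what is raised when nothing is cached, and it returns None in that case instead of propagating the error.

```python
from sklearn.datasets import fetch_20newsgroups


def load_newsgroups_offline(data_home=None, categories=None):
    """Return the cached training subset, or None if it is not on disk.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    try:
        return fetch_20newsgroups(
            data_home=data_home,        # None means ~/scikit_learn_data
            subset="train",             # 'train', 'test' or 'all'
            categories=categories,      # e.g. ["sci.space", "rec.autos"]
            shuffle=True,               # shuffle the loaded samples
            random_state=42,            # fixed seed for a reproducible order
            download_if_missing=False,  # do not try to download the data
        )
    except IOError:
        # The data is not locally available and downloading was disabled.
        return None
```

Calling load_newsgroups_offline(categories=["sci.space"]) on a machine where the dataset has never been downloaded should return None under these assumptions. Leaving download_if_missing at its default of True would try to download the data instead.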