8.4.1.17. sklearn.datasets.load_svmlight_file

sklearn.datasets.load_svmlight_file(f, n_features=None, dtype=numpy.float64, multilabel=False)

Load datasets in the svmlight / libsvm format into a sparse CSR matrix.

This is a text-based format with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.

The first element of each line can be used to store a target variable to predict.

This format is used as the default format for both svmlight and the libsvm command line programs.
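As a minimal sketch of the layout described above, the snippet below loads two hand-written samples from an in-memory binary buffer. The sample values are made up for illustration. Each line is a target followed by index:value pairs, with zero-valued features left out. How the feature indices map to column numbers may depend on the scikit-learn version, so the example only inspects the number of rows and of stored values.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Two illustrative samples: "<target> <index>:<value> ...".
# Features that are zero are simply not written.
raw = b"1 1:0.5 3:2.0\n-1 2:1.5\n"

# The loader accepts a file-like object opened in binary mode.
X, y = load_svmlight_file(BytesIO(raw))

print(X.shape[0])  # number of samples
print(X.nnz)       # number of stored (non-zero) values
print(y)           # targets taken from the first element of each line
```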

Parsing a text-based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near-instantaneous loading of memmapped structures on subsequent calls.

This implementation is naive: it allocates too much memory and is slow because it is written in Python. On large datasets, it is recommended to use an optimized loader such as:
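The caching advice can be sketched roughly as follows. The file name, cache directory, and the wrapper name load_cached are hypothetical, chosen only for this example. The cache is keyed on the path argument, so under this setup it would not notice a file whose contents change while its path stays the same.

```python
import os
import tempfile

from joblib import Memory
from sklearn.datasets import load_svmlight_file

# Hypothetical locations used only for this example.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "train.svmlight")
with open(data_path, "w") as fh:
    fh.write("1 1:0.5 3:2.0\n-1 2:1.5\n")

# Results of cached calls are persisted under this directory.
memory = Memory(os.path.join(tmp_dir, "cache"), verbose=0)


@memory.cache
def load_cached(path):
    # Parsed once per distinct path; later calls read the stored result.
    X, y = load_svmlight_file(path)
    return X, y


X1, y1 = load_cached(data_path)  # parses the text file
X2, y2 = load_cached(data_path)  # served from the on-disk cache
```

joblib.Memory also takes an mmap_mode argument for memory-mapping the stored arrays when they are reloaded; it is left at its default here to keep the sketch small.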

https://github.com/mblondel/svmlight-loader

Parameters :

f: str or file-like object opened in binary mode :

(Path to) a file to load.

n_features: int or None :

The number of features to use. If None, it will be inferred. This argument is useful for loading several files that are subsets of a bigger sliced dataset: each subset might not contain examples of every feature, so the inferred shape might vary from one slice to another.

multilabel: boolean, optional :

Whether samples may have several labels each (see http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).
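To illustrate the n_features parameter, here is a small sketch with two made-up slices of one dataset. The second slice mentions a higher feature index than the first, so the inferred widths differ. The value N_FEATURES = 10 is an assumed overall dimensionality chosen for the example, not something taken from a real dataset.

```python
from io import BytesIO

import scipy.sparse as sp
from sklearn.datasets import load_svmlight_file

# Two illustrative slices of one larger dataset.
slice_a = b"1 1:0.5 2:1.0\n"
slice_b = b"-1 1:0.3 5:2.0\n"

# Inferred widths depend on the largest index seen in each slice.
Xa_inferred, _ = load_svmlight_file(BytesIO(slice_a))
Xb_inferred, _ = load_svmlight_file(BytesIO(slice_b))
print(Xa_inferred.shape, Xb_inferred.shape)

# Assumed dimensionality of the full dataset (hypothetical value).
N_FEATURES = 10
Xa, ya = load_svmlight_file(BytesIO(slice_a), n_features=N_FEATURES)
Xb, yb = load_svmlight_file(BytesIO(slice_b), n_features=N_FEATURES)

# With a fixed width the slices can be stacked row-wise.
X_all = sp.vstack([Xa, Xb])
print(X_all.shape)
```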

Returns :

(X, y) :

where X is a scipy.sparse matrix of shape (n_samples, n_features), and y is an ndarray of shape (n_samples,) or, in the multilabel case, a list of tuples of length n_samples.
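The multilabel return type can be seen in a short sketch like the one below. The input is made up for illustration; labels for one sample are written as a comma-separated list in the first element of the line.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Illustrative multilabel input: the first sample carries labels 1 and 2,
# the second carries only label 3.
raw = b"1,2 1:0.5 3:2.0\n3 2:1.5\n"

X, y = load_svmlight_file(BytesIO(raw), multilabel=True)

print(X.shape[0])  # number of samples
print(y)           # one tuple of labels per sample
```

Because samples can carry different numbers of labels, y is a plain Python list here rather than an ndarray.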
