8.4.1.17. sklearn.datasets.load_svmlight_file
- sklearn.datasets.load_svmlight_file(f, n_features=None, dtype=<type 'numpy.float64'>, multilabel=False)
- Load datasets in the svmlight / libsvm format into a sparse CSR matrix.
- This is a text-based format with one sample per line. It does not store zero-valued features, which makes it suitable for sparse datasets.
- The first element of each line can be used to store a target variable to predict.
- This format is the default for both the svmlight and the libsvm command-line programs.
- Parsing a text-based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near-instantaneous loading of memmapped structures on subsequent calls.
- This implementation is naive: it allocates too much memory and is slow because it is written in Python. On large datasets, it is recommended to use an optimized loader instead.
- Parameters:
  - f: str or file-like open in binary mode
    (Path to) a file to load.
  - n_features: int or None
    The number of features to use. If None, it will be inferred. This argument is useful for loading several files that are subsets of a bigger sliced dataset: a subset might not contain an example of every feature, so the inferred shape might vary from one slice to another.
  - multilabel: boolean, optional
    Samples may have several labels each (see http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).
- Returns:
  - (X, y)
    where X is a scipy.sparse matrix of shape (n_samples, n_features), and y is an ndarray of shape (n_samples,) or, in the multilabel case, a list of tuples of length n_samples.
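The file layout and the effect of n_features can be illustrated with a minimal sketch. It assumes an installed scikit-learn; the in-memory byte strings, the variable names, and the width of 10 are made up for illustration and stand in for real slice files.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Each line is "<target> <index>:<value> ..."; zero-valued features are omitted.
# The two byte strings stand in for two slices of one larger dataset.
slice_a = b"1 1:0.5 3:2.0\n-1 2:1.5\n"
slice_b = b"1 1:1.0 7:3.0\n"

# With n_features=None the width is inferred per file, so it can differ by slice.
X_a, y_a = load_svmlight_file(BytesIO(slice_a))
X_b, y_b = load_svmlight_file(BytesIO(slice_b))
print(X_a.shape, X_b.shape)

# Fixing n_features gives every slice the same number of columns.
N_FEATURES = 10  # hypothetical width of the full dataset
X_a, y_a = load_svmlight_file(BytesIO(slice_a), n_features=N_FEATURES)
X_b, y_b = load_svmlight_file(BytesIO(slice_b), n_features=N_FEATURES)
print(X_a.shape, X_b.shape, y_a, y_b)

# Multilabel input: comma-separated labels in the first element of each line.
multi = b"1,2 1:0.5\n3 2:1.0\n"
X_m, y_m = load_svmlight_file(BytesIO(multi), multilabel=True)
print(y_m)
```

Matrices loaded with a shared n_features can be stacked or fed to the same estimator, which is not guaranteed when each width is inferred separately.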

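The caching recommendation can be sketched as follows, assuming joblib is installed alongside scikit-learn. The temporary directory and the file name sample.svmlight are hypothetical and only keep the example self-contained; this is one possible arrangement, not a prescribed setup.

```python
import os
import tempfile

from joblib import Memory
from sklearn.datasets import load_svmlight_file

# A temporary directory keeps the sketch self-contained; a real project would
# point the cache at a persistent location so it survives between runs.
work_dir = tempfile.mkdtemp()
data_path = os.path.join(work_dir, "sample.svmlight")  # hypothetical file name
with open(data_path, "wb") as fh:
    fh.write(b"1 1:0.5 3:2.0\n-1 2:1.5\n")

# mmap_mode="r" asks joblib to memory-map cached arrays when reloading them.
memory = Memory(os.path.join(work_dir, "cache"), mmap_mode="r", verbose=0)
cached_load = memory.cache(load_svmlight_file)

X1, y1 = cached_load(data_path)  # parses the text file and stores the result
X2, y2 = cached_load(data_path)  # same arguments: served from the cache
print(X1.shape, y2)
```

The cache lookup is based on the arguments passed to the wrapped function, so a file whose contents change under the same path would presumably need the cache cleared (for example with memory.clear()) before reloading.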