Utilities for Developers

Scikit-learn contains a number of utilities to help with development. They are grouped into the categories below, and all of the following functions and classes are in the module sklearn.utils.

Warning

These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.

Validation Tools

These are tools used to check and validate input. When you write a function which accepts arrays, matrices, or sparse matrices as arguments, the following should be used when applicable.

assert_all_finite: Throw an error if array contains NaNs or Infs.
safe_asarray: Convert input to array or sparse matrix. Equivalent to np.asarray, but sparse matrices are passed through.
as_float_array: convert input to an array of floats. If a sparse matrix is passed, a sparse matrix will be returned.
array2d: equivalent to np.atleast_2d, but the order and dtype of the input are maintained.
atleast2d_or_csr: equivalent to array2d, but if a sparse matrix is passed, will convert to csr format. Also calls assert_all_finite.
check_arrays: check that all input arrays have consistent first dimensions. This will work for an arbitrary number of arrays.
warn_if_not_float: warn if the input is not of floating-point type. The input X is assumed to have a dtype attribute (X.dtype).
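
As a rough illustration of the kind of checks listed above, the following sketch uses plain NumPy only. The helper name validate_input_sketch is hypothetical and is not part of sklearn.utils; it merely combines a 2-d conversion with a finiteness check, under the assumption that the input is dense.

```python
import numpy as np


def validate_input_sketch(X):
    """Hypothetical helper (not in sklearn.utils).

    Converts X to a 2-d float array and rejects NaN or infinite entries.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if not np.isfinite(X).all():
        raise ValueError("Input contains NaN or infinity.")
    return X


X = validate_input_sketch([[1, 2], [3, 4]])
```

A function that accepts arrays could call such a helper on its arguments first, so that later code can rely on a finite, two-dimensional float array.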

If your code relies on a random number generator, it should never call functions like numpy.random.random or numpy.random.normal directly. These draw from NumPy's global random state, which can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function. The function check_random_state, below, can then be used to create a random number generator object.

check_random_state: create a np.random.RandomState object from a parameter random_state.
- If random_state is None or np.random, then a randomly-initialized RandomState object is returned.
- If random_state is an integer, then it is used to seed a new RandomState object.
- If random_state is a RandomState object, then it is passed through.

For example:

>>> from sklearn.utils import check_random_state
>>> random_state = 0
>>> random_state = check_random_state(random_state)
>>> random_state.rand(4)
array([ 0.5488135 ,  0.71518937,  0.60276338,  0.54488318])
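
The three rules above can be sketched with NumPy alone. This is an illustrative reimplementation under stated assumptions, not scikit-learn's source: check_random_state_sketch and noisy_copy are hypothetical names, and the error raised for unsupported inputs is a guess.

```python
import numbers

import numpy as np


def check_random_state_sketch(seed):
    """Hypothetical stand-in that follows the three rules listed above."""
    if seed is None or seed is np.random:
        # Randomly-initialized generator.
        return np.random.RandomState()
    if isinstance(seed, numbers.Integral):
        # An integer seeds a new generator.
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        # An existing generator is passed through unchanged.
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % (seed,))


def noisy_copy(values, random_state=None):
    """Hypothetical function that takes a random_state argument."""
    rng = check_random_state_sketch(random_state)
    values = np.asarray(values, dtype=float)
    return values + rng.normal(size=values.shape)


# The same integer seed gives the same result on every call.
a = noisy_copy([0.0, 0.0, 0.0], random_state=42)
b = noisy_copy([0.0, 0.0, 0.0], random_state=42)
```

Because the generator is created inside the function from its argument, a test can pass a fixed integer and get repeatable results without touching any global state.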

Efficient Linear Algebra & Array Operations

extmath.randomized_range_finder: construct an orthonormal matrix whose range approximates the range of the input. This is used in extmath.randomized_svd, below.
extmath.randomized_svd: compute the k-truncated randomized SVD. This algorithm finds an approximate truncated singular value decomposition, using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components.
arrayfuncs.cholesky_delete: (used in sklearn.linear_model.least_angle.lars_path) Remove an item from a cholesky factorization.
arrayfuncs.min_pos: (used in sklearn.linear_model.least_angle) Find the minimum of the positive values within an array.
extmath.norm: computes Euclidean (L2) vector norm by directly calling the BLAS nrm2 function. This is more stable than scipy.linalg.norm. See Fabian’s blog post for a discussion.
extmath.fast_logdet: efficiently compute the log of the determinant of a matrix.
extmath.density: efficiently compute the density of a sparse vector.
extmath.safe_sparse_dot: dot product which will correctly handle scipy.sparse inputs. If the inputs are dense, it is equivalent to numpy.dot.
extmath.logsumexp: compute the log of the sum of exponentials of X, that is, the sum of X when X is in the log domain. This is equivalent to calling np.log(np.sum(np.exp(X))), but is robust to overflow/underflow errors. Note that there is similar functionality in np.logaddexp.reduce, but because of the pairwise nature of this routine, it is slower for large arrays. Scipy has a similar routine in scipy.misc.logsumexp (in scipy versions < 0.10, this is found in scipy.maxentropy.logsumexp), but the scipy version does not accept an axis keyword.
extmath.weighted_mode: an extension of scipy.stats.mode which allows each item to have a real-valued weight.
resample: Resample arrays or sparse matrices in a consistent way. Used in shuffle, below.
shuffle: Shuffle arrays or sparse matrices in a consistent way. Used in sklearn.cluster.k_means.
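
To show why a dedicated log-sum-exp routine is more robust than the naive formula, here is a minimal pure-Python sketch of the usual max-shifting trick. The name logsumexp_sketch is hypothetical, and this is not necessarily how extmath.logsumexp is implemented.

```python
import math


def logsumexp_sketch(values):
    """Hypothetical helper: log(sum(exp(v))) without overflow."""
    m = max(values)
    if math.isinf(m):
        # All entries are -inf, or some entry is +inf.
        return m
    # After subtracting the maximum, every exponent is <= 0.
    return m + math.log(sum(math.exp(v - m) for v in values))


# math.exp(1000.0) alone would raise OverflowError.
result = logsumexp_sketch([1000.0, 1000.0])
```

Here the result equals 1000 + log(2), whereas evaluating the naive expression would fail because exp(1000) does not fit in a double.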

Efficient Routines for Sparse Matrices

The sklearn.utils.sparsefuncs Cython module hosts compiled extensions to efficiently process scipy.sparse data.

sparsefuncs.mean_variance_axis0: compute the means and variances along axis 0 of a CSR matrix. Used for normalizing the tolerance stopping criterion in sklearn.cluster.k_means_.KMeans.
sparsefuncs.inplace_csr_row_normalize_l1 and sparsefuncs.inplace_csr_row_normalize_l2: can be used to normalize individual sparse samples to unit l1 or l2 norm as done in sklearn.preprocessing.Normalizer.
sparsefuncs.inplace_csr_column_scale: can be used to multiply the columns of a CSR matrix by a constant scale (one scale per column). Used for scaling features to unit standard deviation in sklearn.preprocessing.Scaler.
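
The idea behind an in-place column scaling of a CSR matrix can be sketched in pure Python. The assumptions here are that plain lists stand in for the data and indices arrays of a scipy.sparse.csr_matrix and that csr_column_scale_sketch is a hypothetical name; the compiled routine in sparsefuncs may differ in detail.

```python
def csr_column_scale_sketch(data, indices, scale):
    """Hypothetical helper: scale stored CSR values by column, in place.

    data[k] is the k-th stored value and indices[k] is its column index.
    """
    for k, col in enumerate(indices):
        data[k] *= scale[col]


# CSR layout of the 2 x 3 matrix [[1, 0, 2],
#                                 [0, 3, 0]]
data = [1.0, 2.0, 3.0]
indices = [0, 2, 1]
indptr = [0, 2, 3]  # row pointers; not needed for column scaling

csr_column_scale_sketch(data, indices, scale=[10.0, 100.0, 1000.0])
```

Only the stored nonzero values are visited, which is why this kind of operation stays cheap on very sparse data.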

Graph Routines

graph.single_source_shortest_path_length: (not currently used in scikit-learn) Return the shortest path from a single source to all connected nodes on a graph. Code is adapted from networkx. If this is ever needed again, it would be far faster to use a single iteration of Dijkstra’s algorithm from graph_shortest_path.
graph.graph_laplacian: (used in sklearn.cluster.spectral.spectral_embedding) Return the Laplacian of a given graph. There is specialized code for both dense and sparse connectivity matrices.
graph_shortest_path.graph_shortest_path: (used in sklearn.manifold.Isomap) Return the shortest path between all pairs of connected points on a directed or undirected graph. Both the Floyd-Warshall algorithm and Dijkstra’s algorithm are available. The algorithm is most efficient when the connectivity matrix is a scipy.sparse.csr_matrix.
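
For intuition, a single-source shortest path length computation on an unweighted graph can be written as a breadth-first search. This sketch assumes the graph is given as a dictionary of adjacency lists and that path length means the number of edges; the function name is hypothetical and the code is not taken from sklearn.utils.graph.

```python
from collections import deque


def shortest_path_length_sketch(graph, source):
    """Hypothetical helper: hop counts from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Directed graph; node 4 points at 0 but cannot be reached from 0.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [0]}
lengths = shortest_path_length_sketch(graph, 0)
```

Nodes that cannot be reached from the source simply do not appear in the returned dictionary.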

Backports

fixes.Counter: (partial backport of collections.Counter from Python 2.7). Used in sklearn.feature_extraction.text.
fixes.unique: (backport of np.unique from numpy 1.4). Find the unique entries in an array. In numpy versions < 1.4, np.unique is less flexible. Used in sklearn.cross_validation.
fixes.copysign: (backport of np.copysign from numpy 1.4). Change the sign of x1 to that of x2, element-wise.
fixes.in1d: (backport of np.in1d from numpy 1.4). Test whether each element of an array is in a second array. Used in sklearn.datasets.twenty_newsgroups and sklearn.feature_extraction.image.
fixes.savemat: (backport of scipy.io.savemat from scipy 0.7.2). Save an array in MATLAB format. In earlier versions, the keyword oned_as is not available.
fixes.count_nonzero: (backport of np.count_nonzero from numpy 1.6). Count the nonzero elements of a matrix. Used in tests of sklearn.linear_model.
arrayfuncs.solve_triangular: (backport from scipy 0.9). Used in sklearn.linear_model.omp; independent backports exist in sklearn.mixture.gmm and sklearn.gaussian_process.
sparsetools.cs_graph_components: (backport of scipy.sparse.cs_graph_components from scipy 0.9). Used in sklearn.cluster.hierarchical, as well as in tests for sklearn.feature_extraction.
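
One common way to provide a backport is to try the upstream import first and fall back to a local implementation only when it is missing. The sketch below illustrates that pattern for Counter; the fallback class is a hypothetical, heavily simplified stand-in and is not the code in sklearn.utils.fixes.

```python
try:
    # Available from Python 2.7 onwards.
    from collections import Counter
except ImportError:
    class Counter(dict):
        """Hypothetical minimal fallback: counts hashable items."""

        def __init__(self, iterable=()):
            dict.__init__(self)
            for item in iterable:
                self[item] = self.get(item, 0) + 1


counts = Counter(["a", "b", "a"])
```

With this layout the fallback can be deleted as soon as the minimum supported version provides the upstream function.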

ARPACK

arpack.eigs: (backported from scipy.sparse.linalg.eigs in scipy 0.10). Sparse non-symmetric eigenvalue decomposition using the Arnoldi method. A limited version of eigs is available in earlier scipy versions.
arpack.eigsh: (backported from scipy.sparse.linalg.eigsh in scipy 0.10). Sparse symmetric (or complex Hermitian) eigenvalue decomposition using the Lanczos method. A limited version of eigsh is available in earlier scipy versions.
arpack.svds: (backported from scipy.sparse.linalg.svds in scipy 0.10). Sparse truncated singular value decomposition, built on the ARPACK eigensolvers. A limited version of svds is available in earlier scipy versions.

Benchmarking

bench.total_seconds: (backport of timedelta.total_seconds from Python 2.7). Used in benchmarks/bench_glm.py.
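
For reference, the Python documentation describes timedelta.total_seconds in terms of the days, seconds, and microseconds fields, so a backport can be sketched in a few lines. The function name below is hypothetical, and this is not necessarily the exact code in sklearn.utils.bench.

```python
from datetime import timedelta


def total_seconds_sketch(delta):
    """Hypothetical helper: duration of a timedelta in seconds, as a float."""
    micro = delta.microseconds + (delta.seconds + delta.days * 24 * 3600) * 10 ** 6
    return micro / 10.0 ** 6


elapsed = total_seconds_sketch(timedelta(days=1, seconds=30, microseconds=500000))
```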

Testing Functions

testing.assert_in: Compare string elements within lists. Used in sklearn.datasets tests.
mock_urllib2: Object which mocks the urllib2 module to fake requests of mldata. Used in tests of sklearn.datasets.

Helper Functions

gen_even_slices: generator to create n_packs slices of nearly equal size going up to n. Used in sklearn.decomposition.dict_learning and sklearn.cluster.k_means.
arraybuilder.ArrayBuilder: Helper class to incrementally build a 1-d numpy.ndarray. Currently used in sklearn.datasets._svmlight_format.pyx.
safe_mask: Helper function to convert a mask to the format expected by the numpy array or scipy sparse matrix on which to use it (sparse matrices support integer indices only while numpy arrays support both boolean masks and integer indices).
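
The behaviour of a slice generator such as gen_even_slices can be approximated as follows. This is a sketch under the assumption that the slices should be contiguous and differ in length by at most one; gen_even_slices_sketch is a hypothetical name, not the library function.

```python
def gen_even_slices_sketch(n, n_packs):
    """Hypothetical helper: yield n_packs contiguous slices covering range(n)."""
    start = 0
    for pack in range(n_packs):
        # The first (n % n_packs) slices get one extra element.
        size = n // n_packs + (1 if pack < n % n_packs else 0)
        if size > 0:
            yield slice(start, start + size)
            start += size


slices = list(gen_even_slices_sketch(10, 3))
```

For n = 10 and three packs this yields slices of length 4, 3, and 3, which can be used to process an array in batches.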

Hash Functions

murmurhash3_32 provides a Python wrapper for the MurmurHash3_x86_32 C++ non-cryptographic hash function. This hash function is suitable for implementing lookup tables, Bloom filters, Count-Min Sketch, feature hashing and implicitly defined sparse random projections:
```
>>> from sklearn.utils import murmurhash3_32
>>> murmurhash3_32("some feature", seed=0)
-384616559

>>> murmurhash3_32("some feature", seed=0, positive=True)
3910350737L
```
The sklearn.utils.murmurhash module can also be “cimported” from other Cython modules so as to benefit from the high performance of MurmurHash while skipping the overhead of the Python interpreter.
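
As an illustration of the feature hashing use case mentioned above, the sketch below maps tokens to a fixed number of buckets. To stay dependency-free it swaps in zlib.crc32 from the standard library as a stand-in hash; it does not call murmurhash3_32, and hash_features_sketch is a hypothetical name.

```python
import zlib


def hash_features_sketch(tokens, n_features=8):
    """Hypothetical helper: count tokens into n_features hashed buckets."""
    vec = [0] * n_features
    for token in tokens:
        # Mask to 32 bits so the hash value is always non-negative.
        h = zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF
        vec[h % n_features] += 1
    return vec


vec = hash_features_sketch(["spam", "ham", "spam"])
```

No vocabulary needs to be stored: the bucket of a token is determined by its hash alone, at the cost of occasional collisions between different tokens.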

Warnings and Exceptions

deprecated: Decorator to mark a function or class as deprecated.
ConvergenceWarning: Custom warning to catch convergence problems. Used in sklearn.covariance.graph_lasso.
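
A deprecation decorator of this kind can be sketched with the standard warnings module. The names deprecated_sketch, old_sum, and new_sum are hypothetical, and the real sklearn.utils.deprecated may differ in its signature and message format.

```python
import functools
import warnings


def deprecated_sketch(extra=""):
    """Hypothetical decorator factory that emits a DeprecationWarning."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            msg = "Function %s is deprecated" % func.__name__
            if extra:
                msg += "; " + extra
            # stacklevel=2 attributes the warning to the caller.
            warnings.warn(msg, category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_sketch("use new_sum instead")
def old_sum(a, b):
    return a + b


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_sum(1, 2)
```

The wrapped function still returns its normal result; the only visible change for callers is the warning issued on each call.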
