skutil.metrics module

skutil.metrics houses the pairwise kernel matrix functionality, which is built with Cython and behaves similarly to scikit-learn's pairwise functionality.

class skutil.metrics.GainsStatisticalReport(n_groups=10, n_folds=None, n_iter=None, score_by='lift', iid=True, error_score=nan, error_behavior='warn')

Bases: object

A class that computes actuarial statistics for scoring predictions given prescribed weighting and loss data. Primarily intended for use with skutil.h2o.H2OGainsRandomizedSearchCV.

Parameters:

n_groups : int, optional (default=10)

The number of groups to use for lift and gini computations.

score_by : str, optional (default='lift')

The metric to return for the score method.

n_folds : int, optional (default=None)

The number of folds that are being fit.

error_score : float, optional (default=np.nan)

The score to return for a pd.qcut error.

error_behavior : str, optional (default='warn')

One of {'warn', 'raise', 'ignore'}. How to handle non-unique bin edges in pd.qcut.
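As a rough illustration of what error_score and error_behavior guard against: pd.qcut raises a ValueError when the quantile bin edges are not unique. The sketch below is not the class's implementation; bin_or_error_score is a hypothetical helper that only mirrors the three documented behaviors.

```python
import warnings

import numpy as np
import pandas as pd


def bin_or_error_score(values, n_groups=10, error_score=np.nan,
                       error_behavior="warn"):
    """Hypothetical helper: bin values into quantile groups, or fall back."""
    try:
        # labels=False returns the integer group index of each value
        return pd.qcut(values, n_groups, labels=False)
    except ValueError as exc:  # raised for non-unique bin edges
        if error_behavior == "raise":
            raise
        if error_behavior == "warn":
            warnings.warn("pd.qcut failed: %s" % exc)
        return error_score


# Well-spread values bin cleanly into ten groups
groups = bin_or_error_score(np.arange(100), n_groups=10)

# Nearly constant values produce duplicate edges, so the fallback is returned
fallback = bin_or_error_score([1] * 9 + [2], n_groups=10,
                              error_behavior="ignore")
```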

Methods

`as_data_frame`()	Get the summary report of the fold fits in the form of a pd.DataFrame.
`fit_fold`(pred, expo, loss[, prem, store])	Used to fit a single fold of predicted values, exposure and loss data.
`score`(_, pred, **kwargs)	Scores the new predictions on the truth set, and stores the results in the internal stats array.
`score_no_store`(_, pred, **kwargs)	Scores the new predictions on the truth set, and does not store the results in the internal stats array.

as_data_frame()

Get the summary report of the fold fits in the form of a pd.DataFrame.

Returns:

df : pd.DataFrame

A dataframe of summary statistics for each fold

fit_fold(pred, expo, loss, prem=None, store=True)

Used to fit a single fold of predicted values, exposure and loss data.

Parameters:

pred : 1d H2OFrame, pd.DataFrame, np.ndarray

The array of predictions

expo : 1d H2OFrame, pd.DataFrame, np.ndarray

The array of exposure values

loss : 1d H2OFrame, pd.DataFrame, np.ndarray

The array of loss values

prem : 1d H2OFrame, pd.DataFrame, np.ndarray, optional (default=None)

The array of premium values. If None, is equal to the expo parameter.

store : bool, optional (default=True)

Whether or not to store the results of the scoring procedure. This is set to false when calling score, which is intended for test data.

Returns:

self :

score(_, pred, **kwargs)

Scores the new predictions on the truth set, and stores the results in the internal stats array.

Parameters:

_ : H2OFrame, np.ndarray

The truth set

pred : H2OFrame, np.ndarray

The predictions

Returns:

scr : float

The score (lift/gini) for the new predictions

score_no_store(_, pred, **kwargs)

Scores the new predictions on the truth set, and does not store the results in the internal stats array.

Parameters:

_ : H2OFrame, np.ndarray

The truth set

pred : H2OFrame, np.ndarray

The predictions

Returns:

scr : float

The score (lift/gini) for the new predictions

skutil.metrics.check_X_y(X, y, accept_sparse=None, dtype='numeric', order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, multi_output=False, ensure_min_samples=1, ensure_min_features=1, y_numeric=False, warn_on_dtype=False, estimator=None)

Input validation for standard estimators.

Checks X and y for consistent length, enforces X 2d and y 1d. Standard input checks are only applied to y, such as checking that y does not have np.nan or np.inf targets. For multi-label y, set multi_output=True to allow 2d and sparse y. If the dtype of X is object, attempt converting to float, raising on failure.

Parameters:

X : nd-array, list or sparse matrix

Input data.

y : nd-array, list or sparse matrix

Labels.

accept_sparse : string, list of string or None (default=None)

String[s] representing allowed sparse matrix formats, such as 'csc', 'csr', etc. None means that sparse matrix input will raise an error. If the input is sparse but not in the allowed format, it will be converted to the first listed format.

dtype : string, type, list of types or None (default="numeric")

Data type of result. If None, the dtype of the input is preserved. If "numeric", dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion to the first type is only performed if the dtype of the input is not in the list.

order : 'F', 'C' or None (default=None)

Whether an array will be forced to be Fortran or C-style.

copy : boolean (default=False)

Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

force_all_finite : boolean (default=True)

Whether to raise an error on np.inf and np.nan in X. This parameter does not influence whether y can have np.inf or np.nan values.

ensure_2d : boolean (default=True)

Whether to make X at least 2d.

allow_nd : boolean (default=False)

Whether to allow X.ndim > 2.

multi_output : boolean (default=False)

Whether to allow 2-d y (array or sparse matrix). If false, y will be validated as a vector. y cannot have np.nan or np.inf values if multi_output=True.

ensure_min_samples : int (default=1)

Make sure that X has a minimum number of samples in its first axis (rows for a 2D array).

ensure_min_features : int (default=1)

Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when X has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.

y_numeric : boolean (default=False)

Whether to ensure that y has a numeric type. If dtype of y is object, it is converted to float64. Should only be used for regression algorithms.

warn_on_dtype : boolean (default=False)

Raise DataConversionWarning if the dtype of the input data structure does not match the requested dtype, causing a memory copy.

estimator : str or estimator instance (default=None)

If passed, include the name of the estimator in warning messages.

Returns:

X_converted : object

The converted and validated X.

y_converted : object

The converted and validated y.
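The core checks described above (X is 2d, y is 1d, both have the same length, values are finite) can be sketched in plain NumPy. This is a simplified stand-in, not the function documented here: validate_X_y is a hypothetical name, it assumes numeric y, and it ignores sparse input, dtype handling and the remaining options.

```python
import numpy as np


def validate_X_y(X, y, force_all_finite=True):
    """Hypothetical, simplified stand-in for the checks described above."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be 2d, got %dd" % X.ndim)
    if y.ndim != 1:
        raise ValueError("y must be 1d, got %dd" % y.ndim)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y have inconsistent lengths")
    if force_all_finite and not np.isfinite(X).all():
        raise ValueError("X contains NaN or infinity")
    if not np.isfinite(y).all():  # y is checked regardless of the flag
        raise ValueError("y contains NaN or infinity")
    return X, y


# Lists are converted to arrays when every check passes
X_ok, y_ok = validate_X_y([[1, 2], [3, 4]], [0, 1])
```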

skutil.metrics.check_array(array, accept_sparse=None, dtype='numeric', order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator=None)

Input validation on an array, list, sparse matrix or similar.

By default, the input is converted to an at least 2D numpy array. If the dtype of the array is object, attempt converting to float, raising on failure.

Parameters:

array : object

Input object to check / convert.

accept_sparse : string, list of string or None (default=None)

String[s] representing allowed sparse matrix formats, such as 'csc', 'csr', etc. None means that sparse matrix input will raise an error. If the input is sparse but not in the allowed format, it will be converted to the first listed format.

dtype : string, type, list of types or None (default="numeric")

Data type of result. If None, the dtype of the input is preserved. If "numeric", dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion to the first type is only performed if the dtype of the input is not in the list.

order : 'F', 'C' or None (default=None)

Whether an array will be forced to be Fortran or C-style. When order is None (default), then if copy=False, nothing is ensured about the memory layout of the output array; otherwise (copy=True) the memory layout of the returned array is kept as close as possible to the original array.

copy : boolean (default=False)

Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

force_all_finite : boolean (default=True)

Whether to raise an error on np.inf and np.nan in X.

ensure_2d : boolean (default=True)

Whether to make X at least 2d.

allow_nd : boolean (default=False)

Whether to allow X.ndim > 2.

ensure_min_samples : int (default=1)

Make sure that the array has a minimum number of samples in its first axis (rows for a 2D array). Setting to 0 disables this check.

ensure_min_features : int (default=1)

Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when the input data has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.

warn_on_dtype : boolean (default=False)

Raise DataConversionWarning if the dtype of the input data structure does not match the requested dtype, causing a memory copy.

estimator : str or estimator instance (default=None)

If passed, include the name of the estimator in warning messages.

Returns:

X_converted : object

The converted and validated X.
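The default conversion described above (object dtype is cast to float, and the result is at least 2D) can be approximated with NumPy alone. This is only a sketch of that one behavior; to_numeric_2d is a hypothetical helper and does not reproduce the other options of the function.

```python
import numpy as np


def to_numeric_2d(array):
    """Hypothetical helper: cast object arrays to float and ensure ndim >= 2."""
    arr = np.asarray(array)
    if arr.dtype == object:
        # astype raises ValueError if an element cannot be read as a float
        arr = arr.astype(float)
    return np.atleast_2d(arr)


mixed = np.array([1, "2"], dtype=object)
converted = to_numeric_2d(mixed)
```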

skutil.metrics.exponential_kernel(X, Y=None, sigma=1.0)

The exponential_kernel is closely related to the gaussian_kernel, with only the square of the norm left out. It is also an rbf_kernel. Note that the adjustable parameter, sigma, plays a major role in the performance of the kernel and should be carefully tuned. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.

The kernel is given by:

\(k(x, y) = \exp\left( -\frac{\|x-y\|}{2\sigma^2} \right)\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

sigma : float, optional (default=1.0)

The exponential tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
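The formula for the exponential kernel can be written out in a few lines of NumPy. This is a reference sketch of the math only, not the library's Cython implementation; pairwise_norms and exponential_kernel_sketch are hypothetical names, and the sketch assumes the result is the full matrix of pairwise kernel values.

```python
import numpy as np


def pairwise_norms(X, Y=None):
    """Euclidean distance between every row of X and every row of Y."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def exponential_kernel_sketch(X, Y=None, sigma=1.0):
    # exp(-||x - y|| / (2 * sigma^2)): the norm is not squared
    return np.exp(-pairwise_norms(X, Y) / (2.0 * sigma ** 2))


X = np.array([[0.0, 0.0], [3.0, 4.0]])  # the two rows are 5 units apart
K = exponential_kernel_sketch(X, sigma=1.0)
```

Here the off-diagonal entries equal exp(-5 / 2) and the diagonal is 1, since each row has zero distance to itself.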

skutil.metrics.gaussian_kernel(X, Y=None, sigma=1.0)

The gaussian_kernel is closely related to the exponential_kernel. It is also an rbf_kernel. Note that the adjustable parameter, sigma, plays a major role in the performance of the kernel and should be carefully tuned. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.

The kernel is given by:

\(k(x, y) = \exp\left( -\frac{\|x-y\|^2}{2\sigma^2} \right)\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

sigma : float, optional (default=1.0)

The exponential tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
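The sensitivity to sigma described above is easy to see numerically. The sketch below evaluates the Gaussian formula with NumPy for a very large and a very small sigma; gaussian_kernel_sketch is a hypothetical stand-in written from the formula, not the library's implementation.

```python
import numpy as np


def gaussian_kernel_sketch(X, sigma=1.0):
    """Pairwise exp(-||x - y||^2 / (2 * sigma^2)) over the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq_norms = (diff ** 2).sum(axis=-1)
    return np.exp(-sq_norms / (2.0 * sigma ** 2))


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Overestimated sigma: every pair looks almost identical (values near 1)
K_wide = gaussian_kernel_sketch(X, sigma=100.0)

# Underestimated sigma: each point only resembles itself (near identity)
K_narrow = gaussian_kernel_sketch(X, sigma=0.1)
```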

skutil.metrics.inverse_multiquadric_kernel(X, Y=None, constant=1.0)

The inverse_multiquadric_kernel, as with the gaussian_kernel, results in a kernel matrix with full rank (Micchelli, 1986) and thus forms an infinite-dimensional feature space.

The kernel is given by:

\(k(x, y) = \frac{1}{\sqrt{\|x-y\|^2 + c^2}}\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

constant : float, optional (default=1.0)

The linear tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
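A small numerical check of the full-rank property, assuming the standard inverse multiquadric form 1 / sqrt(||x-y||^2 + c^2) given in the cited reference. The function name is hypothetical and the code is a NumPy sketch of the formula, not the library's implementation.

```python
import numpy as np


def inverse_multiquadric_sketch(X, constant=1.0):
    """Pairwise 1 / sqrt(||x - y||^2 + c^2) over the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq_norms = (diff ** 2).sum(axis=-1)
    return 1.0 / np.sqrt(sq_norms + constant ** 2)


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
K = inverse_multiquadric_sketch(X, constant=1.0)

rank = np.linalg.matrix_rank(K)      # full rank for distinct points
eigenvalues = np.linalg.eigvalsh(K)  # all positive for this example
```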

skutil.metrics.laplace_kernel(X, Y=None, sigma=1.0)

The laplace_kernel is completely equivalent to the exponential_kernel, except for being less sensitive to changes in the sigma parameter. Being equivalent, it is also an rbf_kernel.

The kernel is given by:

\(k(x, y) = \exp\left( -\frac{\|x-y\|}{\sigma} \right)\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

sigma : float, optional (default=1.0)

The exponential tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
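The equivalence with the exponential kernel follows directly from the two formulas: a Laplace kernel with parameter 2 * sigma^2 gives the same values as an exponential kernel with parameter sigma. The NumPy sketch below checks this; both function names are hypothetical and written from the formulas, not taken from the library.

```python
import numpy as np


def pairwise_norms(X):
    """Euclidean distance between every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def laplace_sketch(X, sigma=1.0):
    return np.exp(-pairwise_norms(X) / sigma)


def exponential_sketch(X, sigma=1.0):
    return np.exp(-pairwise_norms(X) / (2.0 * sigma ** 2))


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
sigma = 1.5

K_exponential = exponential_sketch(X, sigma=sigma)
K_laplace = laplace_sketch(X, sigma=2.0 * sigma ** 2)  # rescaled parameter
```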

skutil.metrics.linear_kernel(X, Y=None, constant=0.0)

The linear_kernel is the simplest kernel function. It is given by the inner product <x,y> plus an optional constant parameter. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts; for example, KPCA with a linear_kernel is the same as standard PCA.

The kernel is given by:

\(k(x, y) = x^T y + c\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

constant : float, optional (default=0.0)

The linear tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
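Written with NumPy, the linear kernel formula is a matrix product plus a constant. The function below is a hypothetical sketch of the formula (not the library's implementation) and assumes the pairwise matrix has one row per sample of X and one column per sample of Y.

```python
import numpy as np


def linear_kernel_sketch(X, Y=None, constant=0.0):
    """Pairwise inner products x^T y + c."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    return X @ Y.T + constant


X = np.array([[1.0, 2.0], [3.0, 4.0]])

K_self = linear_kernel_sketch(X)                      # X against itself
K_cross = linear_kernel_sketch(X, Y=[[1.0, 0.0]], constant=1.0)
```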

skutil.metrics.multiquadric_kernel(X, Y=None, constant=0.0)

The multiquadric_kernel can be used in the same situations as the Rational Quadratic kernel. As is the case with the Sigmoid kernel, it is also an example of a non-positive definite kernel.

The kernel is given by:

\(k(x, y) = \sqrt{\|x-y\|^2 + c^2}\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

constant : float, optional (default=0.0)

The linear tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
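The lack of positive definiteness can be seen on a two-point example. The sketch assumes the standard multiquadric form sqrt(||x-y||^2 + c^2) from the cited reference, which may differ from what the library computes; multiquadric_sketch is a hypothetical name.

```python
import numpy as np


def multiquadric_sketch(X, constant=0.0):
    """Pairwise sqrt(||x - y||^2 + c^2) over the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq_norms = (diff ** 2).sum(axis=-1)
    return np.sqrt(sq_norms + constant ** 2)


X = np.array([[0.0, 0.0], [3.0, 4.0]])  # the two rows are 5 units apart

K = multiquadric_sketch(X, constant=0.0)  # reduces to the plain distance
eigenvalues = np.linalg.eigvalsh(K)       # one negative, one positive
```

With constant=0.0 the matrix is [[0, 5], [5, 0]], whose eigenvalues are -5 and 5, so it is not positive definite.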

skutil.metrics.polynomial_kernel(X, Y=None, alpha=1.0, degree=1.0, constant=1.0)

The polynomial_kernel is a non-stationary kernel. Polynomial kernels are well suited for problems where all the training data is normalized. Adjustable parameters are the slope (alpha), the constant term (constant), and the polynomial degree (degree).

The kernel is given by:

\(k(x, y) = (\alpha x^T y + c)^d\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

alpha : float, optional (default=1.0)

The slope tuning parameter.

degree : float, optional (default=1.0)

The polynomial degree tuning parameter.

constant : float, optional (default=1.0)

The linear tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
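To make the roles of alpha, constant and degree concrete, the formula can be evaluated with NumPy on a tiny input. polynomial_kernel_sketch is a hypothetical function written from the formula above, not the library's implementation.

```python
import numpy as np


def polynomial_kernel_sketch(X, Y=None, alpha=1.0, degree=1.0, constant=1.0):
    """Pairwise (alpha * x^T y + c) ** d."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    return (alpha * (X @ Y.T) + constant) ** degree


X = np.array([[1.0, 2.0], [3.0, 4.0]])

# Inner products are [[5, 11], [11, 25]]; add 1, then square
K = polynomial_kernel_sketch(X, alpha=1.0, degree=2.0, constant=1.0)
```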

skutil.metrics.power_kernel(X, Y=None, degree=1.0)

The power_kernel is also known as the (unrectified) triangular kernel. It is an example of a scale-invariant kernel (Sahbi and Fleuret, 2004) and is also only conditionally positive definite.

The kernel is given by:

\(k(x, y) = -\|x-y\|^d\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

degree : float, optional (default=1.0)

The polynomial degree tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
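A NumPy sketch of the power kernel formula, for illustration only. power_kernel_sketch is a hypothetical name; the code follows the formula above and is not the library's implementation.

```python
import numpy as np


def power_kernel_sketch(X, Y=None, degree=1.0):
    """Pairwise -(||x - y|| ** d)."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    diff = X[:, None, :] - Y[None, :, :]
    norms = np.sqrt((diff ** 2).sum(axis=-1))
    return -(norms ** degree)


X = np.array([[0.0, 0.0], [3.0, 4.0]])  # the two rows are 5 units apart

K_linear = power_kernel_sketch(X, degree=1.0)  # off-diagonal: -5
K_square = power_kernel_sketch(X, degree=2.0)  # off-diagonal: -25
```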

skutil.metrics.rbf_kernel(X, Y=None, sigma=1.0)

The rbf_kernel is closely related to the exponential_kernel and gaussian_kernel. Note that the adjustable parameter, sigma, plays a major role in the performance of the kernel and should be carefully tuned. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.

The kernel is given by:

\(k(x, y) = \exp(-\gamma \|x-y\|^2)\)

where:

\(\gamma = 1/\sigma^2\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

sigma : float, optional (default=1.0)

The exponential tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
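Taking the documented formulas at face value, the RBF kernel with parameter sqrt(2) * sigma matches the Gaussian kernel with parameter sigma, because gamma = 1 / (2 * sigma^2) in that case. The sketch below checks this with NumPy; both functions are hypothetical and written from the formulas, not from the library's code.

```python
import numpy as np


def squared_norms(X):
    """Squared Euclidean distance between every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)


def rbf_sketch(X, sigma=1.0):
    gamma = 1.0 / sigma ** 2
    return np.exp(-gamma * squared_norms(X))


def gaussian_sketch(X, sigma=1.0):
    return np.exp(-squared_norms(X) / (2.0 * sigma ** 2))


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
sigma = 1.5

K_gaussian = gaussian_sketch(X, sigma=sigma)
K_rbf = rbf_sketch(X, sigma=np.sqrt(2.0) * sigma)  # rescaled parameter
```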

skutil.metrics.spline_kernel(X, Y=None)

The spline_kernel is given as a piece-wise cubic polynomial, as derived in the works by Gunn (1998).

The kernel is given by:

\(k(x, y) = 1 + xy + xy \min(x,y) - \frac{x+y}{2} \min(x,y)^2 + \frac{1}{3} \min(x,y)^3\)

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Returns:

res : float

The result of the kernel computation.
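The spline formula is stated for scalar x and y. The sketch below evaluates it for every pair drawn from two 1-D sequences; how the library combines multiple features is not stated here, so this covers the scalar case only. spline_kernel_1d is a hypothetical name.

```python
import numpy as np


def spline_kernel_1d(x, y):
    """Evaluate the spline formula for every pair of scalars in x and y."""
    x = np.asarray(x, dtype=float)[:, None]  # column of x values
    y = np.asarray(y, dtype=float)[None, :]  # row of y values
    m = np.minimum(x, y)
    return 1.0 + x * y + x * y * m - 0.5 * (x + y) * m ** 2 + m ** 3 / 3.0


K = spline_kernel_1d([1.0, 2.0], [1.0, 2.0])
```

For x = 1 and y = 2 the terms are 1 + 2 + 2 - 3/2 + 1/3, which sums to 23/6.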

skutil.metrics.tanh_kernel(X, Y=None, constant=0.0, alpha=1.0)

The tanh_kernel (Hyperbolic Tangent Kernel) is also known as the Sigmoid Kernel and as the Multilayer Perceptron (MLP) kernel. The Sigmoid Kernel comes from the neural networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons.

The kernel is given by:

\(k(x, y) = \tanh(\alpha x^T y + c)\)

It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. This kernel was quite popular for support vector machines due to its origin in neural network theory. Also, despite being only conditionally positive definite, it has been found to perform well in practice.

There are two adjustable parameters in the sigmoid kernel: the slope alpha and the intercept constant. A common value for alpha is 1/N, where N is the data dimension. A more detailed study on sigmoid kernels can be found in the works by Hsuan-Tien Lin and Chih-Jen Lin.

Parameters:

X : array_like (float), shape=(n_samples, n_features)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

Y : array_like (float), shape=(n_samples, n_features), optional (default=None)

The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.

constant : float, optional (default=0.0)

The linear tuning parameter.

alpha : float, optional (default=1.0)

The slope tuning parameter.

Returns:

c : float

The result of the kernel computation.

References

Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
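The 1/N rule of thumb for alpha can be sketched as follows, reading N as the number of features. tanh_kernel_sketch is a hypothetical NumPy version of the formula above, not the library's implementation.

```python
import numpy as np


def tanh_kernel_sketch(X, Y=None, constant=0.0, alpha=1.0):
    """Pairwise tanh(alpha * x^T y + c)."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    return np.tanh(alpha * (X @ Y.T) + constant)


X = np.array([[1.0, 2.0], [3.0, 4.0]])

alpha = 1.0 / X.shape[1]  # 1/N with N = 2 features
K = tanh_kernel_sketch(X, alpha=alpha)
```

Scaling by 1/N keeps the argument of tanh from growing with the number of features, so fewer entries saturate at 1.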