skutil.metrics module
skutil.metrics houses the pairwise kernel matrix functionality, which is built with Cython and behaves similarly to scikit-learn's pairwise kernel functions.
-
class skutil.metrics.GainsStatisticalReport(n_groups=10, n_folds=None, n_iter=None, score_by='lift', iid=True, error_score=nan, error_behavior='warn')
Bases: object
A class that computes actuarial statistics for scoring predictions given prescribed weighting and loss data. Primarily intended for use with skutil.h2o.H2OGainsRandomizedSearchCV.
Parameters: n_groups : int, optional (default=10)
The number of groups to use for lift and gini computations.
score_by : str, optional (default=’lift’)
The metric to return for the score method.
n_folds : int, optional (default=None)
The number of folds that are being fit.
error_score : float, optional (default=np.nan)
The score to return for a pd.qcut error.
error_behavior : str, optional (default=’warn’)
One of {‘warn’, ‘raise’, ‘ignore’}. How to handle non-unique bin edges in pd.qcut.
Methods
as_data_frame(): Get the summary report of the fold fits in the form of a pd.DataFrame.
fit_fold(pred, expo, loss[, prem, store]): Used to fit a single fold of predicted values, exposure and loss data.
score(_, pred, **kwargs): Scores the new predictions on the truth set, and stores the results in the internal stats array.
score_no_store(_, pred, **kwargs): Scores the new predictions on the truth set, and does not store the results in the internal stats array.
-
as_data_frame()
Get the summary report of the fold fits in the form of a pd.DataFrame.
Returns: df : pd.DataFrame
A dataframe of summary statistics for each fold
-
fit_fold(pred, expo, loss, prem=None, store=True)
Used to fit a single fold of predicted values, exposure and loss data.
Parameters: pred : 1d H2OFrame, pd.DataFrame, np.ndarray
The array of predictions
expo : 1d H2OFrame, pd.DataFrame, np.ndarray
The array of exposure values
loss : 1d H2OFrame, pd.DataFrame, np.ndarray
The array of loss values
prem : 1d H2OFrame, pd.DataFrame, np.ndarray, optional (default=None)
The array of premium values. If None, it is set equal to the expo parameter.
store : bool, optional (default=True)
Whether or not to store the results of the scoring procedure. This is set to False when calling score, which is intended for test data.
Returns: self
-
score(_, pred, **kwargs)
Scores the new predictions on the truth set, and stores the results in the internal stats array.
Parameters: _ : H2OFrame, np.ndarray
The truth set
pred : H2OFrame, np.ndarray
The predictions
Returns: scr : float
The score (lift/gini) for the new predictions
-
score_no_store(_, pred, **kwargs)
Scores the new predictions on the truth set, and does not store the results in the internal stats array.
Parameters: _ : H2OFrame, np.ndarray
The truth set
pred : H2OFrame, np.ndarray
The predictions
Returns: scr : float
The score (lift/gini) for the new predictions
-
-
skutil.metrics.check_X_y(X, y, accept_sparse=None, dtype='numeric', order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, multi_output=False, ensure_min_samples=1, ensure_min_features=1, y_numeric=False, warn_on_dtype=False, estimator=None)
Input validation for standard estimators.
Checks X and y for consistent length, enforces X 2d and y 1d. Standard input checks are only applied to y, such as checking that y does not have np.nan or np.inf targets. For multi-label y, set multi_output=True to allow 2d and sparse y. If the dtype of X is object, attempt converting to float, raising on failure.
Parameters: X : nd-array, list or sparse matrix
Input data.
y : nd-array, list or sparse matrix
Labels.
accept_sparse : string, list of string or None (default=None)
String[s] representing allowed sparse matrix formats, such as ‘csc’, ‘csr’, etc. None means that sparse matrix input will raise an error. If the input is sparse but not in the allowed format, it will be converted to the first listed format.
dtype : string, type, list of types or None (default=”numeric”)
Data type of result. If None, the dtype of the input is preserved. If “numeric”, dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.
order : ‘F’, ‘C’ or None (default=None)
Whether an array will be forced to be fortran or c-style.
copy : boolean (default=False)
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
force_all_finite : boolean (default=True)
Whether to raise an error on np.inf and np.nan in X. This parameter does not influence whether y can have np.inf or np.nan values.
ensure_2d : boolean (default=True)
Whether to make X at least 2d.
allow_nd : boolean (default=False)
Whether to allow X.ndim > 2.
multi_output : boolean (default=False)
Whether to allow 2-d y (array or sparse matrix). If false, y will be validated as a vector. y cannot have np.nan or np.inf values if multi_output=True.
ensure_min_samples : int (default=1)
Make sure that X has a minimum number of samples in its first axis (rows for a 2D array).
ensure_min_features : int (default=1)
Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when X has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.
y_numeric : boolean (default=False)
Whether to ensure that y has a numeric type. If dtype of y is object, it is converted to float64. Should only be used for regression algorithms.
warn_on_dtype : boolean (default=False)
Raise DataConversionWarning if the dtype of the input data structure does not match the requested dtype, causing a memory copy.
estimator : str or estimator instance (default=None)
If passed, include the name of the estimator in warning messages.
Returns: X_converted : object
The converted and validated X.
y_converted : object
The converted and validated y.
-
skutil.metrics.check_array(array, accept_sparse=None, dtype='numeric', order=None, copy=False, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator=None)
Input validation on an array, list, sparse matrix or similar.
By default, the input is converted to an at least 2D numpy array. If the dtype of the array is object, attempt converting to float, raising on failure.
Parameters: array : object
Input object to check / convert.
accept_sparse : string, list of string or None (default=None)
String[s] representing allowed sparse matrix formats, such as ‘csc’, ‘csr’, etc. None means that sparse matrix input will raise an error. If the input is sparse but not in the allowed format, it will be converted to the first listed format.
dtype : string, type, list of types or None (default=”numeric”)
Data type of result. If None, the dtype of the input is preserved. If “numeric”, dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.
order : ‘F’, ‘C’ or None (default=None)
Whether an array will be forced to be fortran or c-style. When order is None (default), then if copy=False, nothing is ensured about the memory layout of the output array; otherwise (copy=True) the memory layout of the returned array is kept as close as possible to the original array.
copy : boolean (default=False)
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
force_all_finite : boolean (default=True)
Whether to raise an error on np.inf and np.nan in X.
ensure_2d : boolean (default=True)
Whether to make X at least 2d.
allow_nd : boolean (default=False)
Whether to allow X.ndim > 2.
ensure_min_samples : int (default=1)
Make sure that the array has a minimum number of samples in its first axis (rows for a 2D array). Setting to 0 disables this check.
ensure_min_features : int (default=1)
Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when the input data has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.
warn_on_dtype : boolean (default=False)
Raise DataConversionWarning if the dtype of the input data structure does not match the requested dtype, causing a memory copy.
estimator : str or estimator instance (default=None)
If passed, include the name of the estimator in warning messages.
Returns: X_converted : object
The converted and validated X.
-
skutil.metrics.exponential_kernel(X, Y=None, sigma=1.0)
The exponential_kernel is closely related to the gaussian_kernel, with only the square of the norm left out. It is also an rbf_kernel. Note that the adjustable parameter, sigma, plays a major role in the performance of the kernel and should be carefully tuned. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.
The kernel is given by:
\(k(x, y) = \exp( -||x-y|| / (2\sigma^2) )\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
sigma : float, optional (default=1.0)
The exponential tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
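Examples
The formula above can be sketched in plain Python, without skutil or Cython. The helper name exponential_kernel_sketch is hypothetical, and returning a full pairwise matrix (one row per sample in X) is an assumption based on the module description, not a statement about the library's actual return type.

```python
import math


def exponential_kernel_sketch(X, Y=None, sigma=1.0):
    """Pairwise k(x, y) = exp(-||x - y|| / (2 * sigma**2)) over lists of rows.

    Hypothetical helper for illustration only; it is not skutil's code.
    """
    if Y is None:
        Y = X  # mirror the documented default: compare X with itself
    scale = 2.0 * sigma ** 2
    return [[math.exp(-math.dist(x, y) / scale) for y in Y] for x in X]


X = [[0.0, 0.0], [3.0, 4.0]]  # the two rows are 5 units apart
K = exponential_kernel_sketch(X, sigma=1.0)
# The diagonal is exp(0) = 1; the off-diagonal entries are exp(-5 / 2).
```

Because the distance is not squared, the kernel decays more slowly with distance than the Gaussian form for points that are far apart.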
-
skutil.metrics.gaussian_kernel(X, Y=None, sigma=1.0)
The gaussian_kernel is closely related to the exponential_kernel. It is also an rbf_kernel. Note that the adjustable parameter, sigma, plays a major role in the performance of the kernel and should be carefully tuned. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.
The kernel is given by:
\(k(x, y) = \exp( -||x-y||^2 / (2\sigma^2) )\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
sigma : float, optional (default=1.0)
The exponential tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
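Examples
The remarks on tuning sigma can be illustrated with a small plain-Python sketch of the formula. The helper gaussian_kernel_sketch and the chosen sigma values are hypothetical and serve only to show the two extremes; this is not the Cython implementation.

```python
import math


def gaussian_kernel_sketch(X, Y=None, sigma=1.0):
    """Pairwise k(x, y) = exp(-||x - y||**2 / (2 * sigma**2)) over lists of rows."""
    if Y is None:
        Y = X
    scale = 2.0 * sigma ** 2
    return [[math.exp(-math.dist(x, y) ** 2 / scale) for y in Y] for x in X]


X = [[0.0], [1.0], [2.0]]

# Overestimated sigma: every pair looks almost identical (entries close to 1),
# so the kernel carries little non-linear information.
K_wide = gaussian_kernel_sketch(X, sigma=100.0)

# Underestimated sigma: distinct points look unrelated (off-diagonals close
# to 0), so a model can only memorise individual training points.
K_narrow = gaussian_kernel_sketch(X, sigma=0.01)
```

In practice sigma is usually chosen by cross-validation over a range between these extremes.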
-
skutil.metrics.inverse_multiquadric_kernel(X, Y=None, constant=1.0)
The inverse_multiquadric_kernel, as with the gaussian_kernel, results in a kernel matrix with full rank (Micchelli, 1986) and thus forms an infinite-dimensional feature space.
The kernel is given by:
\(k(x, y) = 1 / \sqrt{ ||x-y||^2 + c^2 }\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
constant : float, optional (default=1.0)
The linear tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
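Examples
A minimal plain-Python sketch of the inverse multiquadric form, assuming the standard definition 1 / sqrt(||x - y||^2 + c^2) given in the Souza reference. The helper name is hypothetical and the code is not taken from skutil.

```python
import math


def inverse_multiquadric_sketch(X, Y=None, constant=1.0):
    """Pairwise k(x, y) = 1 / sqrt(||x - y||**2 + constant**2) over lists of rows.

    With constant=0 and two identical rows the denominator is zero, so a
    non-zero constant is assumed here.
    """
    if Y is None:
        Y = X
    c2 = constant ** 2
    return [[1.0 / math.sqrt(math.dist(x, y) ** 2 + c2) for y in Y] for x in X]


X = [[0.0, 0.0], [3.0, 4.0]]  # the two rows are 5 units apart
K = inverse_multiquadric_sketch(X, constant=1.0)
# Diagonal: 1 / sqrt(0 + 1) = 1. Off-diagonal: 1 / sqrt(25 + 1).
```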
-
skutil.metrics.laplace_kernel(X, Y=None, sigma=1.0)
The laplace_kernel is completely equivalent to the exponential_kernel, except for being less sensitive to changes in the sigma parameter. Being equivalent, it is also an rbf_kernel.
The kernel is given by:
\(k(x, y) = \exp( -||x-y|| / \sigma )\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
sigma : float, optional (default=1.0)
The exponential tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
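Examples
The equivalence with the exponential kernel follows from the two formulas: a Laplace kernel with parameter 2 * s**2 gives the same value as an exponential kernel with sigma = s. The sketch below checks this numerically for one pair of points; both helper names are hypothetical and neither is skutil code.

```python
import math


def laplace_sketch(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y|| / sigma) for two points given as lists."""
    return math.exp(-math.dist(x, y) / sigma)


def exponential_sketch(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y|| / (2 * sigma**2)) for two points given as lists."""
    return math.exp(-math.dist(x, y) / (2.0 * sigma ** 2))


x, y = [0.0, 0.0], [3.0, 4.0]
s = 1.5

# Re-parameterising sigma maps one kernel onto the other.
same = math.isclose(
    laplace_sketch(x, y, sigma=2.0 * s ** 2),
    exponential_sketch(x, y, sigma=s),
)
```

Since sigma enters the Laplace form linearly instead of squared, a given change in sigma moves the kernel values less, which is the reduced sensitivity mentioned above.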
-
skutil.metrics.linear_kernel(X, Y=None, constant=0.0)
The linear_kernel is the simplest kernel function. It is given by the inner product <x,y> plus an optional constant parameter. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, e.g. KPCA with a linear_kernel is the same as standard PCA.
The kernel is given by:
\(k(x, y) = x^Ty + c\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
constant : float, optional (default=0.0)
The linear tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
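Examples
As a rough illustration, the linear kernel over a set of rows is the Gram matrix of inner products plus the constant. The helper below is a hypothetical plain-Python sketch, not the library's implementation.

```python
def linear_kernel_sketch(X, Y=None, constant=0.0):
    """Pairwise k(x, y) = <x, y> + constant over lists of rows."""
    if Y is None:
        Y = X
    return [
        [sum(a * b for a, b in zip(x, y)) + constant for y in Y]
        for x in X
    ]


X = [[1.0, 2.0], [3.0, 4.0]]
K = linear_kernel_sketch(X)
# K == [[5.0, 11.0], [11.0, 25.0]]

K_shifted = linear_kernel_sketch(X, constant=1.0)
# Every entry is shifted by the constant: [[6.0, 12.0], [12.0, 26.0]]
```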
-
skutil.metrics.multiquadric_kernel(X, Y=None, constant=0.0)
The multiquadric_kernel can be used in the same situations as the Rational Quadratic kernel. As is the case with the Sigmoid kernel, it is also an example of a non-positive definite kernel.
The kernel is given by:
\(k(x, y) = \sqrt{ ||x-y||^2 + c^2 }\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
constant : float, optional (default=0.0)
The linear tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
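Examples
The claim that this kernel is not positive definite can be checked on a two-point example. The sketch assumes the standard form sqrt(||x - y||^2 + c^2) from the Souza reference; the helper name is hypothetical and this is not skutil's code.

```python
import math


def multiquadric_sketch(x, y, constant=0.0):
    """k(x, y) = sqrt(||x - y||**2 + constant**2) for two points given as lists."""
    return math.sqrt(math.dist(x, y) ** 2 + constant ** 2)


a, b = [0.0, 0.0], [3.0, 4.0]  # 5 units apart

# 2x2 kernel matrix for the two points, with the default constant of 0.
K = [
    [multiquadric_sketch(a, a), multiquadric_sketch(a, b)],
    [multiquadric_sketch(b, a), multiquadric_sketch(b, b)],
]

# A symmetric 2x2 matrix with a negative determinant has one positive and one
# negative eigenvalue, so it cannot be positive semi-definite.
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
```

Here K is [[0, 5], [5, 0]], whose determinant is -25.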
-
skutil.metrics.polynomial_kernel(X, Y=None, alpha=1.0, degree=1.0, constant=1.0)
The polynomial_kernel is a non-stationary kernel. Polynomial kernels are well suited for problems where all the training data is normalized. Adjustable parameters are the slope (alpha), the constant term (constant), and the polynomial degree (degree).
The kernel is given by:
\(k(x, y) = ( \alpha x^Ty + c)^d\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
alpha : float, optional (default=1.0)
The slope tuning parameter.
degree : float, optional (default=1.0)
The polynomial degree tuning parameter.
constant : float, optional (default=1.0)
The linear tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
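Examples
A small plain-Python sketch of the polynomial form, showing how the three parameters interact. The helper name is hypothetical; the parameter names and defaults follow the signature above, but the code is an illustration and not the Cython implementation.

```python
def polynomial_kernel_sketch(X, Y=None, alpha=1.0, degree=1.0, constant=1.0):
    """Pairwise k(x, y) = (alpha * <x, y> + constant) ** degree over lists of rows."""
    if Y is None:
        Y = X
    return [
        [(alpha * sum(a * b for a, b in zip(x, y)) + constant) ** degree for y in Y]
        for x in X
    ]


X = [[1.0, 2.0], [3.0, 4.0]]

# Quadratic kernel: (<x, y> + 1) ** 2
K_quad = polynomial_kernel_sketch(X, alpha=1.0, degree=2.0, constant=1.0)
# K_quad == [[36.0, 144.0], [144.0, 676.0]]

# With degree=1 and constant=0 the polynomial kernel reduces to the inner product.
K_lin = polynomial_kernel_sketch(X, alpha=1.0, degree=1.0, constant=0.0)
# K_lin == [[5.0, 11.0], [11.0, 25.0]]
```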
-
skutil.metrics.power_kernel(X, Y=None, degree=1.0)
The power_kernel is also known as the (unrectified) triangular kernel. It is an example of a scale-invariant kernel (Sahbi and Fleuret, 2004) and is also only conditionally positive definite.
The kernel is given by:
\(k(x, y) = -||x-y||^d\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
degree : float, optional (default=1.0)
The polynomial degree tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
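Examples
One way to read the scale-invariance remark is that rescaling both inputs by a factor a only multiplies the kernel value by a**d, leaving the relative ordering of pairs unchanged. The sketch below checks that property for one pair; the helper name and the chosen values are hypothetical.

```python
import math


def power_kernel_sketch(x, y, degree=1.0):
    """k(x, y) = -||x - y|| ** degree for two points given as lists."""
    return -(math.dist(x, y) ** degree)


x, y = [0.0, 0.0], [3.0, 4.0]  # 5 units apart
a, d = 2.0, 2.0

base = power_kernel_sketch(x, y, degree=d)
scaled = power_kernel_sketch([a * v for v in x], [a * v for v in y], degree=d)

# Scaling the inputs by a multiplies the kernel by a ** d.
ratio_holds = math.isclose(scaled, (a ** d) * base)
```

Note that the kernel is never positive: it is 0 for identical points and decreases as the points move apart.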
-
skutil.metrics.rbf_kernel(X, Y=None, sigma=1.0)
The rbf_kernel is closely related to the exponential_kernel and gaussian_kernel. Note that the adjustable parameter, sigma, plays a major role in the performance of the kernel and should be carefully tuned. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.
The kernel is given by:
\(k(x, y) = \exp(- \gamma ||x-y||^2)\)
where:
\(\gamma = 1/\sigma^2\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
sigma : float, optional (default=1.0)
The exponential tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
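Examples
Taking the two formulas at face value, the RBF form with gamma = 1 / sigma**2 matches the Gaussian form evaluated at sigma / sqrt(2). The plain-Python sketch below checks this for one pair of points. Both helpers are hypothetical and only transcribe the documented formulas.

```python
import math


def rbf_sketch(x, y, sigma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||**2) with gamma = 1 / sigma**2."""
    gamma = 1.0 / sigma ** 2
    return math.exp(-gamma * math.dist(x, y) ** 2)


def gaussian_sketch(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||**2 / (2 * sigma**2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


x, y = [0.0, 1.0], [1.0, 3.0]  # squared distance is 1 + 4 = 5
s = 2.0

value = rbf_sketch(x, y, sigma=s)  # exp(-5 / 4)
matches = math.isclose(value, gaussian_sketch(x, y, sigma=s / math.sqrt(2.0)))
```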
-
skutil.metrics.spline_kernel(X, Y=None)
The spline_kernel is given as a piece-wise cubic polynomial, as derived in the works by Gunn (1998).
The kernel is given by:
\(k(x, y) = 1 + xy + xy \min(x,y) - \frac{x+y}{2} \min(x,y)^2 + \frac{1}{3} \min(x,y)^3\)
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Returns: res : float
The result of the kernel computation.
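Examples
The formula above is written for scalar inputs, and the following sketch evaluates it for a pair of numbers. How the library combines several features is not described here, so the sketch deliberately stays with the scalar case; the helper name is hypothetical.

```python
def spline_kernel_scalar(x, y):
    """Scalar spline kernel:
    1 + x*y + x*y*min(x, y) - (x + y)/2 * min(x, y)**2 + min(x, y)**3 / 3
    """
    m = min(x, y)
    return 1.0 + x * y + x * y * m - 0.5 * (x + y) * m ** 2 + (m ** 3) / 3.0


k_12 = spline_kernel_scalar(1.0, 2.0)  # 1 + 2 + 2 - 1.5 + 1/3 = 23/6
k_21 = spline_kernel_scalar(2.0, 1.0)  # symmetric in its arguments
k_00 = spline_kernel_scalar(0.0, 0.0)  # only the constant term survives
```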
-
skutil.metrics.tanh_kernel(X, Y=None, constant=0.0, alpha=1.0)
The tanh_kernel (Hyperbolic Tangent Kernel) is also known as the Sigmoid Kernel and as the Multilayer Perceptron (MLP) kernel. The Sigmoid Kernel comes from the Neural Networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons.
The kernel is given by:
\(k(x, y) = \tanh (\alpha x^T y + c)\)
It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. This kernel was quite popular for support vector machines due to its origin from neural network theory. Also, despite being only conditionally positive definite, it has been found to perform well in practice.
There are two adjustable parameters in the sigmoid kernel, the slope alpha and the intercept constant. A common value for alpha is 1/N, where N is the data dimension. A more detailed study on sigmoid kernels can be found in the works by Hsuan-Tien Lin and Chih-Jen Lin.
Parameters: X : array_like (float), shape=(n_samples, n_features)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
Y : array_like (float), shape=(n_samples, n_features), optional (default=None)
The array or pandas DataFrame on which to compute the kernel. If Y is None, the kernel will be computed with X.
constant : float, optional (default=0.0)
The linear tuning parameter.
alpha : float, optional (default=1.0)
The slope tuning parameter.
Returns: c : float
The result of the kernel computation.
References
Souza, Cesar R., Kernel Functions for Machine Learning Applications http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
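Examples
The rule of thumb alpha = 1/N can be tried with a short plain-Python sketch of the formula. The helper name is hypothetical, the parameter names follow the signature above, and the code is an illustration, not skutil's implementation.

```python
import math


def tanh_kernel_sketch(X, Y=None, constant=0.0, alpha=1.0):
    """Pairwise k(x, y) = tanh(alpha * <x, y> + constant) over lists of rows."""
    if Y is None:
        Y = X
    return [
        [math.tanh(alpha * sum(a * b for a, b in zip(x, y)) + constant) for y in Y]
        for x in X
    ]


X = [[1.0, 2.0], [3.0, 4.0]]
n_features = len(X[0])  # N, the data dimension

# Slope set to 1/N, as suggested above; intercept left at its default of 0.
K = tanh_kernel_sketch(X, alpha=1.0 / n_features)
# K[0][1] is tanh(0.5 * 11) = tanh(5.5)
```

Every entry lies between -1 and 1, and large inner products saturate towards 1, which is one reason the slope is often scaled down with the data dimension.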