skutil.h2o module

skutil.h2o bridges sklearn and H2O with custom encoders, grid search functionality, and over/undersampling class balancers.

class skutil.h2o.BaseH2OFeatureSelector(feature_names=None, target_feature=None, exclude_features=None, min_version='any', max_version=None)[source]

Bases: skutil.h2o.base.BaseH2OTransformer

Base class for all H2O selectors.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

min_version : str or float, optional (default='any')

The minimum version of h2o that is compatible with the transformer

max_version : str or float, optional (default=None)

The maximum version of h2o that is compatible with the transformer

New in version 0.1.0.

Methods

fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
transform(X)[source]

Transform the test frame, after fitting the transformer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The test frame to transform

Returns:

X : H2OFrame, shape=(n_samples, n_features)

The transformed frame

class skutil.h2o.BaseH2OFunctionWrapper(target_feature=None, min_version='any', max_version=None)[source]

Bases: sklearn.base.BaseEstimator

Base class for all H2O estimators or functions.

Parameters:

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit).

min_version : str or float, optional (default='any')

The minimum version of h2o that is compatible with the transformer

max_version : str or float, optional (default=None)

The maximum version of h2o that is compatible with the transformer

New in version 0.1.0.

Methods

get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
static load(location)[source]

Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk. If the instance is of a more complex class, i.e., one that contains an H2OEstimator, this method will handle loading these models separately and outside of the constraints of the pickle package.

Note that this is a static method and should be called accordingly:

>>> def load_and_transform(X):
...     from skutil.h2o.select import H2OMulticollinearityFilterer
...     mcf = H2OMulticollinearityFilterer.load(location='example/path.pkl')
...     return mcf.transform(X)
>>>
>>> load_and_transform(X) 

Some classes define their own load functionality, and will not work as expected if called in the following manner:

>>> def load_pipe():
...     return BaseH2OFunctionWrapper.load('path/to/h2o/pipeline.pkl')
>>>
>>> pipe = load_pipe() 

This is because of the aforementioned situation wherein some classes handle saves and loads of H2OEstimator objects differently. Thus, any class that is being loaded should be statically referenced at the level of lowest abstraction possible:

>>> def load_pipe():
...     from skutil.h2o.pipeline import H2OPipeline
...     return H2OPipeline.load('path/to/h2o/pipeline.pkl')
>>>
>>> pipe = load_pipe() 
Parameters:

location : str

The location where the persisted model resides.

Returns:

m : BaseH2OFunctionWrapper

The unpickled instance of the model
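
The advice to call load on the most specific class can be illustrated without skutil at all. The sketch below is only an analogy: BaseWrapper and PipelineLike are hypothetical stand-ins, and JSON replaces pickle to keep the example small. It shows that a static method called on the base class never reaches the subclass's own load logic.

```python
import json
import os
import tempfile


class BaseWrapper(object):
    """Hypothetical stand-in for a base class with a static ``load``."""

    @staticmethod
    def load(location):
        # Generic behavior: read back whatever state was persisted.
        with open(location) as f:
            return json.load(f)


class PipelineLike(BaseWrapper):
    """Hypothetical subclass whose ``load`` does extra restoration work."""

    @staticmethod
    def load(location):
        state = BaseWrapper.load(location)
        # Stand-in for reloading models stored outside the main file.
        state['models_restored'] = True
        return state


path = os.path.join(tempfile.mkdtemp(), 'pipe.json')
with open(path, 'w') as f:
    json.dump({'steps': ['nzv', 'gbm']}, f)

via_base = BaseWrapper.load(path)       # subclass logic never runs
via_subclass = PipelineLike.load(path)  # subclass logic runs
print('models_restored' in via_base, via_subclass['models_restored'])
```

Because static methods carry no reference to the calling class, nothing in the base load can dispatch to the subclass version; the caller has to name the right class.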

max_version

Returns the max version of H2O that is compatible with the BaseH2OFunctionWrapper instance. Some classes differ in their support for H2O versions, due to changes in the underlying API.

Returns:

mv : string, or None

If there is a max version associated with the BaseH2OFunctionWrapper, returns it as a string, otherwise returns None.

min_version

Returns the min version of H2O that is compatible with the BaseH2OFunctionWrapper instance. Some classes differ in their support for H2O versions, due to changes in the underlying API.

Returns:

mv : string

If there is a min version associated with the BaseH2OFunctionWrapper, returns it as a string, otherwise returns ‘any’
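
A version gate built on min_version and max_version might look roughly like the following. This is a sketch under stated assumptions, not skutil's implementation: parse_version and is_compatible are hypothetical helpers, and version strings are assumed to be purely numeric and dot-separated.

```python
def parse_version(version):
    """Turn '3.10.0.7' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in str(version).split('.'))


def is_compatible(current, min_version='any', max_version=None):
    """Check ``current`` against an optional lower and upper bound."""
    cur = parse_version(current)
    # 'any' means there is no lower bound.
    if min_version != 'any' and cur < parse_version(min_version):
        return False
    # None means there is no upper bound.
    if max_version is not None and cur > parse_version(max_version):
        return False
    return True


print(is_compatible('3.8.2.9', min_version='3.10.0.7'))
```

Comparing tuples of ints avoids the string-comparison trap where '3.8' sorts after '3.10'. Passing versions as floats is risky for the same reason (3.10 becomes 3.1), so strings are the safer choice in this sketch.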

save(location, warn_if_exists=True, **kwargs)[source]

Saves the BaseH2OFunctionWrapper to disk. If the instance is of a more complex class, i.e., one that contains an H2OEstimator, this method will handle saving these models separately and outside of the constraints of the pickle package. Any keyword arguments will be passed to the _save_internal method (if it exists).

Parameters:

location : str

The absolute path of the location where the transformer should be saved.

warn_if_exists : bool, optional (default=True)

Warn the user that location exists if True.

class skutil.h2o.BaseH2OTransformer(feature_names=None, target_feature=None, exclude_features=None, min_version='any', max_version=None)[source]

Bases: skutil.h2o.base.BaseH2OFunctionWrapper, sklearn.base.TransformerMixin

Base class for all H2OTransformers.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names during the fit.

min_version : str or float, optional (default='any')

The minimum version of h2o that is compatible with the transformer

max_version : str or float, optional (default=None)

The maximum version of h2o that is compatible with the transformer

New in version 0.1.0.

Methods

fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
fit_transform(frame)[source]

Fit the model and then immediately transform the input (training) frame with the fit parameters.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

The training frame

Returns:

ft : H2OFrame, shape=(n_samples, n_features)

The transformed training frame

class skutil.h2o.H2OFScoreKBestSelector(feature_names=None, target_feature=None, exclude_features=None, cv=3, k=10, iid=True)[source]

Bases: skutil.h2o.one_way_fs._BaseH2OFScoreSelector

Select the top k features based on the F-score, using the h2o_f_classif method.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

cv : int or H2OBaseCrossValidator, optional (default=3)

Univariate feature selection can very easily remove features erroneously or cause overfitting. Using cross validation, we can more confidently select the features to drop.

k : int, optional (default=10)

The number of features to keep.

iid : bool, optional (default=True)

Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold

min_version : str or float, optional (default='any')

The minimum version of h2o that is compatible with the transformer

max_version : str or float, optional (default=None)

The maximum version of h2o that is compatible with the transformer

Attributes:

scores_ : np.ndarray, float

The score array, adjusted for n_folds

p_values_ : np.ndarray, float

The p-value array, adjusted for n_folds

New in version 0.1.2.

Methods

fit(X) Fit the F-score feature selector.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
fit(X)[source]

Fit the F-score feature selector.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training frame on which to fit

Returns:

self :
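
The interplay of cv, iid and k described in the parameters above can be sketched in plain Python. The helpers below are hypothetical and do not call skutil or H2O; in particular, weighting each fold by its share of the observations is an assumption about what "normalized by the number of observations in each fold" means.

```python
def aggregate_fold_scores(fold_scores, fold_sizes, iid=True):
    """Combine per-fold score lists into one score per feature.

    fold_scores: one inner list of feature scores per fold.
    fold_sizes: number of observations in each fold.
    """
    n_features = len(fold_scores[0])
    if iid:
        # Weight each fold by its share of the observations.
        total = float(sum(fold_sizes))
        weights = [size / total for size in fold_sizes]
    else:
        # Plain mean across folds.
        weights = [1.0 / len(fold_scores)] * len(fold_scores)
    return [
        sum(w * scores[j] for w, scores in zip(weights, fold_scores))
        for j in range(n_features)
    ]


def select_k_best(names, scores, k=10):
    """Return the k highest-scoring feature names, in their original order."""
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:k])
    return [name for j, name in enumerate(names) if j in keep]


names = ['a', 'b', 'c']
fold_scores = [[1.0, 5.0, 3.0], [3.0, 1.0, 4.0]]
fold_sizes = [30, 10]
scores = aggregate_fold_scores(fold_scores, fold_sizes)
print(select_k_best(names, scores, k=2))
```

Averaging over folds before ranking is what makes the selection less sensitive to a single lucky or unlucky split.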

class skutil.h2o.H2OFScorePercentileSelector(feature_names=None, target_feature=None, exclude_features=None, cv=3, percentile=10, iid=True)[source]

Bases: skutil.h2o.one_way_fs._BaseH2OFScoreSelector

Select the top percentile of features based on the F-score, using the h2o_f_classif method.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

cv : int or H2OBaseCrossValidator, optional (default=3)

Univariate feature selection can very easily remove features erroneously or cause overfitting. Using cross validation, we can more confidently select the features to drop.

percentile : int, optional (default=10)

The percent of features to keep.

iid : bool, optional (default=True)

Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold

min_version : str or float, optional (default='any')

The minimum version of h2o that is compatible with the transformer

max_version : str or float, optional (default=None)

The maximum version of h2o that is compatible with the transformer

Attributes:

scores_ : np.ndarray, float

The score array, adjusted for n_folds

p_values_ : np.ndarray, float

The p-value array, adjusted for n_folds

New in version 0.1.2.

Methods

fit(X) Fit the F-score feature selector.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
fit(X)[source]

Fit the F-score feature selector.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training frame on which to fit

Returns:

self :

class skutil.h2o.H2OFeatureDropper(feature_names=None, target_feature=None, exclude_features=None)[source]

Bases: skutil.h2o.select.BaseH2OFeatureSelector

A very simple class to be used at the beginning or any stage of an H2OPipeline that will drop the given features from the remainder of the pipe.

This is useful when you have many features, but only a few to drop. Rather than passing the feature_names arg as the delta between many features and the several to drop, this allows you to drop them and keep feature_names as None in future steps.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

Attributes:

drop_ : list (str)

These are the features that will be dropped by the FeatureDropper

New in version 0.1.0.

Methods

fit(X) Fit the H2OTransformer.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
fit(X)[source]

Fit the H2OTransformer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

self :

class skutil.h2o.H2OGainsRandomizedSearchCV(estimator, param_grid, feature_names, target_feature, exposure_feature, loss_feature, premium_feature=None, n_iter=10, random_state=None, scoring='lift', scoring_params=None, cv=5, verbose=0, iid=True, validation_frame=None, minimize='bias', error_score=nan, error_behavior='warn')[source]

Bases: skutil.h2o.grid_search.H2ORandomizedSearchCV

A randomized search that scores based on actuarial metrics (see skutil.metrics.GainsStatisticalReport). This is a more customized form of grid search, and must use a gains metric provided by the GainsStatisticalReport.

Parameters:

estimator : H2OPipeline or H2OEstimator

The estimator to fit. Either a skutil.h2o.H2OPipeline or an H2OEstimator. If the estimator is a pipeline, it must contain an estimator as the final step.

param_grid : dict

The hyperparameter grid over which to search. If estimator is a skutil.h2o.H2OPipeline, the param_grid should be in the form of {'stepname__param':[values]}; if there are no named steps (i.e., if estimator is an H2OEstimator), param_grid should be in the form of {'param':[values]}. Note that a param_grid with named step parameters in the absence of named steps will raise an error.

feature_names : iterable (str)

The list of feature names on which to fit

target_feature : str

The name of the target

exposure_feature : str

The name of the exposure feature

loss_feature : str

The name of the loss feature

premium_feature : str, optional (default=None)

The name of the premium feature

n_iter : int, optional (default=10)

The number of iterations to fit. Note that n_iter * cv.get_n_splits() models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one) will be fit.

random_state : int, optional (default=None)

The random state for the search

scoring : str, optional (default='lift')

One of {'lift', 'gini'} or other valid GainsStatisticalReport scoring metrics.

scoring_params : dict, optional (default=None)

Any kwargs to be passed to the scoring function for scoring at each iteration.

cv : int or H2OCrossValidator, optional (default=5)

The number of folds to be fit for cross validation. Note that n_iter * cv.get_n_splits() models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one) will be fit.

verbose : int, optional (default=0)

The level of verbosity: 0, 1, 2 or greater. A verbosity of 0 will produce no output other than the default H2O fit/predict output. A verbosity of 1 will print the selected parameters at each fold and iteration, and a verbosity of 2 will produce all of the aforementioned output plus the intermediate fold scores.

iid : bool, optional (default=True)

Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.

validation_frame : H2OFrame, optional (default=None)

An optional frame on which to score at the end of all of the model fits. Note that this will NOT be used in the actual model selection process.

minimize : str, optional (default='bias')

How the search selects the best model to fit on the entire dataset. One of {'bias', 'variance'}. The default behavior is 'bias', which is also the default behavior of sklearn. This will select the set of hyperparameters which maximizes the cross validation score mean. Alternatively, 'variance' will select the model which minimizes the standard deviation between cross validation scores.

error_score : float, optional (default=np.nan)

The default score to use in the case of a pd.qcut ValueError (when there are non-unique bin edges)

error_behavior : str, optional (default='warn')

How to handle the pd.qcut ValueError. One of {'warn', 'raise', 'ignore'}

New in version 0.1.0.

Methods

download_pojo(*args, **kwargs) This method is injected at runtime if the best_estimator_ is an instance of an H2OEstimator.
fit(frame) Fit the grid search.
fit_predict(frame) First, fits the grid search and then generates predictions on the training frame using the best_estimator_.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OSearchCV from disk.
plot(timestep, metric) Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
predict(*args, **kwargs) After the grid search is fit, generates predictions on the test frame using the best_estimator_.
report_scores() Create a dataframe report for the fitting and scoring of the gains search.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
score(frame) Predict and score on a new frame.
set_params(**params) Set the parameters of this estimator.
varimp(*args, **kwargs) Get the variable importance, if the final estimator implements such a function.
fit(frame)[source]

Fit the grid search.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

The training frame on which to fit.

report_scores()[source]

Create a dataframe report for the fitting and scoring of the gains search. Will report lift, gini and any other relevant metrics. If a validation set was included, will also report validation scores.

Returns:

rdf : pd.DataFrame, shape=(n_iter, n_params)

The grid search report

score(frame)[source]

Predict and score on a new frame. Note that this method will not store performance metrics in the report that report_scores generates.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

The test frame on which to predict and score performance.

Returns:

scor : float

The score on the testing frame
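
Two of the parameters documented for this search lend themselves to a small worked example: the number of models implied by n_iter and cv, and the effect of minimize. The functions below are hypothetical helpers written for illustration, not part of skutil; using the population standard deviation for 'variance' is an assumption.

```python
from statistics import mean, pstdev


def n_models_fit(n_iter, n_splits):
    """One model per fold per iteration, plus the final fit on the full frame."""
    return n_iter * n_splits + 1


def pick_best(cv_scores, minimize='bias'):
    """Return the index of the winning parameter setting.

    cv_scores: one list of per-fold scores per parameter setting.
    """
    if minimize == 'bias':
        # Highest mean cross-validation score wins.
        return max(range(len(cv_scores)), key=lambda i: mean(cv_scores[i]))
    if minimize == 'variance':
        # Smallest spread across folds wins.
        return min(range(len(cv_scores)), key=lambda i: pstdev(cv_scores[i]))
    raise ValueError("minimize must be 'bias' or 'variance'")


cv_scores = [[0.90, 0.60, 0.90],   # higher mean, unstable across folds
             [0.75, 0.76, 0.74]]   # lower mean, stable across folds
print(n_models_fit(10, 10), pick_best(cv_scores), pick_best(cv_scores, 'variance'))
```

The two criteria can disagree, as they do here: the first setting has the better average, while the second is far more consistent from fold to fold.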

class skutil.h2o.H2OGridSearchCV(estimator, param_grid, feature_names, target_feature, scoring=None, scoring_params=None, cv=5, verbose=0, iid=True, validation_frame=None, minimize='bias')[source]

Bases: skutil.h2o.grid_search.BaseH2OSearchCV

An exhaustive grid search that will fit models across the entire hyperparameter grid provided.

Parameters:

estimator : H2OPipeline or H2OEstimator

The estimator to fit. Either a skutil.h2o.H2OPipeline or an H2OEstimator. If the estimator is a pipeline, it must contain an estimator as the final step.

param_grid : dict

The hyperparameter grid over which to search. If estimator is a skutil.h2o.H2OPipeline, the param_grid should be in the form of {'stepname__param':[values]}; if there are no named steps (i.e., if estimator is an H2OEstimator), param_grid should be in the form of {'param':[values]}. Note that a param_grid with named step parameters in the absence of named steps will raise an error.

feature_names : iterable (str)

The list of feature names on which to fit

target_feature : str

The name of the target

scoring : str, optional (default=None)

A valid scoring metric, e.g., 'accuracy_score'. See skutil.h2o.grid_search.SCORERS for a comprehensive list.

scoring_params : dict, optional (default=None)

Any kwargs to be passed to the scoring function for scoring at each iteration.

cv : int or H2OCrossValidator, optional (default=5)

The number of folds to be fit for cross validation.

verbose : int, optional (default=0)

The level of verbosity: 0, 1, 2 or greater. A verbosity of 0 will produce no output other than the default H2O fit/predict output. A verbosity of 1 will print the selected parameters at each fold and iteration, and a verbosity of 2 will produce all of the aforementioned output plus the intermediate fold scores.

iid : bool, optional (default=True)

Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.

validation_frame : H2OFrame, optional (default=None)

An optional frame on which to score at the end of all of the model fits. Note that this will NOT be used in the actual model selection process.

minimize : str, optional (default='bias')

How the search selects the best model to fit on the entire dataset. One of {'bias', 'variance'}. The default behavior is 'bias', which is also the default behavior of sklearn. This will select the set of hyperparameters which maximizes the cross validation score mean. Alternatively, 'variance' will select the model which minimizes the standard deviation between cross validation scores.

New in version 0.1.0.

Methods

download_pojo(*args, **kwargs) This method is injected at runtime if the best_estimator_ is an instance of an H2OEstimator.
fit(frame) Fit the grid search.
fit_predict(frame) First, fits the grid search and then generates predictions on the training frame using the best_estimator_.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OSearchCV from disk.
plot(timestep, metric) Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
predict(*args, **kwargs) After the grid search is fit, generates predictions on the test frame using the best_estimator_.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
score(frame) After the grid search is fit, generates and scores the predictions of the best_estimator_.
set_params(**params) Set the parameters of this estimator.
varimp(*args, **kwargs) Get the variable importance, if the final estimator implements such a function.
fit(frame)[source]

Fit the grid search.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

The training frame on which to fit.
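
The two param_grid shapes described in the parameters above, and what "exhaustive" means in practice, can be sketched with the standard library alone. expand_grid and check_grid are hypothetical helpers, and the step and parameter names ('nzv', 'gbm', 'ntrees', 'max_depth') are chosen for illustration only; how skutil itself validates the grid is not shown here.

```python
from itertools import product


def expand_grid(param_grid):
    """Enumerate every parameter combination in a grid, in sorted key order."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]


def check_grid(param_grid, step_names=None):
    """Raise if 'step__param' keys are used without a matching named step."""
    step_names = set(step_names or [])
    for key in param_grid:
        if '__' in key:
            step = key.split('__', 1)[0]
            if step not in step_names:
                raise ValueError('no step named %r for key %r' % (step, key))


# Grid for a pipeline with named steps versus a bare estimator.
pipeline_grid = {'gbm__ntrees': [50, 100], 'gbm__max_depth': [3, 5, 7]}
estimator_grid = {'ntrees': [50, 100], 'max_depth': [3, 5, 7]}

check_grid(pipeline_grid, step_names=['nzv', 'gbm'])
check_grid(estimator_grid)
print(len(expand_grid(estimator_grid)))  # 2 * 3 combinations
```

An exhaustive search fits every one of these combinations for every fold, so the grid size multiplies quickly as parameters are added.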

class skutil.h2o.H2OInteractionTermTransformer(feature_names=None, target_feature=None, exclude_features=None, interaction_function=None, name_suffix='I', only_return_interactions=False)[source]

Bases: skutil.h2o.base.BaseH2OTransformer

A class that will generate interaction terms between selected columns. An interaction captures some relationship between two independent variables in the form of:

\(I_n = x_i \cdot x_j\)

Note that the H2OInteractionTermTransformer will only operate on the feature_names, and at the transform point will return ALL features plus the newly generated ones unless otherwise specified in the only_return_interactions parameter.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

interaction_function : callable, optional (default=None)

A callable for interactions. The default, None, will result in multiplication of the two feature columns.

name_suffix : str, optional (default='I')

The suffix to add to the new feature name in the form of <feature_x>_<feature_y>_<suffix>

only_return_interactions : bool, optional (default=False)

If set to True, will only return features in feature_names and their respective generated interaction terms.

Attributes:

fun_ : callable

The interaction term function assigned in the fit method.

New in version 0.1.0.

Methods

fit(frame) Fit the transformer.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(X) Perform the interaction term expansion.
fit(frame)[source]

Fit the transformer.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

self :

transform(X)[source]

Perform the interaction term expansion.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The test data to transform.

Returns:

frame : H2OFrame, shape=(n_samples, n_features)

The expanded (interacted) test data.
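
As a rough illustration of the expansion and naming scheme, the sketch below builds pairwise interaction columns on a plain dict of lists rather than an H2OFrame. interaction_terms is a hypothetical function that mirrors the documented parameters; generating one term per unordered pair of feature_names is an assumption.

```python
from itertools import combinations


def _multiply(left, right):
    """Default interaction: element-wise product of two columns."""
    return [x * y for x, y in zip(left, right)]


def interaction_terms(columns, feature_names, interaction_function=None,
                      name_suffix='I', only_return_interactions=False):
    """Add pairwise interaction columns to a dict of equal-length lists."""
    fun = interaction_function or _multiply

    new_cols = {}
    for left, right in combinations(feature_names, 2):
        # Names follow <feature_x>_<feature_y>_<suffix>.
        name = '%s_%s_%s' % (left, right, name_suffix)
        new_cols[name] = fun(columns[left], columns[right])

    if only_return_interactions:
        # Keep only the interacted features and their new terms.
        out = {name: columns[name] for name in feature_names}
    else:
        out = dict(columns)
    out.update(new_cols)
    return out


data = {'x1': [1, 2, 3], 'x2': [4, 5, 6], 'y': [0, 1, 0]}
result = interaction_terms(data, feature_names=['x1', 'x2'])
print(sorted(result))
```

With only_return_interactions=True the untouched column 'y' would be left out of the result.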

class skutil.h2o.H2OKFold(n_folds=3, shuffle=False, random_state=None)[source]

Bases: skutil.h2o.split._H2OBaseKFold

K-folds cross-validator for an H2OFrame.

Parameters:

n_folds : int, optional (default=3)

The number of splits

shuffle : bool, optional (default=False)

Whether to shuffle indices

random_state : int or RandomState, optional (default=None)

The random state for the split

Methods

get_n_splits() Get the number of splits or folds.
split(frame[, y]) Split the frame.
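
For readers unfamiliar with k-fold splitting, the following sketch shows the index bookkeeping on plain Python lists. kfold_indices is a hypothetical helper; the way leftover rows are distributed across folds is an assumption and may differ from what H2OKFold does.

```python
import random


def kfold_indices(n_samples, n_folds=3, shuffle=False, random_state=None):
    """Yield (train, test) index lists for each of ``n_folds`` folds."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_state).shuffle(indices)

    # The first n_samples % n_folds folds get one extra sample.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


for train, test in kfold_indices(7, n_folds=3):
    print(train, test)
```

Every row lands in exactly one test fold, so each observation is used for validation once and for training n_folds - 1 times.
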
class skutil.h2o.H2OLabelEncoder[source]

Bases: skutil.h2o.base.BaseH2OTransformer

Encode categorical values in an H2OFrame (single column) into ordinal labels from 0 to n_classes - 1.

Attributes:

classes_ : np.ndarray

The unique class levels

New in version 0.1.0.

Examples

>>> def example():
...     import pandas as pd
...     from skutil.h2o import from_pandas
...     from skutil.h2o.transform import H2OLabelEncoder
...     
...     x = pd.DataFrame.from_records(data=[
...                 [5, 4],
...                 [6, 2],
...                 [5, 1],
...                 [7, 9],
...                 [7, 2]], columns=['C1', 'C2'])
...     
...     X = from_pandas(x)
...     encoder = H2OLabelEncoder()
...     encoder.fit_transform(X['C1'])
>>>
>>> example() 
  C1
----
   0
   1
   0
   2
   2
[5 rows x 1 column]
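
The mapping in the output above can be reproduced on a plain list, which may help when checking results by hand. The two functions below are hypothetical and assume the levels are ordered by sorting, which is consistent with the output shown but is not confirmed for every input type.

```python
def fit_label_encoder(values):
    """Return the sorted unique levels, analogous to ``classes_``."""
    return sorted(set(values))


def transform_labels(values, classes):
    """Map each value to its position in ``classes``."""
    lookup = {level: code for code, level in enumerate(classes)}
    return [lookup[v] for v in values]


column = [5, 6, 5, 7, 7]
classes = fit_label_encoder(column)
codes = transform_labels(column, classes)
print(classes, codes)
```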

Methods

fit(column)
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(column)
fit(column)[source]
transform(column)[source]
class skutil.h2o.H2OMulticollinearityFilterer(feature_names=None, target_feature=None, exclude_features=None, threshold=0.85, na_warn=True, na_rm=False, use='complete.obs')[source]

Bases: skutil.h2o.select.BaseH2OFeatureSelector

Filter out features with a correlation greater than the provided threshold. When a pair of correlated features is identified, the mean absolute correlation (MAC) of each feature is considered, and the feature with the highest MAC is discarded.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

threshold : float, optional (default=0.85)

The threshold above which to filter correlated features

na_warn : bool, optional (default=True)

Whether to warn if any NAs are present

na_rm : bool, optional (default=False)

Whether to remove NA values

use : str, optional (default='complete.obs')

One of {'complete.obs', 'all.obs', 'everything'}. A string indicating how to handle missing values.

Attributes:

drop_ : list, string

The columns to drop

mean_abs_correlations_ : list, float

The corresponding mean absolute correlations of each drop_ name

correlations_ : list of named tuples

A list of tuples with each tuple containing the two correlated features, the level of correlation, the feature that was selected for dropping, and the mean absolute correlation of the dropped feature.

New in version 0.1.0.

Methods

fit(X) Fit the H2OTransformer.
fit_transform(X) Fit the multicollinearity filterer and return the transformed H2OFrame, X.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
fit(X)[source]

Fit the H2OTransformer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

self :

fit_transform(X)[source]

Fit the multicollinearity filterer and return the transformed H2OFrame, X.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit

Returns:

X : H2OFrame, shape=(n_samples, n_features)

The transformed training data
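
The mean absolute correlation (MAC) rule used by this filterer can be traced on a small, hand-written correlation matrix. The sketch below is a hypothetical pure-Python version: the order in which pairs are visited, and the choice not to recompute MAC after each drop, are assumptions rather than documented behavior.

```python
def multicollinearity_drops(names, corr, threshold=0.85):
    """Return the names to drop, given a square correlation matrix.

    corr[i][j] is the correlation between names[i] and names[j].
    """
    n = len(names)
    # Mean absolute correlation of each feature with every other feature.
    mac = [sum(abs(corr[i][j]) for j in range(n) if j != i) / (n - 1)
           for i in range(n)]

    dropped = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in dropped or j in dropped:
                continue
            if abs(corr[i][j]) > threshold:
                # Discard whichever member of the pair has the higher MAC.
                dropped.add(i if mac[i] >= mac[j] else j)
    return [names[i] for i in sorted(dropped)]


names = ['a', 'b', 'c']
corr = [[1.0, 0.9, 0.5],
        [0.9, 1.0, 0.1],
        [0.5, 0.1, 1.0]]
print(multicollinearity_drops(names, corr))
```

Here 'a' and 'b' exceed the threshold, and 'a' is discarded because it is also moderately correlated with 'c', giving it the higher MAC (0.7 versus 0.5).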

class skutil.h2o.H2ONearZeroVarianceFilterer(feature_names=None, target_feature=None, exclude_features=None, threshold=1e-06, na_warn=True, na_rm=False, use='complete.obs', strategy='variance')[source]

Bases: skutil.h2o.select.BaseH2OFeatureSelector

Identify and remove any features that have a variance below a certain threshold. There are two possible strategies for near-zero variance feature selection:

  1. Select features on the basis of the actual variance they exhibit. This is only relevant when the features are real numbers.
  2. Remove features where the ratio of the frequency of the most prevalent value to that of the second-most frequent value is large, say 20 or above (Kuhn & Johnson [R4]).

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

threshold : float, optional (default=1e-6)

The threshold below which to declare “zero variance”

na_warn : bool, optional (default=True)

Whether to warn if any NAs are present

na_rm : bool, optional (default=False)

Whether to remove NA values

use : str, optional (default='complete.obs')

One of {'complete.obs', 'all.obs', 'everything'}. A string indicating how to handle missing values.

strategy : str, optional (default='variance')

The strategy by which feature selection should be performed, one of (‘variance’, ‘ratio’). If strategy is ‘variance’, features will be selected based on the amount of variance they exhibit; those that are low-variance (below threshold) will be removed. If strategy is ‘ratio’, features are dropped if the most prevalent value is represented at a ratio greater than threshold to the second-most frequent value. Note that if strategy is ‘ratio’, threshold must be greater than 1.

Attributes:

drop_ : list, string

The columns to drop

var_ : dict

The dropped columns mapped to their corresponding variances or ratios, depending on the strategy

References

[R4] Kuhn, M. & Johnson, K. “Applied Predictive Modeling” (2013). New York, NY: Springer.

New in version 0.1.0.
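
The two strategies described above can be sketched in plain Python, independent of H2O. This is an illustration only, not the library's implementation: the helper name `near_zero_variance_columns`, the use of the sample variance, and the treatment of a single-valued column as having an infinite ratio are all assumptions made for the example.

```python
from collections import Counter
from statistics import variance


def near_zero_variance_columns(columns, threshold, strategy="variance"):
    """Map each column that would be dropped to its variance or ratio.

    `columns` is a dict of column name -> list of values, a hypothetical
    stand-in for an H2OFrame.
    """
    dropped = {}
    for name, values in columns.items():
        if strategy == "variance":
            stat = variance(values)  # sample variance (an assumption)
            drop = stat < threshold
        elif strategy == "ratio":
            top_two = Counter(values).most_common(2)
            if len(top_two) < 2:
                # only one distinct value: treated as an infinite ratio (assumption)
                stat = float("inf")
            else:
                stat = top_two[0][1] / top_two[1][1]
            drop = stat > threshold
        else:
            raise ValueError("strategy must be 'variance' or 'ratio'")
        if drop:
            dropped[name] = stat
    return dropped


columns = {
    "constant": [1.0] * 100,
    "varied": list(range(100)),
    "skewed": [0] * 95 + [1] * 5,
}
by_variance = near_zero_variance_columns(columns, 1e-6, "variance")
by_ratio = near_zero_variance_columns(columns, 15, "ratio")
```

With this toy data, only "constant" falls below the variance threshold, while the ratio strategy also flags "skewed", whose most prevalent value occurs 19 times as often as the runner-up.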

Methods

fit(X) Fit the near zero variance filterer.
fit_transform(X) Fit the near zero variance filterer, return the transformed X frame.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
fit(X)[source]

Fit the near zero variance filterer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

self :

fit_transform(X)[source]

Fit the near zero variance filterer, return the transformed X frame.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

X : H2OFrame, shape=(n_samples, n_features)

The transformed training data

class skutil.h2o.H2OOversamplingClassBalancer(target_feature, ratio=0.2, shuffle=True)[source]

Bases: skutil.h2o.balance._BaseH2OBalancer

Oversample the minority classes until they are represented at the target proportion to the majority class.

Parameters:

target_feature : str

The name of the response column. The response column must contain more than one class and fewer than skutil.preprocessing.balance.BalancerMixin._max_classes classes.

ratio : float, optional (default=0.2)

The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input frame.

shuffle : bool, optional (default=True)

Whether or not to shuffle rows on return

Examples

Consider the following example: with a ratio of 0.5, the minority classes (1, 2) will be oversampled until each is represented at a ratio of at least 0.5 relative to the majority class (0).

>>> def example():
...     import h2o
...     import pandas as pd
...     import numpy as np
...     from skutil.h2o.frame import value_counts
...     from skutil.h2o import from_pandas, H2OOversamplingClassBalancer
...
...     # initialize h2o
...     h2o.init()
...
...     # build the imbalanced data in pandas: 100 zeros, 30 ones, 25 twos
...     x = pd.DataFrame(np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2]), columns=['A'])
...
...     # load into h2o
...     X = from_pandas(x)
...
...     # initialize sampler
...     sampler = H2OOversamplingClassBalancer(target_feature="A", ratio=0.5)
...
...     # do balancing and return the class counts
...     X_balanced = sampler.balance(X)
...     return value_counts(X_balanced)
>>>
>>> example()
0    100
1     50
2     50
Name: A, dtype: int64

New in version 0.1.0.
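
The arithmetic behind the ratio parameter can be sketched without H2O. The helper `oversampling_targets` below is hypothetical and only computes the per-class record counts an oversampler would aim for; rounding up with `math.ceil` is an assumption chosen so that each minority class reaches at least the requested ratio.

```python
import math
from collections import Counter


def oversampling_targets(labels, ratio):
    """Return the target record count per class after oversampling."""
    counts = Counter(labels)
    majority = max(counts.values())
    # smallest count that satisfies count / majority >= ratio
    needed = math.ceil(ratio * majority)
    # classes already at or above the ratio are left unchanged
    return {cls: max(n, needed) for cls, n in counts.items()}


labels = [0] * 100 + [1] * 30 + [2] * 25
targets = oversampling_targets(labels, ratio=0.5)
# targets == {0: 100, 1: 50, 2: 50}
```

With a ratio of 0.2, both minority classes already exceed the target proportion, so the counts come back unchanged.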

Methods

balance(X) Apply the oversampling balance operation.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
balance(X)[source]

Apply the oversampling balance operation. Oversamples the minority class to the provided ratio of minority class(es) : majority class.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The imbalanced dataset.

Returns:

Xb : H2OFrame, shape=(n_samples, n_features)

The balanced H2OFrame

class skutil.h2o.H2OPipeline(steps, feature_names=None, target_feature=None, exclude_from_ppc=None, exclude_from_fit=None)[source]

Bases: skutil.h2o.base.BaseH2OFunctionWrapper, skutil.h2o.base.VizMixin

Create a sklearn-esque pipeline of H2O steps finished with an optional H2OEstimator. Note that as of version 0.1.0, the behavior of the H2OPipeline has slightly changed, given the inclusion of the exclude_from_ppc and exclude_from_fit parameters.

The pipeline, at its core, consists of a list of length-two tuples in the form of ('name', SomeH2OTransformer()), terminated by an optional H2OEstimator as the final step. The pipeline fits each stage in order, transforming the training data prior to fitting the next stage. When predicting on or transforming new (test) data, each stage calls either transform or predict, as appropriate for that step.

On the topic of exclusions and feature_names:

Prior to version 0.1.0, H2OTransformers did not take the keyword exclude_features. Its addition necessitated two new keywords in the H2OPipeline, and a slight change in behavior of feature_names:

  • exclude_from_ppc - If set in the H2OPipeline constructor, these features will be universally omitted from every preprocessing stage. Since exclude_features can be set individually in each separate transformer, in the case that exclude_features has been explicitly set, the exclusions in that respective stage will include the union of exclude_from_ppc and exclude_features.

  • exclude_from_fit - If set in the H2OPipeline constructor, these features will be omitted from the training_cols_ fit attribute, which are the columns passed to the final stage in the pipeline.

  • feature_names - The former behavior of the H2OPipeline only used feature_names in the fit of the first transformer, passing the remaining columns to the next transformer as the feature_names parameter. The new behavior is more discriminating in the case of explicitly-set attributes. In the case where a transformer’s feature_names parameter has been explicitly set, only those names will be used in the fit. This is useful in cases where someone may only want to, for instance, drop one of two multicollinear features using the H2OMulticollinearityFilterer rather than fitting against the entire dataset. It also adheres to the now expected behavior of the exclusion parameters.
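
The exclusion rules above can be illustrated with a small, H2O-free sketch. The function `stage_fit_columns` is hypothetical and is not part of skutil; it only shows how a stage's explicit feature_names, its own exclude_features, the pipeline-wide exclude_from_ppc, and the target feature might combine to determine the columns a preprocessing stage is fit on.

```python
def stage_fit_columns(available, stage_feature_names=None, stage_exclude=None,
                      exclude_from_ppc=None, target_feature=None):
    """Return the columns a single preprocessing stage would be fit on."""
    # union of the stage-level and pipeline-level exclusions
    excluded = set(stage_exclude or ()) | set(exclude_from_ppc or ())
    if target_feature is not None:
        excluded.add(target_feature)
    # explicitly set feature_names take precedence over the incoming columns
    if stage_feature_names is not None:
        candidates = stage_feature_names
    else:
        candidates = available
    return [col for col in candidates if col not in excluded]


available = ["CRIM", "CHAS", "TAX", "B", "PTRATIO", "target"]

# a stage with explicit feature_names only sees those names
scaler_cols = stage_fit_columns(
    available, stage_feature_names=["B", "PTRATIO", "CRIM"],
    exclude_from_ppc=["TAX"], target_feature="target")

# a stage without feature_names sees everything minus the exclusions
filter_cols = stage_fit_columns(
    available, stage_exclude=["CHAS"],
    exclude_from_ppc=["TAX"], target_feature="target")
```

Here `scaler_cols` is `['B', 'PTRATIO', 'CRIM']` and `filter_cols` is `['CRIM', 'B', 'PTRATIO']`, since 'CHAS', 'TAX' and the target are all excluded from the second stage.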

Parameters:

steps : list

A list of named tuples wherein element 1 of each tuple is an instance of a BaseH2OTransformer or an H2OEstimator.

feature_names : iterable, optional (default=None)

The names of features on which to fit the first transformer in the pipeline. The next transformer will be fit with feature_names as the result-set columns from the previous transformer, minus any exclusions or target features.

target_feature : str, optional (default=None)

The name of the target feature

exclude_from_ppc : iterable, optional (default=None)

Any names to be excluded from any preprocessor fits. Since the exclude_features can be set in respective steps in each preprocessor, these features will be considered as global exclusions and will be appended to any individually set exclusion features.

exclude_from_fit : iterable, optional (default=None)

Any names to be excluded from the final model fit

Attributes:

training_cols_ : list (str), shape=(n_features,)

The columns that are retained for training purposes after the _pre_transform operation, which fits the series of transformers but not the final estimator.

New in version 0.1.0.

Examples

The following is a simple example of an H2OPipeline in use:

>>> def example():
...     import h2o
...     from h2o.estimators import H2ORandomForestEstimator
...     from skutil.h2o import H2OMulticollinearityFilterer
...     from skutil.h2o import H2OPipeline
...     from skutil.h2o import load_iris_h2o
...
...     # initialize h2o
...     h2o.init()
...
...     # load into h2o
...     X = load_iris_h2o(tgt_name="Species")
...
...     # get feature names and target
...     x, y = X.columns[:-1], X.columns[-1]
...
...     # define the pipe, then fit it on the training frame
...     pipe = H2OPipeline([
...         ('mcf', H2OMulticollinearityFilterer()),
...         ('clf', H2ORandomForestEstimator())
...     ], feature_names=x, target_feature=y).fit(X)
>>>     
>>> example() 

This is a more advanced example of the H2OPipeline (including use of the exclude_from_ppc and exclude_from_fit parameters):

>>> def example():
...     import h2o
...     from skutil.h2o import H2OPipeline
...     from skutil.h2o import load_boston_h2o
...     from skutil.h2o import h2o_train_test_split
...     from skutil.h2o.transform import H2OSelectiveScaler
...     from skutil.h2o.select import H2OMulticollinearityFilterer
...     from h2o.estimators import H2OGradientBoostingEstimator
...
...     # initialize h2o
...     h2o.init()
...
...     # load into h2o
...     X = load_boston_h2o(include_tgt=True, shuffle=True, tgt_name='target')
...
...     # this splits our data
...     X_train, X_test = h2o_train_test_split(X, train_size=0.7)
...
...     # Declare our pipe - this one is intentionally a bit complex in behavior
...     pipe = H2OPipeline([
...             ('scl', H2OSelectiveScaler(feature_names=['B','PTRATIO','CRIM'])),  # will ONLY operate on these
...             ('mcf', H2OMulticollinearityFilterer(exclude_features=['CHAS'])),   # will exclude this & 'TAX'
...             ('gbm', H2OGradientBoostingEstimator())
...         ], exclude_from_ppc=['TAX'], # excluded from all preprocessor fits
...            feature_names=None,       # fit the first stage on ALL features (minus exceptions)
...            target_feature='target'   # will be excluded from all preprocessor fits, as it's the target
...     ).fit(X_train)
>>>
>>> example() 

Methods

download_pojo(\*args, \*\*kwargs) This method is injected at runtime if the _final_estimator is an instance of an H2OEstimator.
fit(frame) Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_predict(\*args, \*\*kwargs) Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_transform(\*args, \*\*kwargs) Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of H2OPipeline from disk.
plot(\*args, \*\*kwargs) If the _final_estimator is an H2OEstimator, this method is injected at runtime.
predict(\*args, \*\*kwargs) Applies transforms to the data, and the predict method of the final estimator.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters for this pipeline.
transform(\*args, \*\*kwargs) Applies transforms to the data.
varimp(\*args, \*\*kwargs) Get the variable importance, if the final estimator implements such a function.
download_pojo(*args, **kwargs)[source]

This method is injected at runtime if the _final_estimator is an instance of an H2OEstimator. This method downloads the POJO from a fit estimator.

Parameters:

path : string, optional (default=””)

Path to folder in which to save the POJO.

get_jar : bool, optional (default=True)

Whether to get the jar from the POJO.

Returns:

None or string :

Returns None if path is “”; otherwise, returns the filepath where the POJO was saved.

fit(frame)[source]

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

Training data on which to fit. Must fulfill input requirements of first step of the pipeline.

Returns:

self :

fit_predict(*args, **kwargs)[source]

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator. Finally, predict on the final step.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

Training data. Must fulfill input requirements of first step of the pipeline.

fit_transform(*args, **kwargs)[source]

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator. Finally, transform on the final step.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

Training data. Must fulfill input requirements of first step of the pipeline.

Returns:

Xt : H2OFrame, shape=(n_samples, n_features)

The transformed training data

static load(location)[source]

Loads a persisted state of an instance of H2OPipeline from disk. This method will handle loading H2OEstimator models separately and outside of the constraints of the pickle package.

Note that this is a static method and should be called accordingly:

>>> def load_pipe():
...     return H2OPipeline.load('path/to/h2o/pipeline.pkl') # GOOD!
>>>
>>> pipe = load_pipe() 

Also note that since H2OPipeline can contain an H2OEstimator, its load functionality differs from that of its superclass, BaseH2OFunctionWrapper, and will not function properly if called at the highest level of abstraction:

>>> def load_pipe():
...     return BaseH2OFunctionWrapper.load('path/to/h2o/pipeline.pkl') # BAD!
>>>
>>> pipe = load_pipe() 

Furthermore, trying to load a different type of BaseH2OFunctionWrapper from this method will raise a TypeError:

>>> def load_pipe():
...     return H2OPipeline.load('path/to/some/other/transformer.pkl') # BAD!
>>>
>>> pipe = load_pipe() 
Parameters:

location : str

The location where the persisted H2OPipeline model resides.

Returns:

model : H2OPipeline

The unpickled instance of the H2OPipeline model

named_steps

Generates a dictionary of all of the stages where the stage name is the key, and the stage is the value. Note that dictionaries are not guaranteed to preserve a specific order.

Returns:

d : dict

The dictionary of named steps.

plot(*args, **kwargs)[source]

If the _final_estimator is an H2OEstimator, this method is injected at runtime. This method plots an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).

Parameters:

timestep : str

A timestep as defined in the H2O API. One of (“AUTO”, “duration”, “number_of_trees”).

metric : str

The performance metric to evaluate. One of (“log_likelihood”, “objective”, “MSE”, “AUTO”)

predict(*args, **kwargs)[source]

Applies transforms to the data, and the predict method of the final estimator. Valid only if the final estimator implements predict.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

Data to predict on. Must fulfill input requirements of first step of the pipeline.

set_params(**params)[source]

Set the parameters for this pipeline. Will revalidate the steps in the estimator prior to setting the parameters. Parameters is a **kwargs-style dictionary whose keys should be prefixed by the name of the step targeted and a double underscore:

>>> def example():
...     from skutil.h2o.select import H2OMulticollinearityFilterer
...     from h2o.estimators import H2ORandomForestEstimator
...     
...     pipe = H2OPipeline([
...         ('mcf', H2OMulticollinearityFilterer()),
...         ('rf',  H2ORandomForestEstimator())
...     ])
...
...     pipe.set_params(**{
...         'rf__ntrees':     100,
...         'mcf__threshold': 0.75
...     })
>>>
>>> example() 
Returns:

self :

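
The step-prefix convention can be sketched as a simple key split. The helper `group_step_params` is hypothetical, not the pipeline's actual code, and grouping un-prefixed keys under `None` (as if they targeted the pipeline itself) is an assumption made for the illustration.

```python
def group_step_params(params):
    """Group 'step__param' keys by step name."""
    grouped = {}
    for key, value in params.items():
        # split on the first double underscore only
        step, sep, name = key.partition("__")
        if not sep:
            # no prefix: assumed to target the pipeline itself
            step, name = None, key
        grouped.setdefault(step, {})[name] = value
    return grouped


grouped = group_step_params({"rf__ntrees": 100, "mcf__threshold": 0.75})
# grouped == {'rf': {'ntrees': 100}, 'mcf': {'threshold': 0.75}}
```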
transform(*args, **kwargs)[source]

Applies transforms to the data. Valid only if the final estimator implements predict.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

Data to transform. Must fulfill input requirements of first step of the pipeline.

Returns:

Xt : H2OFrame, shape=(n_samples, n_features)

The transformed test data

varimp(*args, **kwargs)[source]

Get the variable importance, if the final estimator implements such a function.

Parameters:

use_pandas : bool, optional (default=True)

Whether to return a pandas dataframe

class skutil.h2o.H2ORandomizedSearchCV(estimator, param_grid, feature_names, target_feature, n_iter=10, random_state=None, scoring=None, scoring_params=None, cv=5, verbose=0, iid=True, validation_frame=None, minimize='bias')[source]

Bases: skutil.h2o.grid_search.BaseH2OSearchCV

A grid search that operates over a random sub-hyperparameter space at each iteration.
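
As a rough illustration of searching a random subset of a hyperparameter grid, the sketch below draws n_iter settings from a param_grid and counts the resulting model fits. Both helpers are hypothetical and use only the standard library; in particular, drawing each value independently (so duplicate settings are possible) is a simplifying assumption and may differ from the library's sampler.

```python
import random


def sample_param_settings(param_grid, n_iter, random_state=None):
    """Draw n_iter parameter settings from a {'name': [values]} grid."""
    rng = random.Random(random_state)
    names = sorted(param_grid)  # fixed order keeps the draws reproducible
    return [
        {name: rng.choice(param_grid[name]) for name in names}
        for _ in range(n_iter)
    ]


def total_model_fits(n_iter, n_folds):
    """n_iter settings times n_folds folds, plus one final fit."""
    return n_iter * n_folds + 1


grid = {"gbm__ntrees": [50, 100, 200], "gbm__max_depth": [3, 5, 7, 9]}
settings = sample_param_settings(grid, n_iter=10, random_state=42)
n_fits = total_model_fits(n_iter=10, n_folds=10)  # 101
```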

Parameters:

estimator : H2OPipeline or H2OEstimator

The estimator to fit. Either a skutil.h2o.H2OPipeline or an H2OEstimator. If the estimator is a pipeline, it must contain an estimator as the final step.

param_grid : dict

The hyper parameter grid over which to search. If estimator is a skutil.h2o.H2OPipeline, the param_grid should be in the form of {'stepname__param':[values]}; if there are no named steps (i.e., if estimator is an H2OEstimator), param_grid should be in the form of {'param':[values]}. Note that a param_grid with named step parameters in the absence of named steps will raise an error.

feature_names : iterable (str)

The list of feature names on which to fit

target_feature : str

The name of the target

n_iter : int, optional (default=10)

The number of iterations to fit. Note that n_iter * cv.get_n_splits() models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one final fit of the best model on the entire dataset) will be fit.

random_state : int, optional (default=None)

The random state for the search

scoring : str, optional (default=’lift’)

A valid scoring metric, e.g., “accuracy_score”. See skutil.h2o.grid_search.SCORERS for a comprehensive list.

scoring_params : dict, optional (default=None)

Any kwargs to be passed to the scoring function for scoring at each iteration.

cv : int or H2OCrossValidator, optional (default=5)

The number of folds to be fit for cross validation. Note that n_iter * cv.get_n_splits() models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one final fit of the best model on the entire dataset) will be fit.

verbose : int, optional (default=0)

The level of verbosity: 0, 1, or 2 or greater. A verbosity of 0 will produce no output other than the default H2O fit/predict output. A verbosity of 1 will print the selected parameters at each fold and iteration, and a verbosity of 2 or greater will produce all of the aforementioned output plus the intermediate fold scores.

iid : bool, optional (default=True)

Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.

validation_frame : H2OFrame, optional (default=None)

If provided, the frame on which to score at the end of all of the model fits. Note that this will NOT be used in the actual model selection process.

minimize : str, optional (default=’bias’)

How the search selects the best model to fit on the entire dataset. One of {‘bias’, ‘variance’}. The default behavior is ‘bias’, which is also the default behavior of sklearn. This will select the set of hyper parameters which maximizes the mean cross validation score. Alternatively, ‘variance’ will select the model which minimizes the standard deviation of the cross validation scores.

New in version 0.1.0.
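
The two settings of the minimize parameter can be contrasted with a short sketch. The helper `select_best` and the candidate labels are hypothetical, and the use of the population standard deviation is an assumption; the sketch only shows how the same fold scores can yield different winners.

```python
from statistics import mean, pstdev


def select_best(cv_scores, minimize="bias"):
    """Pick a candidate from {label: [fold scores]}."""
    if minimize == "bias":
        # highest mean cross validation score
        return max(cv_scores, key=lambda label: mean(cv_scores[label]))
    if minimize == "variance":
        # most consistent scores across folds
        return min(cv_scores, key=lambda label: pstdev(cv_scores[label]))
    raise ValueError("minimize must be 'bias' or 'variance'")


cv_scores = {
    "candidate_a": [0.90, 0.70, 0.95],  # higher mean, less stable
    "candidate_b": [0.80, 0.81, 0.79],  # lower mean, more stable
}
best_by_bias = select_best(cv_scores, "bias")          # 'candidate_a'
best_by_variance = select_best(cv_scores, "variance")  # 'candidate_b'
```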

Methods

download_pojo(\*args, \*\*kwargs) This method is injected at runtime if the best_estimator_ is an instance of an H2OEstimator.
fit(frame) Fit the grid search.
fit_predict(frame) First, fits the grid search and then generates predictions on the training frame using the best_estimator_.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OSearchCV from disk.
plot(timestep, metric) Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
predict(\*args, \*\*kwargs) After the grid search is fit, generates predictions on the test frame using the best_estimator_.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
score(frame) After the grid search is fit, generates and scores the predictions of the best_estimator_.
set_params(\*\*params) Set the parameters of this estimator.
varimp(\*args, \*\*kwargs) Get the variable importance, if the final estimator implements such a function.
fit(frame)[source]

Fit the grid search.

Parameters:

frame : H2OFrame, shape=(n_samples, n_features)

The training frame on which to fit.

class skutil.h2o.H2OSafeOneHotEncoder(feature_names=None, target_feature=None, exclude_features=None, drop_after_encoded=True)[source]

Bases: skutil.h2o.base.BaseH2OTransformer

Given a set of feature_names, one-hot encodes (dummies) a set of vecs into an expanded set of dummied columns. Will drop the original columns after transformation, unless otherwise specified.

Parameters:

feature_names : array_like (str) shape=(n_features,), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : array_like (str) shape=(n_features,), optional (default=None)

Any names that should be excluded from feature_names

drop_after_encoded : bool (default=True)

Whether to drop the original columns after transform

New in version 0.1.0.
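
The effect of drop_after_encoded can be seen in a minimal, H2O-free sketch of one-hot encoding. The function `one_hot_encode` operates on a list of dicts as a hypothetical stand-in for an H2OFrame, and the `feature.level` naming of the dummy columns is an assumption made for the example.

```python
def one_hot_encode(rows, feature_names, drop_after_encoded=True):
    """Expand each named feature into one 0/1 column per observed level."""
    levels = {f: sorted({row[f] for row in rows}) for f in feature_names}
    encoded = []
    for row in rows:
        # optionally drop the original (un-encoded) columns
        new_row = {
            key: value for key, value in row.items()
            if not (drop_after_encoded and key in levels)
        }
        for feature, feature_levels in levels.items():
            for level in feature_levels:
                new_row[f"{feature}.{level}"] = int(row[feature] == level)
        encoded.append(new_row)
    return encoded


rows = [
    {"color": "red", "n": 1},
    {"color": "blue", "n": 2},
    {"color": "red", "n": 3},
]
dropped = one_hot_encode(rows, ["color"])
kept = one_hot_encode(rows, ["color"], drop_after_encoded=False)
```

In `dropped`, each row carries only `n`, `color.blue` and `color.red`; in `kept`, the original `color` column is retained alongside the dummies.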

Methods

fit(X) Fit the one hot encoder.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
transform(X) Transform a new frame after fit.
fit(X)[source]

Fit the one hot encoder.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training frame to fit

Returns:

self :

transform(X)[source]

Transform a new frame after fit.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The frame to transform

Returns:

X : H2OFrame, shape=(n_samples, n_features)

The transformed H2OFrame

class skutil.h2o.H2OSelectiveImputer(feature_names=None, target_feature=None, exclude_features=None, def_fill='mean')[source]

Bases: skutil.h2o.transform._H2OBaseImputer

The selective imputer provides extreme flexibility and simplicity in imputation tasks. Rather than imposing one strategy across an entire frame, different strategies can be mapped to respective features.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

def_fill : str, int or iterable, optional (default=’mean’)

The fill strategy. If an int, the int value will be applied to all missing values in the H2OFrame. If a string, must be one of (‘mean’, ‘median’, ‘mode’); note that ‘mode’ is still under development. If an iterable (list, tuple, array, etc.), the length must match the column dimensions. If a dict, each strategy will be applied to the column to which it is mapped.

Attributes:

fill_val_ : int, float or iterable

The fill value(s) provided or derived in the fit method.

New in version 0.1.0.
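
The different forms of def_fill can be illustrated with a sketch that resolves a fill value per column. The helper `resolve_fill_values` is hypothetical, represents missing values as `None`, and leaves out ‘mode’; it is not the imputer's actual fit logic.

```python
from statistics import mean, median

STRATEGIES = {"mean": mean, "median": median}


def resolve_fill_values(columns, def_fill="mean"):
    """Return {column name: fill value} for a dict of name -> values."""
    names = list(columns)
    if isinstance(def_fill, dict):
        # strategies or values mapped to specific columns
        per_column = def_fill
    elif isinstance(def_fill, (list, tuple)):
        if len(def_fill) != len(names):
            raise ValueError("def_fill length must match the number of columns")
        per_column = dict(zip(names, def_fill))
    else:
        # a single strategy or value applied to every column
        per_column = {name: def_fill for name in names}

    fills = {}
    for name, fill in per_column.items():
        if isinstance(fill, str):
            observed = [v for v in columns[name] if v is not None]
            fills[name] = STRATEGIES[fill](observed)
        else:
            fills[name] = fill
    return fills


columns = {"a": [1.0, None, 3.0], "b": [10.0, 20.0, None, 60.0]}
by_mean = resolve_fill_values(columns, "mean")
by_dict = resolve_fill_values(columns, {"a": "median", "b": -1})
```

Here `by_mean` is `{'a': 2.0, 'b': 30.0}` and `by_dict` is `{'a': 2.0, 'b': -1}`.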

Methods

fit(X) Fit the imputer.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
transform(X) Transform an H2OFrame given the fit imputer.
fit(X)[source]

Fit the imputer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

self :

transform(X)[source]

Transform an H2OFrame given the fit imputer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The test data to transform.

Returns:

X : H2OFrame, shape=(n_samples, n_features)

The transformed (imputed) test data.

class skutil.h2o.H2OSelectiveScaler(feature_names=None, target_feature=None, exclude_features=None, with_mean=True, with_std=True)[source]

Bases: skutil.h2o.base.BaseH2OTransformer

A class that will scale selected features in the H2OFrame.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

with_mean : bool, optional (default=True)

Whether to subtract the column mean.

with_std : bool, optional (default=True)

Whether to divide by the column standard deviation.

Attributes:

means : dict (string:float)

The mapping of column names to column means

stds : dict (string:float)

The mapping of column names to column standard deviations

New in version 0.1.0.
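
Selective scaling can be sketched in plain Python as follows. The helpers `fit_scaler` and `scale` are hypothetical, operate on a dict of column lists rather than an H2OFrame, and use the sample standard deviation, which is an assumption; a zero standard deviation is not handled.

```python
from statistics import mean, stdev


def fit_scaler(columns, feature_names):
    """Learn per-column means and standard deviations for the chosen columns."""
    means = {name: mean(columns[name]) for name in feature_names}
    stds = {name: stdev(columns[name]) for name in feature_names}
    return means, stds


def scale(columns, means, stds, with_mean=True, with_std=True):
    """Scale only the fitted columns; all others pass through unchanged."""
    scaled = dict(columns)
    for name in means:
        center = means[name] if with_mean else 0.0
        spread = stds[name] if with_std else 1.0
        scaled[name] = [(value - center) / spread for value in columns[name]]
    return scaled


columns = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
means, stds = fit_scaler(columns, ["x"])
scaled = scale(columns, means, stds)
# scaled["x"] == [-1.0, 0.0, 1.0]; scaled["y"] is untouched
```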

Methods

fit(X) Fit the transformer.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
transform(X) Do the transformation
fit(X)[source]

Fit the transformer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit

transform(X)[source]

Do the transformation

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The test data to transform

Returns:

frame : H2OFrame, shape=(n_samples, n_features)

The transformed test data.

class skutil.h2o.H2OShuffleSplit(n_splits=2, test_size=0.1, train_size=None, random_state=None)[source]

Bases: skutil.h2o.split.H2OBaseShuffleSplit

Default shuffle splitter used for h2o_train_test_split. This shuffle split class will not perform any stratification, and will simply shuffle indices and split into the number of specified sub-frames.

Methods

get_n_splits() Get the number of splits or folds for this instance of the shuffle split.
split(frame[, y]) Split the frame.
class skutil.h2o.H2OSparseFeatureDropper(feature_names=None, target_feature=None, exclude_features=None, threshold=0.5)[source]

Bases: skutil.h2o.select.BaseH2OFeatureSelector

Retains features that are less sparse (NA) than the provided threshold.

Parameters:

feature_names : array_like (str), optional (default=None)

The list of names on which to fit the transformer.

target_feature : str, optional (default=None)

The name of the target feature (is excluded from the fit) for the estimator.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

threshold : float, optional (default=0.5)

The threshold of sparsity above which to drop

Attributes:

sparsity_ : array_like, (n_cols,)

The array of sparsity values

drop_ : array_like

The array of column names to drop

New in version 0.1.0.
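
A minimal sketch of the sparsity check, assuming sparsity means the fraction of missing (`None`) values in a column and that a column is dropped only when its sparsity is strictly above the threshold. The helper `sparse_columns` is hypothetical and works on a dict of column lists.

```python
def sparse_columns(columns, threshold=0.5):
    """Return (sparsity per column, names of columns to drop)."""
    sparsity = {
        name: sum(value is None for value in values) / len(values)
        for name, values in columns.items()
    }
    drop = [name for name, frac in sparsity.items() if frac > threshold]
    return sparsity, drop


columns = {
    "dense": [1, 2, 3, 4],
    "half": [1, None, 3, None],
    "sparse": [None, None, None, 4],
}
sparsity, drop = sparse_columns(columns, threshold=0.5)
# sparsity == {'dense': 0.0, 'half': 0.5, 'sparse': 0.75}; drop == ['sparse']
```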

Methods

fit(X) Fit the H2OTransformer.
fit_transform(frame) Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
transform(X) Transform the test frame, after fitting the transformer.
fit(X)[source]

Fit the H2OTransformer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The training data on which to fit.

Returns:

self :

class skutil.h2o.H2OStratifiedKFold(n_folds=3, shuffle=False, random_state=None)[source]

Bases: skutil.h2o.split._H2OBaseKFold

K-folds cross-validator for an H2OFrame with stratified splits.

Parameters:

n_folds : int, optional (default=3)

The number of splits

shuffle : bool, optional (default=False)

Whether to shuffle indices

random_state : int or RandomState, optional (default=None)

The random state for the split

Methods

get_n_splits() Get the number of splits or folds.
split(frame, y) Split the frame with stratification.
split(frame, y)[source]

Split the frame with stratification.

Parameters:

frame : H2OFrame

The frame to split

y : string

The column to stratify.
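
One simple way to obtain stratified folds is to deal the row indices of each class round-robin across the folds. The generator below is a hypothetical, H2O-free illustration of that idea operating on a plain list of labels; the library's actual splitting strategy may differ.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_folds=3, shuffle=False, random_state=None):
    """Yield (train_indices, test_indices) pairs with per-class balance."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(random_state)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        if shuffle:
            rng.shuffle(indices)
        # deal each class's rows round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


labels = [0] * 6 + [1] * 3
splits = list(stratified_kfold_indices(labels, n_folds=3))
```

With six records of class 0 and three of class 1, each of the three test folds receives two class-0 rows and one class-1 row.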

class skutil.h2o.H2OStratifiedShuffleSplit(n_splits=2, test_size=0.1, train_size=None, random_state=None)[source]

Bases: skutil.h2o.split.H2OBaseShuffleSplit

Shuffle splitter used for h2o_train_test_split when stratified option is specified. This shuffle split class will perform stratification.

Methods

get_n_splits() Get the number of splits or folds for this instance of the shuffle split.
split(frame, y) Split the frame with stratification.
split(frame, y)[source]

Split the frame with stratification.

Parameters:

frame : H2OFrame

The frame to split

y : string

The column to stratify.

class skutil.h2o.H2OUndersamplingClassBalancer(target_feature, ratio=0.2, shuffle=True)[source]

Bases: skutil.h2o.balance._BaseH2OBalancer

Undersample the majority class until it is represented at the target proportion to the most-represented minority class.

Parameters:

target_feature : str

The name of the response column. The response column must contain more than one class and fewer than skutil.preprocessing.balance.BalancerMixin._max_classes classes.

ratio : float, optional (default=0.2)

The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input frame.

shuffle : bool, optional (default=True)

Whether or not to shuffle rows on return

Examples

Consider the following example: with a ratio of 0.5, the majority class (0) will be undersampled until the second most-populous class (1) is represented at a ratio of 0.5.

>>> def example():
...     import h2o
...     import pandas as pd
...     import numpy as np
...     from skutil.h2o.frame import value_counts
...     from skutil.h2o import from_pandas, H2OUndersamplingClassBalancer
...
...     # initialize h2o
...     h2o.init()
...
...     # build the imbalanced data in pandas: 100 zeros, 30 ones, 25 twos
...     x = pd.DataFrame(np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2]), columns=['A'])
...
...     # load into h2o
...     X = from_pandas(x)
...
...     # initialize sampler
...     sampler = H2OUndersamplingClassBalancer(target_feature="A", ratio=0.5)
...
...     # do balancing and return the class counts
...     X_balanced = sampler.balance(X)
...     return value_counts(X_balanced)
>>>
>>> example()
0    60
1    30
2    10
Name: A, dtype: int64

New in version 0.1.0.
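
The size to which the majority class is reduced follows from the ratio and the count of the most-represented minority class. The helper `undersampled_majority_size` below is a hypothetical sketch of that arithmetic only; it says nothing about how the other classes are treated, and truncating with `int` is an assumption.

```python
from collections import Counter


def undersampled_majority_size(labels, ratio):
    """Return the number of majority-class records to keep."""
    (_, majority_n), (_, runner_up_n) = Counter(labels).most_common(2)
    if runner_up_n / majority_n >= ratio:
        # already at or above the target ratio: nothing to remove
        return majority_n
    return int(runner_up_n / ratio)


labels = [0] * 100 + [1] * 30 + [2] * 25
keep = undersampled_majority_size(labels, ratio=0.5)  # 60
```

With 30 records in the most-represented minority class and a ratio of 0.5, the majority class is cut from 100 to 60 records.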

Methods

balance(X) Apply the undersampling balance operation.
get_params([deep]) Get parameters for this estimator.
load(location) Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]) Saves the BaseH2OFunctionWrapper to disk.
set_params(\*\*params) Set the parameters of this estimator.
balance(X)[source]

Apply the undersampling balance operation. Undersamples the majority class to the provided ratio of minority class(es) : majority class.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The imbalanced dataset.

Returns:

Xb : H2OFrame, shape=(n_samples, n_features)

The balanced H2OFrame

exception skutil.h2o.NAWarning[source]

Bases: exceptions.UserWarning

Custom warning used to notify the user that an NA exists within an h2o frame (h2o can handle NA values).

class skutil.h2o.VizMixin[source]

This mixin class provides the interface to plot an H2OEstimator’s fit performance over a timestep. Any structure that wraps an H2OEstimator’s fitting functionality should derive from this mixin.

Methods

plot(timestep, metric) Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
plot(timestep, metric)[source]

Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).

Parameters:

timestep : str

A timestep as defined in the H2O API. One of (“AUTO”, “duration”, “number_of_trees”).

metric : str

The performance metric to evaluate. One of (“log_likelihood”, “objective”, “MSE”, “AUTO”)

skutil.h2o.as_series(x)[source]

Make a 1d H2OFrame into a pd.Series.

Parameters:

x : H2OFrame, shape=(n_samples, 1)

The H2OFrame

Returns:

x : Pandas Series, shape=(n_samples,)

The pandas series

skutil.h2o.check_cv(cv=3)[source]

Checks the cv parameter to determine whether it’s a valid int or H2OBaseCrossValidator.

Parameters:

cv : int or H2OBaseCrossValidator, optional (default=3)

The number of folds or the H2OBaseCrossValidator instance.

Returns:

cv : H2OBaseCrossValidator

The instance of H2OBaseCrossValidator

skutil.h2o.check_frame(X, copy=False)[source]

Returns X if X is an H2OFrame else raises a TypeError. If copy is True, will return a copy of X instead.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The frame to evaluate

copy : bool, optional (default=False)

Whether to return a copy of the H2OFrame.

Returns:

X : H2OFrame, shape=(n_samples, n_features)

The frame or the copy

skutil.h2o.check_version(min_version, max_version)[source]

Ensures the currently installed/running version of h2o is compatible with the min_version and max_version the function in question calls for.

Parameters:

min_version : str, float

The minimum version of h2o that is compatible with the transformer

max_version : str, float

The maximum version of h2o that is compatible with the transformer
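
The following is a minimal sketch of the kind of comparison such a check performs, not skutil's actual implementation. The helper names `parse_version` and `version_is_compatible` are hypothetical, and the sketch assumes that `'any'` means "no lower bound" and `None` means "no upper bound", in line with the defaults documented for the transformers.

```python
def parse_version(v):
    """Turn '3.10.0.7' (or a float such as 3.8) into a tuple of ints."""
    return tuple(int(part) for part in str(v).split("."))


def version_is_compatible(current, min_version="any", max_version=None):
    """Return True if `current` lies within [min_version, max_version].

    Assumption: 'any' disables the lower bound and None disables the
    upper bound. Hypothetical helper, for illustration only.
    """
    cur = parse_version(current)
    if min_version != "any" and cur < parse_version(min_version):
        return False
    if max_version is not None and cur > parse_version(max_version):
        return False
    return True


ok = version_is_compatible("3.10.0.7", min_version="3.8.2.9")
too_new = version_is_compatible("3.8.2.9", max_version="3.8.2.5")
```

Comparing integer tuples rather than raw strings matters here: as strings, "3.10" sorts before "3.8". Note also that a float such as 3.10 is indistinguishable from 3.1, so string versions are the safer input.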

skutil.h2o.from_array(X, column_names=None)[source]

A simple wrapper for H2OFrame.from_python. This takes a numpy array (or 2d array) and returns an H2OFrame with all the default args.

Parameters:

X : ndarray

The array to convert.

column_names : list, tuple (default=None)

The names to use for your columns

Returns:

H2OFrame :

skutil.h2o.from_pandas(X)[source]

A simple wrapper for H2OFrame.from_python. This takes a pandas dataframe and returns an H2OFrame with all the default args (generally enough) plus named columns.

Parameters:

X : pd.DataFrame

The dataframe to convert.

Returns:

H2OFrame :

skutil.h2o.h2o_accuracy_score(y_actual, y_predict, normalize=True, sample_weight=None, y_type=None)[source]

Accuracy classification score for H2O

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

normalize : bool, optional (default=True)

Whether to average the data, i.e. return the fraction of correctly classified samples rather than their count

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

score : float

New in version 0.1.0.
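
As an illustration of what the score measures, here is a plain-Python sketch on lists. It is not the distributed H2O implementation, and it assumes `normalize` and `sample_weight` follow the conventions of `sklearn.metrics.accuracy_score`.

```python
def accuracy(y_actual, y_predict, normalize=True, sample_weight=None):
    """Accuracy on plain Python lists (illustrative only)."""
    if len(y_actual) != len(y_predict):
        raise ValueError("y_actual and y_predict must have the same length")
    if sample_weight is None:
        sample_weight = [1.0] * len(y_actual)
    # Sum the weights of the correctly classified samples.
    correct = sum(w for a, p, w in zip(y_actual, y_predict, sample_weight)
                  if a == p)
    if normalize:
        # Assumed convention: fraction of (weighted) correct predictions.
        return correct / sum(sample_weight)
    return correct


fraction = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
count = accuracy([0, 1, 1, 0], [0, 1, 0, 0], normalize=False)
```

Three of the four predictions match, so `fraction` is 0.75 and `count` is 3.0.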

skutil.h2o.h2o_auc_score(y_actual, y_predict, average='macro', sample_weight=None, y_type=None)[source]

Compute Area Under the Curve (AUC) using the trapezoidal rule. This implementation is restricted to the binary classification task or multilabel classification task in label indicator format.

NOTE: using H2OFrames, this would require moving the predict vector locally for each task in the average binary score task. It’s more efficient simply to bring both vectors local, and then use the sklearn AUC score. That’s what we’ll do for now.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

average : string, optional (default=’macro’)

One of [None, ‘micro’, ‘macro’ (default), ‘samples’, ‘weighted’]. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'micro':

Calculate metrics globally by considering each element of the label indicator matrix as a label.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).

'samples':

Calculate metrics for each instance, and find their average.

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

auc : float

New in version 0.1.6.
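
To make "AUC using the trapezoidal rule" concrete, the sketch below computes the binary ROC AUC on plain lists. It is illustrative only: it assumes label 1 is the positive class and that `y_score` holds the score of that class, and it ignores averaging and sample weights.

```python
def roc_auc_trapezoid(y_actual, y_score):
    """Binary ROC AUC via the trapezoidal rule (illustrative sketch)."""
    n_pos = sum(1 for y in y_actual if y == 1)
    n_neg = len(y_actual) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # Walk the samples from highest to lowest score.
    pairs = sorted(zip(y_score, y_actual), key=lambda t: -t[0])
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))

    # Trapezoidal rule over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


auc = roc_auc_trapezoid([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For this input the ROC points are (0, 0), (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an area of 0.75.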

skutil.h2o.h2o_bincount(bins, weights=None, minlength=None)[source]

Given a 1d column of non-negative ints, bins, return a np.ndarray of positional counts of each int.

Parameters:

bins : H2OFrame

The values

weights : list or H2OFrame, optional (default=None)

The weights with which to weight the output

minlength : int, optional (default=None)

The min length of the output array
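
The "positional counts" can be sketched on a plain list as follows. This is an illustration that assumes semantics similar to `numpy.bincount`; the real function operates on an H2OFrame column and returns an `np.ndarray`.

```python
def bincount(bins, weights=None, minlength=None):
    """Count occurrences of each non-negative int in `bins` (sketch)."""
    if any(b < 0 for b in bins):
        raise ValueError("bins must contain non-negative integers")
    size = max(bins) + 1 if bins else 0
    if minlength is not None:
        size = max(size, minlength)
    counts = [0] * size
    for i, b in enumerate(bins):
        # Position b accumulates either 1 or the sample's weight.
        counts[b] += 1 if weights is None else weights[i]
    return counts


plain = bincount([0, 1, 1, 3])
padded = bincount([0, 1, 1, 3], minlength=6)
weighted = bincount([0, 1, 1, 3], weights=[0.5, 1.0, 1.0, 2.0])
```

Position `i` of the result holds the count (or summed weight) of the value `i`, so values that never occur yield zeros.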

skutil.h2o.h2o_col_to_numpy(column)[source]

Return a 1d numpy array from a single H2OFrame column.

Parameters:

column : H2OFrame column, shape=(n_samples, 1)

A column from an H2OFrame

Returns:

np.ndarray, shape=(n_samples,) :

skutil.h2o.h2o_corr_plot(X, plot_type='cor', cmap='Blues_d', n_levels=5, figsize=(11, 9), cmap_a=220, cmap_b=10, vmax=0.3, xticklabels=5, yticklabels=5, linewidths=0.5, cbar_kws={'shrink': 0.5}, use='complete.obs', na_warn=True, na_rm=False)[source]

Create a simple correlation plot given a dataframe. Note that this requires all datatypes to be numeric and finite!

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The H2OFrame

plot_type : str, optional (default=’cor’)

The type of plot, one of (‘cor’, ‘kde’, ‘pair’)

cmap : str, optional (default=’Blues_d’)

The color to use for the kernel density estimate plot if plot_type == ‘kde’

n_levels : int, optional (default=5)

The number of levels to use for the kde plot if plot_type == ‘kde’

figsize : tuple (int), optional (default=(11,9))

The size of the image

cmap_a : int, optional (default=220)

The colormap start point

cmap_b : int, optional (default=10)

The colormap end point

vmax : float, optional (default=0.3)

Arg for seaborn heatmap

xticklabels : int, optional (default=5)

The spacing for X ticks

yticklabels : int, optional (default=5)

The spacing for Y ticks

linewidths : float, optional (default=0.5)

The width of the lines

cbar_kws : dict, optional

Any KWs to pass to seaborn’s heatmap when plot_type = ‘cor’

use : str, optional (default=’complete.obs’)

The “use” to compute the correlation matrix

na_warn : bool, optional (default=True)

Whether to warn in the presence of NA values

na_rm : bool, optional (default=False)

Whether to remove NAs

skutil.h2o.h2o_f1_score(y_actual, y_predict, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]

Compute the F1 score, the harmonic mean of the precision and the recall:

F1 = 2 * (precision * recall) / (precision + recall)

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

labels : list, optional (default=None)

The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.

pos_label : str or int, optional (default=1)

The class to report if average=='binary' and the data is binary. If the data are multiclass, this will be ignored.

average : str, optional (default=’binary’)

One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

sample_weight : H2OFrame or float, optional (default=None)

The sample weights

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

f : float

The F-1 score

New in version 0.1.0.
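
The averaging modes are easiest to compare on a small worked example. The sketch below uses made-up per-label counts of true positives, false positives and false negatives (the `counts` dictionary is hypothetical data) and plain Python rather than H2OFrames.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*P*R / (P + R), which simplifies to 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-label counts: label -> (tp, fp, fn)
counts = {"a": (8, 2, 2), "b": (1, 1, 3), "c": (5, 3, 1)}

per_label = {k: f1_from_counts(*v) for k, v in counts.items()}
support = {k: tp + fn for k, (tp, fp, fn) in counts.items()}

# 'macro': unweighted mean of the per-label scores.
macro = sum(per_label.values()) / len(per_label)

# 'weighted': mean of the per-label scores, weighted by support.
weighted = (sum(per_label[k] * support[k] for k in counts)
            / sum(support.values()))

# 'micro': pool the counts across labels, then score once.
tp_all = sum(v[0] for v in counts.values())
fp_all = sum(v[1] for v in counts.values())
fn_all = sum(v[2] for v in counts.values())
micro = f1_from_counts(tp_all, fp_all, fn_all)
```

Here the poorly predicted label "b" drags the macro average down the most, because every label counts equally regardless of its support.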

skutil.h2o.h2o_f_classif(X, feature_names, target_feature)[source]

Compute the ANOVA F-value for the provided sample. This method is adapted from sklearn.feature_selection.f_classif to function on H2OFrames.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The feature matrix. Each feature will be tested sequentially.

feature_names : array_like (str)

The list of names on which to fit the transformer.

target_feature : str

The name of the target feature (is excluded from the fit) for the estimator.

Returns:

f : float

The computed F-value of the test.

prob : float

The associated p-value from the F-distribution.

New in version 0.1.2.

skutil.h2o.h2o_f_oneway(*args)[source]

Performs a 1-way ANOVA. The one-way ANOVA tests the null hypothesis that 2 or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.

Parameters:

sample1, sample2, ... : array_like, H2OFrames, shape=(n_classes,)

The sample measurements should be given as varargs (*args). A slice of the original input frame for each class in the target feature.

Returns:

f : float

The computed F-value of the test.

prob : float

The associated p-value from the F-distribution.

Notes

The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid.

  1. The samples are independent
  2. Each sample is from a normally distributed population
  3. The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

If these assumptions are not true for a given set of data, it may still be possible to use the Kruskal-Wallis H-test (scipy.stats.kruskal) although with some loss of power.

The algorithm is from Heiman [R6], pp.394-7. See scipy.stats.f_oneway and sklearn.feature_selection.f_oneway.

References

[R5] Lowry, Richard. “Concepts and Applications of Inferential Statistics”. Chapter 14. http://faculty.vassar.edu/lowry/ch14pt1.html
[R6] Heiman, G.W. Research Methods in Statistics. 2002.

New in version 0.1.2.
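
For reference, the F-value of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it on plain lists; it is the textbook formula, not the H2O-backed implementation, and it omits the p-value, which requires the survival function of the F-distribution (for example from scipy.stats).

```python
def f_oneway_statistic(*samples):
    """One-way ANOVA F-value for two or more groups (illustrative sketch)."""
    k = len(samples)
    if k < 2:
        raise ValueError("at least two groups are required")
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]

    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((x - m) ** 2
                    for s, m in zip(samples, means) for x in s)

    df_between = k - 1
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)


f_value = f_oneway_statistic([1, 2, 3], [2, 3, 4], [5, 6, 7])
```

With group means 2, 3 and 6, the between-group sum of squares is 26 and the within-group sum of squares is 6, so the F-value is (26 / 2) / (6 / 6) = 13.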

skutil.h2o.h2o_fbeta_score(y_actual, y_predict, beta, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]

Compute the F-beta score. The F-beta score is the weighted harmonic mean of precision and recall.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

beta : float

The beta value for the F-score

labels : list, optional (default=None)

The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.

pos_label : str or int, optional (default=1)

The class to report if average=='binary' and the data is binary. If the data are multiclass, this will be ignored.

average : str, optional (default=’binary’)

One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

sample_weight : H2OFrame or float, optional (default=None)

The sample weights

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

f : float

The F-beta score

New in version 0.1.0.
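
The effect of `beta` can be seen from the standard definition, F-beta = (1 + beta^2) * P * R / (beta^2 * P + R). The following sketch evaluates it directly from a precision and a recall value; it is a plain-Python illustration, not the H2O implementation.

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (sketch)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Same precision and recall, three different betas.
f_half = fbeta(0.5, 1.0, beta=0.5)  # leans towards precision
f_one = fbeta(0.5, 1.0, beta=1.0)   # plain F1
f_two = fbeta(0.5, 1.0, beta=2.0)   # leans towards recall
```

With precision 0.5 and recall 1.0, the score rises with beta because larger values of beta give more weight to recall.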

skutil.h2o.h2o_frame_memory_estimate(X, bit_est=32, unit='MB')[source]

Estimate the memory footprint of an H2OFrame, for example to gauge whether it can be held in memory.

Parameters:

X : H2OFrame

The H2OFrame in question

bit_est : int, optional (default=32)

The estimated bit-size of each cell. The default assumes each cell is a signed 32-bit float

unit : str, optional (default=’MB’)

The units to report. One of (‘MB’, ‘KB’, ‘GB’, ‘TB’)

Returns:

mb : str

The estimated number of UNIT held in the frame
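
A back-of-the-envelope version of such an estimate is sketched below. It assumes the footprint is simply rows × columns × `bit_est` bits, that units are powers of 1024, and that the result is formatted as a string; the helper name and the exact output format are hypothetical and may differ from what skutil returns.

```python
def frame_memory_estimate(n_rows, n_cols, bit_est=32, unit="MB"):
    """Rough memory estimate for a dense frame (illustrative sketch)."""
    exponents = {"KB": 1, "MB": 2, "GB": 3, "TB": 4}
    if unit not in exponents:
        raise ValueError("unit must be one of %s" % sorted(exponents))
    n_bytes = n_rows * n_cols * bit_est / 8
    # Assumption: binary units, i.e. 1 KB = 1024 bytes.
    value = n_bytes / 1024 ** exponents[unit]
    return "%.3f %s" % (value, unit)


estimate = frame_memory_estimate(1000000, 10)
```

One million rows by ten columns at 32 bits per cell is 40,000,000 bytes, or roughly 38.147 MB under these assumptions.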

skutil.h2o.h2o_log_loss(y_actual, y_predict, eps=1e-15, normalize=True, sample_weight=None, y_type=None)[source]

Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is

-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))

This method is adapted from the sklearn.metrics.classification.log_loss function for use with H2OFrames in skutil.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples, [n_classes])

The predicted probabilities. Can represent a matrix. If y_predict.shape = (n_samples,), the probabilities provided are assumed to be those of the positive class. The labels in y_predict are assumed to be ordered ordinally.

eps : float, optional (default=1e-15)

Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).

normalize : bool, optional (default=True)

If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

loss : float

Notes

The logarithm used is the natural logarithm (base-e).

New in version 0.1.6.
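
The binary formula above, together with the `eps` clipping and the `normalize` switch, can be sketched in plain Python as follows. This is an illustration on lists for the binary case only, not the H2OFrame implementation.

```python
import math


def binary_log_loss(y_actual, y_prob, eps=1e-15, normalize=True):
    """Binary log loss with probability clipping (illustrative sketch)."""
    losses = []
    for yt, yp in zip(y_actual, y_prob):
        # Clip so that log(0) never occurs.
        p = max(eps, min(1 - eps, yp))
        losses.append(-(yt * math.log(p) + (1 - yt) * math.log(1 - p)))
    total = sum(losses)
    return total / len(losses) if normalize else total


mean_loss = binary_log_loss([1, 0], [0.9, 0.1])
summed_loss = binary_log_loss([1, 0], [0.9, 0.1], normalize=False)
# A confidently wrong prediction stays finite thanks to the clipping.
clipped = binary_log_loss([0], [1.0])
```

Both samples contribute -log(0.9), so `summed_loss` is exactly twice `mean_loss`.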

skutil.h2o.h2o_mean_absolute_error(y_actual, y_predict, sample_weight=None, y_type=None)[source]

Mean absolute error score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

score : float

The mean absolute error

New in version 0.1.0.
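
As a point of reference for the quantity being computed, here is a plain-Python sketch of the (optionally weighted) mean absolute error. It works on lists held in local memory, unlike the distributed H2O version, and it assumes `sample_weight` is a per-sample sequence.

```python
def mean_absolute_error(y_actual, y_predict, sample_weight=None):
    """Mean of |actual - predicted|, optionally weighted (sketch)."""
    errors = [abs(a - p) for a, p in zip(y_actual, y_predict)]
    if sample_weight is None:
        return sum(errors) / len(errors)
    # Weighted average of the absolute errors.
    weighted_sum = sum(e * w for e, w in zip(errors, sample_weight))
    return weighted_sum / sum(sample_weight)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mae = mean_absolute_error(y_true, y_pred)
mae_weighted = mean_absolute_error(y_true, y_pred, sample_weight=[1, 1, 1, 3])
```

The absolute errors are 0.5, 0.5, 0 and 1, so the unweighted mean is 0.5; tripling the weight of the last sample raises it to 4 / 6.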

skutil.h2o.h2o_mean_squared_error(y_actual, y_predict, sample_weight=None, y_type=None)[source]

Mean squared error score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

score : float

New in version 0.1.0.

skutil.h2o.h2o_median_absolute_error(y_actual, y_predict, sample_weight=None, y_type=None)[source]

Median absolute error score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

score : float

The median absolute error score

New in version 0.1.0.

skutil.h2o.h2o_precision_score(y_actual, y_predict, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]

Compute the precision. Precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

labels : list, optional (default=None)

The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.

pos_label : str or int, optional (default=1)

The class to report if average=='binary' and the data is binary. If the data are multiclass, this will be ignored.

average : str, optional (default=’binary’)

One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

sample_weight : H2OFrame or float, optional (default=None)

The sample weights

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

p : float

The precision score

New in version 0.1.0.

skutil.h2o.h2o_r2_score(y_actual, y_predict, sample_weight=None, y_type=None)[source]

R^2 score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

sample_weight : H2OFrame or float, optional (default=None)

A frame of sample weights of matching dims with y_actual and y_predict.

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

score : float

The R^2 score

New in version 0.1.0.
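
The R^2 score is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal unweighted sketch on plain lists, for illustration only:

```python
def r2_score(y_actual, y_predict):
    """Coefficient of determination, unweighted (illustrative sketch)."""
    mean = sum(y_actual) / len(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predict))
    ss_tot = sum((a - mean) ** 2 for a in y_actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant y_actual")
    return 1.0 - ss_res / ss_tot


r2 = r2_score([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
```

Here the residual sum of squares is 1.5 and the total sum of squares is 29.1875, giving a score of about 0.9486. A perfect prediction scores 1, and a model that always predicts the mean scores 0.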

skutil.h2o.h2o_recall_score(y_actual, y_predict, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]

Compute the recall. Recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.

Parameters:

y_actual : H2OFrame, shape=(n_samples,)

The one-dimensional ground truth

y_predict : H2OFrame, shape=(n_samples,)

The one-dimensional predicted labels

labels : list, optional (default=None)

The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.

pos_label : str or int, optional (default=1)

The class to report if average=='binary' and the data is binary. If the data are multiclass, this will be ignored.

average : str, optional (default=’binary’)

One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

sample_weight : H2OFrame, optional (default=None)

The sample weights

y_type : string, optional (default=None)

The type of the column. If None, will be determined.

Returns:

r : float

The recall score

New in version 0.1.0.
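
Because precision and recall are easy to confuse, the sketch below computes both for the binary case from plain lists, using `pos_label` to select the class of interest. It illustrates the definitions and is not the H2O implementation.

```python
def precision_recall(y_actual, y_predict, pos_label=1):
    """Binary precision and recall for `pos_label` (illustrative sketch)."""
    pairs = list(zip(y_actual, y_predict))
    tp = sum(1 for a, p in pairs if a == pos_label and p == pos_label)
    fp = sum(1 for a, p in pairs if a != pos_label and p == pos_label)
    fn = sum(1 for a, p in pairs if a == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0  # tp / (tp + fp)
    recall = tp / (tp + fn) if tp + fn else 0.0     # tp / (tp + fn)
    return precision, recall


precision, recall = precision_recall([1, 1, 1, 1, 0, 0],
                                     [1, 0, 0, 1, 1, 0])
```

Two of the four actual positives are found (recall 0.5), while two of the three positive predictions are correct (precision 2/3).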

skutil.h2o.h2o_train_test_split(frame, test_size=None, train_size=None, random_state=None, stratify=None)[source]

Splits an H2OFrame into random train and test subsets

Parameters:

frame : H2OFrame

The h2o frame to split

test_size : float, int, or None (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25

train_size : float, int, or None (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int or RandomState

Pseudo-random number generator state used for random sampling.

stratify : str or None (default=None)

The name of the target on which to stratify the sampling

Returns:

out : tuple, shape=(2,)

training_frame : H2OFrame

The training fold split

testing_frame : H2OFrame

The testing fold split
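
The interplay of `test_size` and `train_size` described above can be sketched as a small size-resolution helper. The function name is hypothetical, and the rounding (ceiling for test, floor for train) is an assumption borrowed from the usual sklearn convention rather than something this documentation states.

```python
import math


def resolve_split_sizes(n_samples, test_size=None, train_size=None):
    """Resolve (n_train, n_test) from float, int or None sizes (sketch)."""
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when both are None

    def to_count(size, round_up):
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError("float sizes must be between 0.0 and 1.0")
            scaled = size * n_samples
            # Assumed rounding: ceil for the test set, floor for train.
            return math.ceil(scaled) if round_up else math.floor(scaled)
        return int(size)

    n_test = to_count(test_size, True) if test_size is not None else None
    n_train = to_count(train_size, False) if train_size is not None else None
    # Whichever size is missing becomes the complement of the other.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test
    if n_train + n_test > n_samples:
        raise ValueError("train and test sizes exceed the number of samples")
    return n_train, n_test


default_split = resolve_split_sizes(150)
absolute_split = resolve_split_sizes(150, test_size=30)
```

With 150 rows and no sizes given, the test set takes a quarter of the data (rounded up to 38 here) and the training set takes the remaining 112.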

skutil.h2o.is_float(x)[source]

Determine whether a 1d H2OFrame is made up of floats.

Parameters:

x : H2OFrame, shape=(n_samples, 1)

The H2OFrame

Returns:

bool : True if float, else False

skutil.h2o.is_integer(x)[source]

Determine whether a 1d H2OFrame is made up of integers.

Parameters:

x : H2OFrame, shape=(n_samples, 1)

The H2OFrame

Returns:

bool : True if integers, else False

skutil.h2o.is_numeric(x)[source]

Determine whether a 1d H2OFrame is numeric.

Parameters:

x : H2OFrame, shape=(n_samples, 1)

The H2OFrame

Returns:

bool : True if numeric, else False

skutil.h2o.load_boston_h2o(include_tgt=True, tgt_name='target', shuffle=False)[source]

Load the boston housing dataset into an H2OFrame

Parameters:

include_tgt : bool, optional (default=True)

Whether or not to include the target

tgt_name : str, optional (default=”target”)

The name of the target column.

shuffle : bool, optional (default=False)

Whether or not to shuffle the data

skutil.h2o.load_breast_cancer_h2o(include_tgt=True, tgt_name='target', shuffle=False)[source]

Load the breast cancer dataset into an H2OFrame

Parameters:

include_tgt : bool, optional (default=True)

Whether or not to include the target

tgt_name : str, optional (default=”target”)

The name of the target column.

shuffle : bool, optional (default=False)

Whether or not to shuffle the data

skutil.h2o.load_iris_h2o(include_tgt=True, tgt_name='Species', shuffle=False)[source]

Load the iris dataset into an H2OFrame

Parameters:

include_tgt : bool, optional (default=True)

Whether or not to include the target

tgt_name : str, optional (default=”Species”)

The name of the target column.

shuffle : bool, optional (default=False)

Whether or not to shuffle the data

skutil.h2o.make_h2o_scorer(score_function, y_actual)[source]

Make a scoring function from a callable. The signature for the callable should resemble:

some_function(y_actual=y_actual, y_predict=y_pred, y_type=None, **kwargs)

Parameters:

score_function : callable

The function

y_actual : H2OFrame, shape=(n_samples,)

A one-dimensional H2OFrame (the ground truth). This is used to determine before hand whether the type is binary or multiclass.

Returns:

score_class : _H2OScorer

An instance of _H2OScorer whose score method will be used for scoring in the skutil.h2o.grid_search module.

New in version 0.1.0.

skutil.h2o.rbind_all(*args)[source]

Given a variable set of H2OFrames, rbind all of them into a single H2OFrame.

Parameters:

array1, array2, ... : H2OFrame, shape=(n_samples, n_features)

The H2OFrames to rbind. All should match in column dimensionality.

Returns:

f : H2OFrame

The rbound H2OFrame

skutil.h2o.reorder_h2o_frame(X, idcs, from_chunks=False)[source]

Currently, H2O does not allow us to reorder frames. This is a hack to rbind rows together in the order prescribed.

Parameters:

X : H2OFrame

The H2OFrame to reorder

idcs : iterable

The order of the H2OFrame rows to be returned.

from_chunks : bool, optional (default=False)

Whether the elements in idcs are optimized chunks generated by _gen_optimized_chunks.

Returns:

new_frame : H2OFrame

The reordered H2OFrame

skutil.h2o.shuffle_h2o_frame(X)[source]

Currently, H2O does not allow us to shuffle frames. This is a hack to rbind rows together in a shuffled order.

Parameters:

X : H2OFrame

The H2OFrame to shuffle

Returns:

shuf : H2OFrame

The shuffled H2OFrame

skutil.h2o.validate_x(x)[source]

Given an iterable or None, x, validate that if it is an iterable, it only contains string types.

Parameters:

x : None or iterable, shape=(n_features,)

The feature names

Returns:

x : None or iterable, shape=(n_features,)

The feature names

skutil.h2o.validate_x_y(X, feature_names, target_feature, exclude_features=None)[source]

Validate the feature_names and target_feature arguments passed to an H2OTransformer.

Parameters:

X : H2OFrame, shape=(n_samples, n_features)

The frame from which to drop

feature_names : iterable or None

The feature names to be used in a transformer. If feature_names is None, the transformer will use all of the frame’s column names. However, if the feature_names are an iterable, they must all be either strings or unicode names of columns in the frame.

target_feature : str, unicode or None

The target name to exclude from the transformer analysis. If None, unsupervised is assumed, otherwise must be string or unicode.

exclude_features : iterable or None, optional (default=None)

Any names that should be excluded from feature_names

Returns:

feature_names : list, str

A list of the feature_names as strings

target_feature : str or None

The target_feature as a string if it is not None, else None

skutil.h2o.value_counts(x)[source]

Compute a Pandas-esque value_counts on a 1d H2OFrame.

Parameters:

x : H2OFrame, shape=(n_samples, 1)

The H2OFrame

Returns:

cts : pd.Series, shape=(n_samples,)

The pandas series