skutil.h2o module¶
skutil.h2o bridges the functionality between sklearn and H2O with custom encoders, grid search functionality, and over/undersampling class balancers.
-
class skutil.h2o.BaseH2OFeatureSelector(feature_names=None, target_feature=None, exclude_features=None, min_version='any', max_version=None)
Bases: skutil.h2o.base.BaseH2OTransformer
Base class for all H2O selectors.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
min_version : str or float, optional (default=’any’)
The minimum version of h2o that is compatible with the transformer
max_version : str or float, optional (default=None)
The maximum version of h2o that is compatible with the transformer
New in version 0.1.0.
Methods
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
-
class skutil.h2o.BaseH2OFunctionWrapper(target_feature=None, min_version='any', max_version=None)
Bases: sklearn.base.BaseEstimator
Base class for all H2O estimators or functions.
Parameters: target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit).
min_version : str or float, optional (default=’any’)
The minimum version of h2o that is compatible with the transformer
max_version : str or float, optional (default=None)
The maximum version of h2o that is compatible with the transformer
New in version 0.1.0.
Methods
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
-
static load(location)
Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk. If the instance is of a more complex class, i.e., one that contains an H2OEstimator, this method will handle loading these models separately and outside of the constraints of the pickle package.
Note that this is a static method and should be called accordingly:
>>> def load_and_transform(X):
...     from skutil.h2o.select import H2OMulticollinearityFilterer
...     mcf = H2OMulticollinearityFilterer.load(location='example/path.pkl')
...     return mcf.transform(X)
>>>
>>> load_and_transform(X)
Some classes define their own load functionality, and will not work as expected if called in the following manner:
>>> def load_pipe():
...     return BaseH2OFunctionWrapper.load('path/to/h2o/pipeline.pkl')
>>>
>>> pipe = load_pipe()
This is because of the aforementioned situation wherein some classes handle saves and loads of H2OEstimator objects differently. Thus, any class that is being loaded should be statically referenced at the lowest level of abstraction possible:
>>> def load_pipe():
...     from skutil.h2o.pipeline import H2OPipeline
...     return H2OPipeline.load('path/to/h2o/pipeline.pkl')
>>>
>>> pipe = load_pipe()
Parameters: location : str
The location where the persisted model resides.
Returns: m : BaseH2OFunctionWrapper
The unpickled instance of the model
-
max_version
Returns the max version of H2O that is compatible with the BaseH2OFunctionWrapper instance. Some classes differ in their support for H2O versions, due to changes in the underlying API.
Returns: mv : string, or None
If there is a max version associated with the BaseH2OFunctionWrapper, returns it as a string, otherwise returns None.
-
min_version
Returns the min version of H2O that is compatible with the BaseH2OFunctionWrapper instance. Some classes differ in their support for H2O versions, due to changes in the underlying API.
Returns: mv : string
If there is a min version associated with the BaseH2OFunctionWrapper, returns it as a string, otherwise returns ‘any’
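As an illustration of how a min/max version pair can gate compatibility, here is a minimal sketch. The helper names `parse_version` and `is_compatible` are hypothetical and are not part of skutil; the comparison rule (inclusive bounds, dotted integers) is an assumption.

```python
def parse_version(v):
    """Turn a dotted version such as '3.10.0.7' into a tuple of ints.

    Floats are accepted via str(), but note that 3.10 would become '3.1',
    so strings are the safer input.
    """
    return tuple(int(part) for part in str(v).split("."))


def is_compatible(current, min_version="any", max_version=None):
    """Return True if `current` lies within [min_version, max_version]."""
    cur = parse_version(current)
    if min_version != "any" and cur < parse_version(min_version):
        return False
    if max_version is not None and cur > parse_version(max_version):
        return False
    return True


print(is_compatible("3.10.0.7", min_version="3.8.2.9"))
```

With the defaults (`min_version='any'`, `max_version=None`) every version passes, which mirrors the default values documented above.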
-
save(location, warn_if_exists=True, **kwargs)
Saves the BaseH2OFunctionWrapper to disk. If the instance is of a more complex class, i.e., one that contains an H2OEstimator, this method will handle saving these models separately and outside of the constraints of the pickle package. Any keyword arguments will be passed to the _save_internal method (if it exists).
Parameters: location : str
The absolute path of the location where the transformer should be saved.
warn_if_exists : bool, optional (default=True)
Warn the user that location exists if True.
-
class skutil.h2o.BaseH2OTransformer(feature_names=None, target_feature=None, exclude_features=None, min_version='any', max_version=None)
Bases: skutil.h2o.base.BaseH2OFunctionWrapper, sklearn.base.TransformerMixin
Base class for all H2OTransformers.
Parameters: feature_names : array_like, str
The list of names on which to fit the feature selector.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names during the fit.
min_version : str or float, optional (default=’any’)
The minimum version of h2o that is compatible with the transformer
max_version : str or float, optional (default=None)
The maximum version of h2o that is compatible with the transformer
New in version 0.1.0.
Methods
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
-
class skutil.h2o.H2OFScoreKBestSelector(feature_names=None, target_feature=None, exclude_features=None, cv=3, k=10, iid=True)
Bases: skutil.h2o.one_way_fs._BaseH2OFScoreSelector
Select the top k features based on the F-score, using the h2o_f_classif method.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
cv : int or H2OBaseCrossValidator, optional (default=3)
Univariate feature selection can very easily remove features erroneously or cause overfitting. Using cross validation, we can more confidently select the features to drop.
k : int, optional (default=10)
The number of features to keep.
iid : bool, optional (default=True)
Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold
min_version : str or float, optional (default=’any’)
The minimum version of h2o that is compatible with the transformer
max_version : str or float, optional (default=None)
The maximum version of h2o that is compatible with the transformer
Attributes: scores_ : np.ndarray, float
The score array, adjusted for n_folds
p_values_ : np.ndarray, float
The p-value array, adjusted for n_folds
New in version 0.1.2.
Methods
fit(X): Fit the F-score feature selector.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
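To make the interplay of cv, iid and k concrete, the following is a rough sketch in plain Python. It assumes that with iid=True the per-fold scores are averaged with weights equal to the fold sizes; the function names are hypothetical and this is not the skutil implementation.

```python
def aggregate_fold_scores(fold_scores, fold_sizes, iid=True):
    """Combine per-fold score vectors into one score per feature.

    With iid=True each fold is weighted by its number of observations;
    otherwise every fold counts equally.
    """
    n_features = len(fold_scores[0])
    weights = fold_sizes if iid else [1] * len(fold_scores)
    total = float(sum(weights))
    return [
        sum(w * scores[j] for w, scores in zip(weights, fold_scores)) / total
        for j in range(n_features)
    ]


def select_k_best(names, scores, k=10):
    """Return the k names with the highest scores, in original column order."""
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:k])
    return [name for j, name in enumerate(names) if j in keep]


# Two folds (30 and 10 rows) scoring three features
fold_scores = [[1.0, 5.0, 3.0], [3.0, 1.0, 3.0]]
scores = aggregate_fold_scores(fold_scores, fold_sizes=[30, 10])
print(select_k_best(["a", "b", "c"], scores, k=2))
```

Weighting by fold size lets a large fold dominate a small one, which is the usual motivation for an iid-style flag.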
-
class skutil.h2o.H2OFScorePercentileSelector(feature_names=None, target_feature=None, exclude_features=None, cv=3, percentile=10, iid=True)
Bases: skutil.h2o.one_way_fs._BaseH2OFScoreSelector
Select the top percentile of features based on the F-score, using the h2o_f_classif method.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
cv : int or H2OBaseCrossValidator, optional (default=3)
Univariate feature selection can very easily remove features erroneously or cause overfitting. Using cross validation, we can more confidently select the features to drop.
percentile : int, optional (default=10)
The percent of features to keep.
iid : bool, optional (default=True)
Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold
min_version : str or float, optional (default=’any’)
The minimum version of h2o that is compatible with the transformer
max_version : str or float, optional (default=None)
The maximum version of h2o that is compatible with the transformer
Attributes: scores_ : np.ndarray, float
The score array, adjusted for n_folds
p_values_ : np.ndarray, float
The p-value array, adjusted for n_folds
New in version 0.1.2.
Methods
fit(X): Fit the F-score feature selector.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
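Percentile-based selection can be sketched in a few lines. This example assumes the number of kept features is rounded up; the library may round differently, and `select_percentile` is a hypothetical helper rather than a skutil function.

```python
import math


def select_percentile(names, scores, percentile=10):
    """Keep the top `percentile` percent of features, ranked by score."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be in [0, 100]")
    # Assumption: round the number of kept features up
    n_keep = int(math.ceil(len(names) * percentile / 100.0))
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:n_keep])
    return [name for j, name in enumerate(names) if j in keep]


names = ["f%d" % i for i in range(10)]
scores = [float(i) for i in range(10)]
print(select_percentile(names, scores, percentile=20))
```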
-
class skutil.h2o.H2OFeatureDropper(feature_names=None, target_feature=None, exclude_features=None)
Bases: skutil.h2o.select.BaseH2OFeatureSelector
A very simple class to be used at the beginning or any stage of an H2OPipeline that will drop the given features from the remainder of the pipe.
This is useful when you have many features, but only a few to drop. Rather than passing the feature_names arg as the delta between many features and the several to drop, this allows you to drop them and keep feature_names as None in future steps.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
Attributes: drop_ : list (str)
These are the features that will be dropped by the FeatureDropper
New in version 0.1.0.
Methods
fit(X): Fit the H2OTransformer.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
-
class skutil.h2o.H2OGainsRandomizedSearchCV(estimator, param_grid, feature_names, target_feature, exposure_feature, loss_feature, premium_feature=None, n_iter=10, random_state=None, scoring='lift', scoring_params=None, cv=5, verbose=0, iid=True, validation_frame=None, minimize='bias', error_score=nan, error_behavior='warn')
Bases: skutil.h2o.grid_search.H2ORandomizedSearchCV
A grid search that scores based on actuarial metrics (see skutil.metrics.GainsStatisticalReport). This is a more customized form of grid search, and must use a gains metric provided by the GainsStatisticalReport.
Parameters: estimator : H2OPipeline or H2OEstimator
The estimator to fit. Either a skutil.h2o.H2OPipeline or an H2OEstimator. If the estimator is a pipeline, it must contain an estimator as the final step.
param_grid : dict
The hyperparameter grid over which to search. If estimator is a skutil.h2o.H2OPipeline, the param_grid should be in the form of {'stepname__param':[values]}; if there are no named steps (i.e., if estimator is an H2OEstimator), param_grid should be in the form of {'param':[values]}. Note that a param_grid with named step parameters in the absence of named steps will raise an error.
feature_names : iterable (str)
The list of feature names on which to fit
target_feature : str
The name of the target
exposure_feature : str
The name of the exposure feature
loss_feature : str
The name of the loss feature
premium_feature : str
The name of the premium feature
n_iter : int, optional (default=10)
The number of iterations to fit. Note that n_iter * cv.get_n_splits models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one) will be fit.
random_state : int, optional (default=None)
The random state for the search
scoring : str, optional (default=’lift’)
One of {‘lift’,’gini’} or other valid GainsStatisticalReport scoring metrics.
scoring_params : dict, optional (default=None)
Any kwargs to be passed to the scoring function for scoring at each iteration.
cv : int or H2OCrossValidator, optional (default=5)
The number of folds to be fit for cross validation. Note that n_iter * cv.get_n_splits models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one) will be fit.
verbose : int, optional (default=0)
The level of verbosity: 0, 1, or 2 or greater. A verbosity of 0 will produce no output other than the default H2O fit/predict output. A verbosity of 1 will print the selected parameters at each fold and iteration, and a verbosity of 2 will produce all of the aforementioned output plus the intermediate fold scores.
iid : bool, optional (default=True)
Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
validation_frame : H2OFrame, optional (default=None)
Whether to score on the full validation frame at the end of all of the model fits. Note that this will NOT be used in the actual model selection process.
minimize : str, optional (default=’bias’)
How the search selects the best model to fit on the entire dataset. One of {‘bias’,’variance’}. The default behavior is ‘bias’, which is also the default behavior of sklearn. This will select the set of hyper parameters which maximizes the cross validation score mean. Alternatively, ‘variance’ will select the model which minimizes the standard deviations between cross validation scores.
error_score : float, optional (default=np.nan)
The default score to use in the case of a pd.qcut ValueError (when there are non-unique bin edges)
error_behavior : str, optional (default=’warn’)
How to handle the pd.qcut ValueError. One of {‘warn’,’raise’,’ignore’}
New in version 0.1.0.
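The two minimize modes described in the parameters can be illustrated with a small stand-alone sketch. It assumes higher scores are better and uses the population standard deviation; `pick_best` and the setting labels are hypothetical, not skutil names.

```python
from statistics import mean, pstdev


def pick_best(cv_results, minimize="bias"):
    """cv_results maps a setting label to its list of fold scores."""
    if minimize == "bias":
        # Highest mean cross-validation score wins
        return max(cv_results, key=lambda k: mean(cv_results[k]))
    if minimize == "variance":
        # Smallest spread between fold scores wins
        return min(cv_results, key=lambda k: pstdev(cv_results[k]))
    raise ValueError("minimize must be 'bias' or 'variance'")


results = {"A": [0.90, 0.60, 0.90], "B": [0.78, 0.80, 0.79]}
print(pick_best(results, "bias"), pick_best(results, "variance"))
```

Setting A has the higher mean but swings widely between folds, while B is slightly worse on average and far more stable, so the two modes disagree here.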
Methods
download_pojo(*args, **kwargs): This method is injected at runtime if the best_estimator_ is an instance of an H2OEstimator.
fit(frame): Fit the grid search.
fit_predict(frame): First, fits the grid search and then generates predictions on the training frame using the best_estimator_.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OSearchCV from disk.
plot(timestep, metric): Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
predict(*args, **kwargs): After the grid search is fit, generates predictions on the test frame using the best_estimator_.
report_scores(): Create a dataframe report for the fitting and scoring of the gains search.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
score(frame): Predict and score on a new frame.
set_params(**params): Set the parameters of this estimator.
varimp(*args, **kwargs): Get the variable importance, if the final estimator implements such a function.
-
fit(frame)
Fit the grid search.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
The training frame on which to fit.
-
report_scores()
Create a dataframe report for the fitting and scoring of the gains search. Will report lift, gini and any other relevant metrics. If a validation set was included, will also report validation scores.
Returns: rdf : pd.DataFrame, shape=(n_iter, n_params)
The grid search report
-
score(frame)
Predict and score on a new frame. Note that this method will not store performance metrics in the report that report_scores generates.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
The test frame on which to predict and score performance.
Returns: scor : float
The score on the testing frame
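The fit count quoted for n_iter and cv (n_iter * n_splits, plus one final refit) and the idea of drawing random settings from a grid can be sketched as follows. The sampling strategy (one independent draw per parameter, duplicates possible) is an assumption, the function names are hypothetical, and the parameter names in the grid are purely illustrative.

```python
import random


def sample_param_settings(param_grid, n_iter=10, random_state=None):
    """Draw n_iter settings, picking one candidate value per parameter."""
    rng = random.Random(random_state)
    keys = sorted(param_grid)
    return [{k: rng.choice(param_grid[k]) for k in keys} for _ in range(n_iter)]


def total_fits(n_iter, n_splits):
    """One fit per (setting, fold) pair, plus the final refit on all data."""
    return n_iter * n_splits + 1


grid = {"ntrees": [10, 50, 100], "max_depth": [3, 5]}
for setting in sample_param_settings(grid, n_iter=3, random_state=42):
    print(setting)
print(total_fits(n_iter=10, n_splits=10))
```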
-
-
class skutil.h2o.H2OGridSearchCV(estimator, param_grid, feature_names, target_feature, scoring=None, scoring_params=None, cv=5, verbose=0, iid=True, validation_frame=None, minimize='bias')
Bases: skutil.h2o.grid_search.BaseH2OSearchCV
An exhaustive grid search that will fit models across the entire hyperparameter grid provided.
Parameters: estimator : H2OPipeline or H2OEstimator
The estimator to fit. Either a skutil.h2o.H2OPipeline or an H2OEstimator. If the estimator is a pipeline, it must contain an estimator as the final step.
param_grid : dict
The hyperparameter grid over which to search. If estimator is a skutil.h2o.H2OPipeline, the param_grid should be in the form of {'stepname__param':[values]}; if there are no named steps (i.e., if estimator is an H2OEstimator), param_grid should be in the form of {'param':[values]}. Note that a param_grid with named step parameters in the absence of named steps will raise an error.
feature_names : iterable (str)
The list of feature names on which to fit
target_feature : str
The name of the target
scoring : str, optional (default=None)
A valid scoring metric, e.g., “accuracy_score”. See skutil.h2o.grid_search.SCORERS for a comprehensive list.
scoring_params : dict, optional (default=None)
Any kwargs to be passed to the scoring function for scoring at each iteration.
cv : int or H2OCrossValidator, optional (default=5)
The number of folds to be fit for cross validation.
verbose : int, optional (default=0)
The level of verbosity: 0, 1, or 2 or greater. A verbosity of 0 will produce no output other than the default H2O fit/predict output. A verbosity of 1 will print the selected parameters at each fold and iteration, and a verbosity of 2 will produce all of the aforementioned output plus the intermediate fold scores.
iid : bool, optional (default=True)
Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
validation_frame : H2OFrame, optional (default=None)
Whether to score on the full validation frame at the end of all of the model fits. Note that this will NOT be used in the actual model selection process.
minimize : str, optional (default=’bias’)
How the search selects the best model to fit on the entire dataset. One of {‘bias’,’variance’}. The default behavior is ‘bias’, which is also the default behavior of sklearn. This will select the set of hyper parameters which maximizes the cross validation score mean. Alternatively, ‘variance’ will select the model which minimizes the standard deviations between cross validation scores.
New in version 0.1.0.
Methods
download_pojo(*args, **kwargs): This method is injected at runtime if the best_estimator_ is an instance of an H2OEstimator.
fit(frame): Fit the grid search.
fit_predict(frame): First, fits the grid search and then generates predictions on the training frame using the best_estimator_.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OSearchCV from disk.
plot(timestep, metric): Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
predict(*args, **kwargs): After the grid search is fit, generates predictions on the test frame using the best_estimator_.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
score(frame): After the grid search is fit, generates and scores the predictions of the best_estimator_.
set_params(**params): Set the parameters of this estimator.
varimp(*args, **kwargs): Get the variable importance, if the final estimator implements such a function.
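The {'stepname__param':[values]} convention for param_grid can be illustrated without H2O. The sketch below expands a grid exhaustively and then routes each key to its pipeline step; the step names ("mcf", "gbm"), the parameter names and both helper functions are hypothetical.

```python
import itertools


def expand_grid(param_grid):
    """Yield every combination of the candidate values as a dict."""
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def split_by_step(setting, step_names):
    """Group 'step__param' keys by step; reject malformed or unknown keys."""
    grouped = {}
    for key, value in setting.items():
        step, sep, param = key.partition("__")
        if not sep:
            raise ValueError("%r is not in 'stepname__param' form" % key)
        if step not in step_names:
            raise ValueError("unknown step %r" % step)
        grouped.setdefault(step, {})[param] = value
    return grouped


grid = {"mcf__threshold": [0.85, 0.9], "gbm__ntrees": [10, 50]}
for setting in expand_grid(grid):
    print(split_by_step(setting, step_names=["mcf", "gbm"]))
```

An exhaustive search over this grid visits 2 * 2 = 4 settings; a key that names a step missing from the pipeline is rejected up front rather than silently ignored.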
-
class skutil.h2o.H2OInteractionTermTransformer(feature_names=None, target_feature=None, exclude_features=None, interaction_function=None, name_suffix='I', only_return_interactions=False)
Bases: skutil.h2o.base.BaseH2OTransformer
A class that will generate interaction terms between selected columns. An interaction captures some relationship between two independent variables in the form of:
\(In = (x_i * x_j)\)
Note that the H2OInteractionTermTransformer will only operate on the feature_names, and at the transform point will return ALL features plus the newly generated ones unless otherwise specified in the only_return_interactions parameter.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
interaction_function : callable, optional (default=None)
A callable for interactions. Default None will result in multiplication of two Series objects
name_suffix : str, optional (default=’I’)
The suffix to add to the new feature name in the form of <feature_x>_<feature_y>_<suffix>
only_return_interactions : bool, optional (default=False)
If set to True, will only return features in feature_names and their respective generated interaction terms.
Attributes: fun_ : callable
The interaction term function assigned in the fit method.
New in version 0.1.0.
Methods
fit(frame): Fit the transformer.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Perform the interaction term expansion.
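A pure-Python sketch of the interaction expansion, using a dict of lists in place of an H2OFrame, may help clarify the naming scheme and the only_return_interactions switch. The function `add_interactions` is hypothetical; pairing every two-element combination of feature_names is an assumption about which pairs are generated.

```python
from itertools import combinations


def add_interactions(columns, feature_names, interaction_function=None,
                     name_suffix="I", only_return_interactions=False):
    """columns maps a column name to a list of numeric values."""
    # Default interaction: element-wise product of the two columns
    fun = interaction_function or (lambda a, b: [x * y for x, y in zip(a, b)])
    new_cols = {}
    for left, right in combinations(feature_names, 2):
        new_name = "%s_%s_%s" % (left, right, name_suffix)
        new_cols[new_name] = fun(columns[left], columns[right])
    if only_return_interactions:
        result = {name: columns[name] for name in feature_names}
    else:
        result = dict(columns)
    result.update(new_cols)
    return result


cols = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
print(add_interactions(cols, feature_names=["a", "b"]))
```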
-
class skutil.h2o.H2OKFold(n_folds=3, shuffle=False, random_state=None)
Bases: skutil.h2o.split._H2OBaseKFold
K-folds cross-validator for an H2OFrame.
Parameters: n_folds : int, optional (default=3)
The number of splits
shuffle : bool, optional (default=False)
Whether to shuffle indices
random_state : int or RandomState, optional (default=None)
The random state for the split
Methods
get_n_splits(): Get the number of splits or folds.
split(frame[, y]): Split the frame.
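For readers unfamiliar with k-fold splitting, here is a minimal index-based sketch. It operates on row indices rather than an H2OFrame, and `kfold_indices` is a hypothetical helper; how skutil distributes leftover rows across folds is not stated in this page, so that detail is an assumption.

```python
import random


def kfold_indices(n_rows, n_folds=3, shuffle=False, random_state=None):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(random_state).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in kfold_indices(10, n_folds=3):
    print(len(train), test)
```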
-
class skutil.h2o.H2OLabelEncoder
Bases: skutil.h2o.base.BaseH2OTransformer
Encode categorical values in an H2OFrame (single column) into ordinal labels from 0 to n_classes - 1, where n_classes is the number of unique levels in the column.
Attributes: classes_ : np.ndarray
The unique class levels
New in version 0.1.0.
Examples
>>> def example():
...     import pandas as pd
...     from skutil.h2o import from_pandas
...     from skutil.h2o.transform import H2OLabelEncoder
...
...     x = pd.DataFrame.from_records(data=[
...         [5, 4],
...         [6, 2],
...         [5, 1],
...         [7, 9],
...         [7, 2]], columns=['C1', 'C2'])
...
...     X = from_pandas(x)
...     encoder = H2OLabelEncoder()
...     return encoder.fit_transform(X['C1'])
>>>
>>> example()
  C1
----
   0
   1
   0
   2
   2
[5 rows x 1 column]
Methods
fit(column)
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(column)
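The encoding in the example above (5, 6, 5, 7, 7 becoming 0, 1, 0, 2, 2) is consistent with assigning codes to the sorted unique levels. A plain-Python sketch of that idea follows; the function names are hypothetical and the sorted ordering is an assumption inferred from the example output.

```python
def fit_label_encoder(values):
    """Return the sorted unique levels (the analogue of classes_)."""
    return sorted(set(values))


def transform_labels(values, classes):
    """Map each value to the position of its level in `classes`."""
    lookup = {level: code for code, level in enumerate(classes)}
    return [lookup[v] for v in values]


column = [5, 6, 5, 7, 7]
classes = fit_label_encoder(column)
print(transform_labels(column, classes))
```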
-
class skutil.h2o.H2OMulticollinearityFilterer(feature_names=None, target_feature=None, exclude_features=None, threshold=0.85, na_warn=True, na_rm=False, use='complete.obs')
Bases: skutil.h2o.select.BaseH2OFeatureSelector
Filter out features with a correlation greater than the provided threshold. When a pair of correlated features is identified, the mean absolute correlation (MAC) of each feature is considered, and the feature with the highest MAC is discarded.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
threshold : float, optional (default=0.85)
The threshold above which to filter correlated features
na_warn : bool, optional (default=True)
Whether to warn if any NAs are present
na_rm : bool, optional (default=False)
Whether to remove NA values
use : str, optional (default “complete.obs”)
One of {‘complete.obs’,’all.obs’,’everything’}. A string indicating how to handle missing values.
Attributes: drop_ : list, string
The columns to drop
mean_abs_correlations_ : list, float
The corresponding mean absolute correlations of each drop_ name
correlations_ : named tuple
A list of tuples with each tuple containing the two correlated features, the level of correlation, the feature that was selected for dropping, and the mean absolute correlation of the dropped feature.
New in version 0.1.0.
Methods
fit(X): Fit the H2OTransformer.
fit_transform(X): Fit the multicollinearity filterer and return the transformed H2OFrame, X.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
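The mean-absolute-correlation (MAC) rule described for this class can be sketched on a small correlation matrix. This is an illustrative reading of the description, not the skutil code: the iteration order, the exclusion of self-correlation from the MAC, and the tie-breaking are all assumptions, and the function names are hypothetical.

```python
def mean_abs_corr(corr, feature, active):
    """Mean absolute correlation of `feature` with the other active features."""
    others = [g for g in active if g != feature]
    return sum(abs(corr[feature][g]) for g in others) / len(others)


def multicollinearity_drops(names, corr, threshold=0.85):
    """corr is a symmetric list-of-lists matrix of pairwise correlations."""
    active = list(range(len(names)))
    drops = []
    while True:
        # Most strongly correlated pair among the remaining features
        pairs = [(abs(corr[i][j]), i, j)
                 for pos, i in enumerate(active) for j in active[pos + 1:]]
        if not pairs:
            break
        top, i, j = max(pairs)
        if top <= threshold:
            break
        # Discard whichever member of the pair has the higher MAC
        victim = i if mean_abs_corr(corr, i, active) >= mean_abs_corr(corr, j, active) else j
        drops.append(names[victim])
        active.remove(victim)
    return drops


corr = [[1.0, 0.9, 0.5],
        [0.9, 1.0, 0.1],
        [0.5, 0.1, 1.0]]
print(multicollinearity_drops(["a", "b", "c"], corr))
```

Here "a" and "b" exceed the threshold; "a" is also moderately correlated with "c", so its MAC is higher and it is the one discarded.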
-
class skutil.h2o.H2ONearZeroVarianceFilterer(feature_names=None, target_feature=None, exclude_features=None, threshold=1e-06, na_warn=True, na_rm=False, use='complete.obs', strategy='variance')
Bases: skutil.h2o.select.BaseH2OFeatureSelector
Identify and remove any features that have a variance below a certain threshold. There are two possible strategies for near-zero variance feature selection:
- Select features on the basis of the actual variance they exhibit. This is only relevant when the features are real numbers.
- Remove features where the ratio of the frequency of the most prevalent value to that of the second-most frequent value is large, say 20 or above (Kuhn & Johnson [R4]).
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
threshold : float, optional (default=1e-6)
The threshold below which to declare “zero variance”
na_warn : bool, optional (default=True)
Whether to warn if any NAs are present
na_rm : bool, optional (default=False)
Whether to remove NA values
use : str, optional (default “complete.obs”)
One of {‘complete.obs’,’all.obs’,’everything’} A string indicating how to handle missing values.
strategy : str, optional (default=’variance’)
The strategy by which feature selection should be performed, one of (‘variance’, ‘ratio’). If strategy is ‘variance’, features will be selected based on the amount of variance they exhibit; those that are low-variance (below threshold) will be removed. If strategy is ‘ratio’, features are dropped if the most prevalent value is represented at a ratio greater than threshold to the second-most frequent value. Note that if strategy is ‘ratio’, threshold must be greater than 1.
Attributes: drop_ : list, string
The columns to drop
var_ : dict
The dropped columns mapped to their corresponding variances or ratios, depending on the strategy
References
[R4] Kuhn, M. & Johnson, K. “Applied Predictive Modeling” (2013). New York, NY: Springer.
New in version 0.1.0.
Methods
fit(X): Fit the near zero variance filterer, return the transformed X frame.
fit_transform(X): Fit the near zero variance filterer.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
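Both strategies can be sketched on plain lists. The example below uses the sample variance for 'variance' and the frequency ratio of the two most common values for 'ratio'; treating a constant column as having an infinite ratio is an assumption, and `near_zero_variance_drops` is a hypothetical name.

```python
from collections import Counter
from statistics import variance


def near_zero_variance_drops(columns, threshold=1e-6, strategy="variance"):
    """columns maps a name to a list of values; returns {name: statistic}."""
    if strategy not in ("variance", "ratio"):
        raise ValueError("strategy must be 'variance' or 'ratio'")
    if strategy == "ratio" and threshold <= 1:
        raise ValueError("threshold must be greater than 1 for 'ratio'")
    dropped = {}
    for name, values in columns.items():
        if strategy == "variance":
            stat = variance(values)  # sample variance
            if stat < threshold:
                dropped[name] = stat
        else:
            counts = [c for _, c in Counter(values).most_common(2)]
            # A constant column has no second value: treat the ratio as infinite
            stat = counts[0] / float(counts[1]) if len(counts) > 1 else float("inf")
            if stat > threshold:
                dropped[name] = stat
    return dropped


print(near_zero_variance_drops({"const": [1.0] * 5, "varied": [1.0, 2.0, 3.0, 4.0, 5.0]}))
print(near_zero_variance_drops({"skewed": [0] * 21 + [1], "balanced": [0, 1] * 10},
                               threshold=20, strategy="ratio"))
```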
-
class skutil.h2o.H2OOversamplingClassBalancer(target_feature, ratio=0.2, shuffle=True)
Bases: skutil.h2o.balance._BaseH2OBalancer
Oversample the minority classes until they are represented at the target proportion to the majority class.
Parameters: target_feature : str
The name of the response column. The response column must contain more than a single class and fewer than skutil.preprocessing.balance.BalancerMixin._max_classes classes.
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input frame
shuffle : bool, optional (default=True)
Whether or not to shuffle rows on return
Examples
Consider the following example: with a ratio of 0.5, the minority classes (1, 2) will be oversampled until they are represented at a ratio of at least 0.5 * the prevalence of the majority class (0).
>>> def example():
...     import h2o
...     import pandas as pd
...     import numpy as np
...     from skutil.h2o.frame import value_counts
...     from skutil.h2o import from_pandas, H2OOversamplingClassBalancer
...
...     # initialize h2o
...     h2o.init()
...
...     # read into pandas
...     x = pd.DataFrame(np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2]), columns=['A'])
...
...     # load into h2o
...     X = from_pandas(x)
...
...     # initialize sampler
...     sampler = H2OOversamplingClassBalancer(target_feature="A", ratio=0.5)
...
...     # do balancing
...     X_balanced = sampler.balance(X)
...     return value_counts(X_balanced)
>>>
>>> example()
0    100
1     50
2     50
Name A, dtype: int64
New in version 0.1.0.
Methods
balance(X): Apply the oversampling balance operation.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
-
balance(X)[source]¶ Apply the oversampling balance operation. Oversamples the minority class to the provided ratio of minority class(es) : majority class.
Parameters: X : H2OFrame, shape=(n_samples, n_features)
The imbalanced dataset.
Returns: Xb : H2OFrame, shape=(n_samples, n_features)
The balanced H2OFrame
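The oversampling rule described above can be sketched without H2O. This is a minimal illustration on a plain Python list of labels, not the library's implementation: the function name `oversample_indices` and the choice to sample with replacement are assumptions made for the example.

```python
import random
from collections import Counter


def oversample_indices(labels, ratio=0.2, seed=0):
    """Return row indices that oversample each minority class (illustrative sketch).

    Each minority class is topped up, sampling with replacement, until it
    holds at least ratio * majority_count rows.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_count = max(counts.values())
    target = int(ratio * majority_count)

    indices = list(range(len(labels)))
    for cls, n in counts.items():
        if n >= target:
            continue  # already at or above the requested ratio
        members = [i for i, lab in enumerate(labels) if lab == cls]
        indices.extend(rng.choice(members) for _ in range(target - n))
    return indices


labels = [0] * 100 + [1] * 30 + [2] * 25
balanced = [labels[i] for i in oversample_indices(labels, ratio=0.5)]
balanced_counts = dict(Counter(balanced))
```

With the same class sizes as the example above (100, 30, 25) and a ratio of 0.5, both minority classes are raised to 50 rows while the majority class is left alone.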
-
-
class
skutil.h2o.
H2OPipeline
(steps, feature_names=None, target_feature=None, exclude_from_ppc=None, exclude_from_fit=None)[source]¶ Bases:
skutil.h2o.base.BaseH2OFunctionWrapper, skutil.h2o.base.VizMixin
Create a sklearn-esque pipeline of H2O steps finished with an optional H2OEstimator. Note that as of version 0.1.0, the behavior of the H2OPipeline has slightly changed, given the inclusion of the exclude_from_ppc and exclude_from_fit parameters.
The pipeline, at its core, comprises a list of length-two tuples in the form of ('name', SomeH2OTransformer()), punctuated with an optional H2OEstimator as the final step. The pipeline will procedurally fit each stage, transforming the training data prior to fitting the next stage. When predicting or transforming new (test) data, each stage calls either transform or predict at the respective step.
On the topic of exclusions and feature_names:
Prior to version 0.1.0, H2OTransformers did not take the keyword exclude_features. Its addition necessitated two new keywords in the H2OPipeline, and a slight change in behavior of feature_names:
exclude_from_ppc - If set in the H2OPipeline constructor, these features will be universally omitted from every preprocessing stage. Since exclude_features can be set individually in each separate transformer, in the case that exclude_features has been explicitly set, the exclusions in that respective stage will include the union of exclude_from_ppc and exclude_features.
exclude_from_fit - If set in the H2OPipeline constructor, these features will be omitted from the training_cols_ fit attribute, which are the columns passed to the final stage in the pipeline.
feature_names - The former behavior of the H2OPipeline only used feature_names in the fit of the first transformer, passing the remaining columns to the next transformer as the feature_names parameter. The new behavior is more discriminating in the case of explicitly-set attributes. In the case where a transformer’s feature_names parameter has been explicitly set, only those names will be used in the fit. This is useful in cases where someone may only want to, for instance, drop one of two multicollinear features using the H2OMulticollinearityFilterer rather than fitting against the entire dataset. It also adheres to the now expected behavior of the exclusion parameters.
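The interaction between feature_names, exclude_features and exclude_from_ppc can be sketched as a small column-resolution function. This is an illustration of the rules described above, assuming plain lists of column names; the helper `stage_features` is hypothetical and is not part of skutil.

```python
def stage_features(columns, target_feature=None, feature_names=None,
                   exclude_features=None, exclude_from_ppc=None):
    """Resolve the columns one preprocessing stage would be fit on (sketch)."""
    # Stage-level exclusions are unioned with the pipeline-wide exclusions.
    excluded = set(exclude_features or []) | set(exclude_from_ppc or [])
    if target_feature is not None:
        excluded.add(target_feature)

    # Explicitly set feature_names take precedence; otherwise use every column.
    candidates = feature_names if feature_names is not None else columns
    return [c for c in candidates if c not in excluded]


columns = ['CRIM', 'TAX', 'CHAS', 'B', 'PTRATIO', 'target']

# A stage with explicit feature_names only ever sees those names.
scl_cols = stage_features(columns, 'target',
                          feature_names=['B', 'PTRATIO', 'CRIM'],
                          exclude_from_ppc=['TAX'])

# A stage with its own exclusions also inherits the global ones.
mcf_cols = stage_features(columns, 'target',
                          exclude_features=['CHAS'],
                          exclude_from_ppc=['TAX'])
```

Here the second stage skips both 'CHAS' (its own exclusion) and 'TAX' (the pipeline-wide exclusion), as well as the target.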
Parameters: steps : list
A list of named tuples wherein element 1 of each tuple is an instance of a BaseH2OTransformer or an H2OEstimator.
feature_names : iterable, optional (default=None)
The names of features on which to fit the first transformer in the pipeline. The next transformer will be fit with feature_names as the result-set columns from the previous transformer, minus any exclusions or target features.
target_feature : str, optional (default=None)
The name of the target feature
exclude_from_ppc : iterable, optional (default=None)
Any names to be excluded from any preprocessor fits. Since the exclude_features can be set in respective steps in each preprocessor, these features will be considered as global exclusions and will be appended to any individually set exclusion features.
exclude_from_fit : iterable, optional (default=None)
Any names to be excluded from the final model fit
Attributes: training_cols_ : list (str), shape=(n_features,)
The columns that are retained for training purposes after the _pre_transform operation, which fits the series of transformers but not the final estimator.
New in version 0.1.0.
Examples
The following is a simple example of an H2OPipeline in use:
>>> def example():
...     import h2o
...     from h2o.estimators import H2ORandomForestEstimator
...     from skutil.h2o import H2OMulticollinearityFilterer
...     from skutil.h2o import load_iris_h2o
...
...     # initialize h2o
...     h2o.init()
...
...     # load into h2o
...     X = load_iris_h2o(tgt_name="Species")
...
...     # get feature names and target
...     x, y = X.columns[:-1], X.columns[-1]
...
...     # define and fit the pipe
...     pipe = H2OPipeline([
...         ('mcf', H2OMulticollinearityFilterer()),
...         ('clf', H2ORandomForestEstimator())
...     ], feature_names=x, target_feature=y).fit(X)
>>>
>>> example()
This is a more advanced example of the H2OPipeline (including use of the exclude_from_ppc and exclude_from_fit parameters):
>>> def example():
...     import h2o
...     from skutil.h2o import load_boston_h2o
...     from skutil.h2o import h2o_train_test_split
...     from skutil.h2o.transform import H2OSelectiveScaler
...     from skutil.h2o.select import H2OMulticollinearityFilterer
...     from h2o.estimators import H2OGradientBoostingEstimator
...
...     # initialize h2o
...     h2o.init()
...
...     # load into h2o
...     X = load_boston_h2o(include_tgt=True, shuffle=True, tgt_name='target')
...
...     # this splits our data
...     X_train, X_test = h2o_train_test_split(X, train_size=0.7)
...
...     # Declare our pipe - this one is intentionally a bit complex in behavior
...     pipe = H2OPipeline([
...         ('scl', H2OSelectiveScaler(feature_names=['B','PTRATIO','CRIM'])), # will ONLY operate on these
...         ('mcf', H2OMulticollinearityFilterer(exclude_features=['CHAS'])),  # will exclude this & 'TAX'
...         ('gbm', H2OGradientBoostingEstimator())
...     ], exclude_from_ppc=['TAX'], # excluded from all preprocessor fits
...        feature_names=None,       # fit the first stage on ALL features (minus exceptions)
...        target_feature='target'   # will be excluded from all preprocessor fits, as it's the target
...     ).fit(X_train)
>>>
>>> example()
Methods
download_pojo(*args, **kwargs): This method is injected at runtime if the _final_estimator is an instance of an H2OEstimator.
fit(frame): Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_predict(*args, **kwargs): Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_transform(*args, **kwargs): Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of H2OPipeline from disk.
plot(*args, **kwargs): If the _final_estimator is an H2OEstimator, this method is injected at runtime.
predict(*args, **kwargs): Applies transforms to the data, and the predict method of the final estimator.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters for this pipeline.
transform(*args, **kwargs): Applies transforms to the data.
varimp(*args, **kwargs): Get the variable importance, if the final estimator implements such a function.
-
download_pojo(*args, **kwargs)[source]¶ This method is injected at runtime if the _final_estimator is an instance of an H2OEstimator. This method downloads the POJO from a fit estimator.
Parameters: path : string, optional (default=””)
Path to folder in which to save the POJO.
get_jar : bool, optional (default=True)
Whether to get the jar from the POJO.
Returns: None or string :
Returns None if path is “”, else the filepath where the POJO was saved.
-
fit
(frame)[source]¶ Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
Training data on which to fit. Must fulfill input requirements of first step of the pipeline.
Returns: self :
-
fit_predict
(*args, **kwargs)[source]¶ Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator. Finally, predict on the final step.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
Training data. Must fulfill input requirements of first step of the pipeline.
-
fit_transform
(*args, **kwargs)[source]¶ Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator. Finally, transform on the final step.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
Training data. Must fulfill input requirements of first step of the pipeline.
Returns: Xt : H2OFrame, shape=(n_samples, n_features)
The transformed training data
-
static
load
(location)[source]¶ Loads a persisted state of an instance of H2OPipeline from disk. This method will handle loading H2OEstimator models separately and outside of the constraints of the pickle package.
Note that this is a static method and should be called accordingly:
>>> def load_pipe():
...     return H2OPipeline.load('path/to/h2o/pipeline.pkl') # GOOD!
>>>
>>> pipe = load_pipe()
Also note that since H2OPipeline can contain an H2OEstimator, its load functionality differs from that of its superclass, BaseH2OFunctionWrapper, and will not function properly if called at the highest level of abstraction:
>>> def load_pipe():
...     return BaseH2OFunctionWrapper.load('path/to/h2o/pipeline.pkl') # BAD!
>>>
>>> pipe = load_pipe()
Furthermore, trying to load a different type of BaseH2OFunctionWrapper from this method will raise a TypeError:
>>> def load_pipe():
...     return H2OPipeline.load('path/to/some/other/transformer.pkl') # BAD!
>>>
>>> pipe = load_pipe()
Parameters: location : str
The location where the persisted H2OPipeline model resides.
Returns: model : H2OPipeline
The unpickled instance of the H2OPipeline model
-
named_steps¶ Generates a dictionary of all of the stages where the stage name is the key, and the stage is the value. Note that dictionaries are not guaranteed to preserve a specific order.
Returns: d : dict
The dictionary of named steps.
-
plot
(*args, **kwargs)[source]¶ If the _final_estimator is an H2OEstimator, this method is injected at runtime. This method plots an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
Parameters: timestep : str
A timestep as defined in the H2O API. One of (“AUTO”, “duration”, “number_of_trees”).
metric : str
The performance metric to evaluate. One of (“log_likelihood”, “objective”, “MSE”, “AUTO”)
-
predict
(*args, **kwargs)[source]¶ Applies transforms to the data, and the predict method of the final estimator. Valid only if the final estimator implements predict.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
Data to predict on. Must fulfill input requirements of first step of the pipeline.
-
set_params
(**params)[source]¶ Set the parameters for this pipeline. Will revalidate the steps in the estimator prior to setting the parameters. Parameters is a **kwargs-style dictionary whose keys should be prefixed by the name of the step targeted and a double underscore:
>>> def example():
...     from skutil.h2o.select import H2OMulticollinearityFilterer
...     from h2o.estimators import H2ORandomForestEstimator
...
...     pipe = H2OPipeline([
...         ('mcf', H2OMulticollinearityFilterer()),
...         ('rf', H2ORandomForestEstimator())
...     ])
...
...     pipe.set_params(**{
...         'rf__ntrees': 100,
...         'mcf__threshold': 0.75
...     })
>>>
>>> example()
Returns: self :
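The step-prefixed key convention can be illustrated with a short routing function. This is a sketch of the naming scheme only, assuming each key is split on its first double underscore; `split_step_params` is a hypothetical helper and not the pipeline's own code.

```python
from collections import defaultdict


def split_step_params(params, step_names):
    """Group 'step__param' keys by the step they target (illustrative sketch)."""
    grouped = defaultdict(dict)
    for key, value in params.items():
        # Split on the first double underscore: 'rf__ntrees' -> ('rf', 'ntrees')
        step, sep, name = key.partition('__')
        if not sep or step not in step_names:
            raise ValueError('invalid parameter key: %r' % key)
        grouped[step][name] = value
    return dict(grouped)


grouped = split_step_params(
    {'rf__ntrees': 100, 'mcf__threshold': 0.75},
    step_names=['mcf', 'rf'],
)
```

Each inner dictionary could then be handed to the matching step's own set_params call.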
-
transform
(*args, **kwargs)[source]¶ Applies transforms to the data. Valid only if the final estimator implements predict.
Parameters: frame : H2OFrame, shape=(n_samples, n_features)
Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns: Xt : H2OFrame, shape=(n_samples, n_features)
The transformed test data
-
class
skutil.h2o.
H2ORandomizedSearchCV
(estimator, param_grid, feature_names, target_feature, n_iter=10, random_state=None, scoring=None, scoring_params=None, cv=5, verbose=0, iid=True, validation_frame=None, minimize='bias')[source]¶ Bases:
skutil.h2o.grid_search.BaseH2OSearchCV
A grid search that operates over a random sub-hyperparameter space at each iteration.
Parameters: estimator : H2OPipeline or H2OEstimator
The estimator to fit. Either a skutil.h2o.H2OPipeline or an H2OEstimator. If the estimator is a pipeline, it must contain an estimator as the final step.
param_grid : dict
The hyperparameter grid over which to search. If estimator is a skutil.h2o.H2OPipeline, the param_grid should be in the form of {'stepname__param':[values]}; if there are no named steps (i.e., if estimator is an H2OEstimator), param_grid should be in the form of {'param':[values]}. Note that a param_grid with named step parameters in the absence of named steps will raise an error.
feature_names : iterable (str)
The list of feature names on which to fit
target_feature : str
The name of the target
n_iter : int, optional (default=10)
The number of iterations to fit. Note that n_iter * cv.get_n_splits models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one) will be fit.
random_state : int, optional (default=None)
The random state for the search
scoring : str, optional (default=’lift’)
A valid scoring metric, e.g., “accuracy_score”. See skutil.h2o.grid_search.SCORERS for a comprehensive list.
scoring_params : dict, optional (default=None)
Any kwargs to be passed to the scoring function for scoring at each iteration.
cv : int or H2OCrossValidator, optional (default=5)
The number of folds to be fit for cross validation. Note that n_iter * cv.get_n_splits models will be fit. If there are 10 folds and 10 iterations, 100 models (plus one) will be fit.
verbose : int, optional (default=0)
The level of verbosity: 0, 1, or 2 (or greater). A verbosity of 0 will produce no output other than the default H2O fit/predict output. A verbosity of 1 will print the selected parameters at each fold and iteration, and a verbosity of 2 will produce all of the aforementioned output plus the intermediate fold scores.
iid : bool, optional (default=True)
Whether to consider each fold as IID. The fold scores are normalized at the end by the number of observations in each fold. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
validation_frame : H2OFrame, optional (default=None)
The frame on which to score at the end of all of the model fits. Note that this will NOT be used in the actual model selection process.
minimize : str, optional (default=’bias’)
How the search selects the best model to fit on the entire dataset. One of {‘bias’, ‘variance’}. The default behavior is ‘bias’, which is also the default behavior of sklearn. This will select the set of hyperparameters which maximizes the cross validation score mean. Alternatively, ‘variance’ will select the model which minimizes the standard deviation between cross validation scores.
New in version 0.1.0.
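The two minimize strategies can be sketched on a list of per-fold scores. This is a minimal illustration, assuming higher scores are better and using the population standard deviation; the `select_best` helper and the sample scores are invented for the example and do not come from skutil.

```python
from statistics import mean, pstdev


def select_best(cv_results, minimize='bias'):
    """Pick a parameter set from (params, fold_scores) pairs (illustrative sketch)."""
    if minimize == 'bias':
        # Highest mean cross validation score wins.
        return max(cv_results, key=lambda r: mean(r[1]))[0]
    if minimize == 'variance':
        # Most stable scores across folds win.
        return min(cv_results, key=lambda r: pstdev(r[1]))[0]
    raise ValueError("minimize must be 'bias' or 'variance'")


results = [
    ({'ntrees': 50}, [0.90, 0.70, 0.95]),    # higher mean, less stable
    ({'ntrees': 100}, [0.82, 0.80, 0.81]),   # lower mean, more stable
]
best_bias = select_best(results, 'bias')
best_variance = select_best(results, 'variance')
```

The two strategies disagree here: 'bias' prefers the first candidate for its higher mean, while 'variance' prefers the second for its tighter spread.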
Methods
download_pojo(*args, **kwargs): This method is injected at runtime if the best_estimator_ is an instance of an H2OEstimator.
fit(frame): Fit the grid search.
fit_predict(frame): First, fits the grid search and then generates predictions on the training frame using the best_estimator_.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OSearchCV from disk.
plot(timestep, metric): Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
predict(*args, **kwargs): After the grid search is fit, generates predictions on the test frame using the best_estimator_.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
score(frame): After the grid search is fit, generates and scores the predictions of the best_estimator_.
set_params(**params): Set the parameters of this estimator.
varimp(*args, **kwargs): Get the variable importance, if the final estimator implements such a function.
-
class
skutil.h2o.
H2OSafeOneHotEncoder
(feature_names=None, target_feature=None, exclude_features=None, drop_after_encoded=True)[source]¶ Bases:
skutil.h2o.base.BaseH2OTransformer
Given a set of feature_names, one-hot encodes (dummies) a set of vecs into an expanded set of dummied columns. Will drop the original columns after transformation, unless otherwise specified.
Parameters: feature_names : array_like (str) shape=(n_features,), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : array_like (str) shape=(n_features,), optional (default=None)
Any names that should be excluded from feature_names
drop_after_encoded : bool (default=True)
Whether to drop the original columns after transform
New in version 0.1.0.
Methods
fit(X): Fit the one hot encoder.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a new frame after fit.
-
class
skutil.h2o.
H2OSelectiveImputer
(feature_names=None, target_feature=None, exclude_features=None, def_fill='mean')[source]¶ Bases:
skutil.h2o.transform._H2OBaseImputer
The selective imputer provides extreme flexibility and simplicity in imputation tasks. Rather than imposing one strategy across an entire frame, different strategies can be mapped to respective features.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
def_fill : str, int or iterable, optional (default=’mean’)
The fill strategy. If an int, the int value will be applied to all missing values in the H2OFrame. If a string, must be one of (‘mean’, ‘median’, ‘mode’) - note that ‘mode’ is still under development. If an iterable (list, tuple, array, etc.), the length must match the column dimensions. However, if a dict, the strategies will be applied to the mapped columns.
Attributes: fill_val_ : int, float or iterable
The fill value(s) provided or derived in the fit method.
New in version 0.1.0.
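The different forms of def_fill can be sketched on plain Python data. This is an illustration of the dispatch described above, not the H2O-backed implementation: columns are assumed to be lists with None standing in for NA, and `resolve_fill_values` and `impute` are hypothetical helpers. The 'mode' strategy is left out of the sketch.

```python
from statistics import mean, median


def resolve_fill_values(columns, def_fill='mean'):
    """Map each column to its fill value (illustrative sketch)."""
    names = list(columns)
    if isinstance(def_fill, dict):
        strategies = def_fill  # only the mapped columns are imputed
    elif isinstance(def_fill, (list, tuple)):
        if len(def_fill) != len(names):
            raise ValueError('def_fill length must match the number of columns')
        strategies = dict(zip(names, def_fill))
    else:
        strategies = {name: def_fill for name in names}

    fills = {}
    for name, strategy in strategies.items():
        observed = [v for v in columns[name] if v is not None]
        if strategy == 'mean':
            fills[name] = mean(observed)
        elif strategy == 'median':
            fills[name] = median(observed)
        elif isinstance(strategy, str):
            raise ValueError('unsupported strategy in this sketch: %r' % strategy)
        else:
            fills[name] = strategy  # a literal fill value
    return fills


def impute(columns, fills):
    """Replace None with the fitted fill value for each mapped column."""
    return {name: [fills[name] if v is None and name in fills else v
                   for v in values]
            for name, values in columns.items()}


frame = {'a': [1.0, None, 3.0, 2.0], 'b': [10, None, 30, 50]}
fills = resolve_fill_values(frame, {'a': 'mean', 'b': 'median'})
imputed = impute(frame, fills)
```

Keeping the fitted fill values separate from the transform step mirrors the fit/transform split, so values learned on training data can be reused on a test frame.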
Methods
fit(X): Fit the imputer.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform an H2OFrame given the fit imputer.
-
class
skutil.h2o.
H2OSelectiveScaler
(feature_names=None, target_feature=None, exclude_features=None, with_mean=True, with_std=True)[source]¶ Bases:
skutil.h2o.base.BaseH2OTransformer
A class that will scale selected features in the H2OFrame.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
with_mean : bool, optional (default=True)
Whether to subtract the column mean
with_std : bool, optional (default=True)
Whether to divide by the column standard deviation
Attributes: means : dict (string:float)
The mapping of column names to column means
stds : dict (string:float)
The mapping of column names to column standard deviations
New in version 0.1.0.
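Selective scaling can be sketched with plain dictionaries of columns. This is an illustration only: `fit_scaler` and `scale` are hypothetical helpers, and the use of the sample standard deviation is an assumption rather than a statement about what H2O computes.

```python
from statistics import mean, stdev


def fit_scaler(columns, feature_names):
    """Learn per-column means and standard deviations for the selected names."""
    means = {name: mean(columns[name]) for name in feature_names}
    stds = {name: stdev(columns[name]) for name in feature_names}
    return means, stds


def scale(columns, means, stds, with_mean=True, with_std=True):
    """Scale only the fitted columns; every other column passes through."""
    out = dict(columns)
    for name in means:
        shift = means[name] if with_mean else 0.0
        denom = stds[name] if with_std else 1.0
        out[name] = [(v - shift) / denom for v in columns[name]]
    return out


frame = {'B': [1.0, 2.0, 3.0], 'TAX': [100.0, 200.0, 300.0]}
means, stds = fit_scaler(frame, ['B'])
scaled = scale(frame, means, stds)
centered_only = scale(frame, means, stds, with_std=False)
```

Only 'B' is transformed; 'TAX' is returned untouched because it was never passed in feature_names.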
Methods
fit(X): Fit the transformer.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Do the transformation
-
class
skutil.h2o.
H2OShuffleSplit
(n_splits=2, test_size=0.1, train_size=None, random_state=None)[source]¶ Bases:
skutil.h2o.split.H2OBaseShuffleSplit
Default shuffle splitter used for h2o_train_test_split. This shuffle split class will not perform any stratification, and will simply shuffle indices and split into the number of specified sub-frames.
Methods
get_n_splits(): Get the number of splits or folds for this instance of the shuffle split.
split(frame[, y]): Split the frame.
-
class
skutil.h2o.
H2OSparseFeatureDropper
(feature_names=None, target_feature=None, exclude_features=None, threshold=0.5)[source]¶ Bases:
skutil.h2o.select.BaseH2OFeatureSelector
Retains features that are less sparse (NA) than the provided threshold.
Parameters: feature_names : array_like (str), optional (default=None)
The list of names on which to fit the transformer.
target_feature : str, optional (default=None)
The name of the target feature (is excluded from the fit) for the estimator.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from feature_names
threshold : float, optional (default=0.5)
The threshold of sparsity above which to drop
Attributes: sparsity_ : array_like, (n_cols,)
The array of sparsity values
drop_ : array_like
The array of column names to drop
New in version 0.1.0.
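The sparsity rule can be sketched on plain Python columns, with None standing in for NA. This is an illustration under an assumption: a column is dropped only when its NA fraction is strictly above the threshold, which is one reading of the description above. `fit_sparsity` is a hypothetical helper.

```python
def fit_sparsity(columns, threshold=0.5):
    """Compute the NA fraction per column and the names to drop (sketch)."""
    sparsity = {name: sum(v is None for v in values) / len(values)
                for name, values in columns.items()}
    # Assumption: strictly greater than the threshold means "too sparse".
    drop = [name for name, s in sparsity.items() if s > threshold]
    return sparsity, drop


frame = {
    'a': [1, None, None, None],   # 75% missing
    'b': [1, 2, None, 4],         # 25% missing
    'c': [None, None, 3, 4],      # 50% missing, exactly at the threshold
}
sparsity, drop = fit_sparsity(frame, threshold=0.5)
```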
Methods
fit(X): Fit the H2OTransformer.
fit_transform(frame): Fit the model and then immediately transform the input (training) frame with the fit parameters.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform the test frame, after fitting the transformer.
-
class
skutil.h2o.
H2OStratifiedKFold
(n_folds=3, shuffle=False, random_state=None)[source]¶ Bases:
skutil.h2o.split._H2OBaseKFold
K-folds cross-validator for an H2OFrame with stratified splits.
Parameters: n_folds : int, optional (default=3)
The number of splits
shuffle : bool, optional (default=False)
Whether to shuffle indices
random_state : int or RandomState, optional (default=None)
The random state for the split
Methods
get_n_splits(): Get the number of splits or folds.
split(frame, y): Split the frame with stratification.
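Stratified k-fold splitting can be sketched on a list of labels. This is one simple way to stratify (dealing each class's rows round-robin across folds) and is an assumption made for illustration; it is not necessarily how H2OStratifiedKFold assigns rows. The function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(y, n_folds=3, shuffle=False, random_state=None):
    """Yield (train_indices, test_indices) pairs with per-class balance (sketch)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    rng = random.Random(random_state)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        if shuffle:
            rng.shuffle(members)
        # Deal this class's rows round-robin so each fold gets a near-equal share.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


y = [0] * 6 + [1] * 3
splits = list(stratified_folds(y, n_folds=3))
```

With six rows of class 0 and three of class 1, every test fold holds two rows of class 0 and one of class 1, preserving the 2:1 class ratio.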
-
class
skutil.h2o.
H2OStratifiedShuffleSplit
(n_splits=2, test_size=0.1, train_size=None, random_state=None)[source]¶ Bases:
skutil.h2o.split.H2OBaseShuffleSplit
Shuffle splitter used for h2o_train_test_split when stratified option is specified. This shuffle split class will perform stratification.
Methods
get_n_splits(): Get the number of splits or folds for this instance of the shuffle split.
split(frame, y): Split the frame with stratification.
-
class
skutil.h2o.
H2OUndersamplingClassBalancer
(target_feature, ratio=0.2, shuffle=True)[source]¶ Bases:
skutil.h2o.balance._BaseH2OBalancer
Undersample the majority class until it is represented at the target proportion to the most-represented minority class.
Parameters: target_feature : str
The name of the response column. The response column must contain more than one class and fewer than skutil.preprocessing.balance.BalancerMixin._max_classes classes.
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input frame
shuffle : bool, optional (default=True)
Whether or not to shuffle rows on return
Examples
Consider the following example: with a ratio of 0.5, the majority class (0) will be undersampled until the second most-populous class (1) is represented at a ratio of 0.5.
>>> def example():
...     import h2o
...     import pandas as pd
...     import numpy as np
...     from skutil.h2o.frame import value_counts
...     from skutil.h2o import from_pandas
...
...     # initialize h2o
...     h2o.init()
...
...     # read into pandas
...     x = pd.DataFrame(np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2]), columns=['A'])
...
...     # load into h2o
...     X = from_pandas(x)
...
...     # initialize sampler
...     sampler = H2OUndersamplingClassBalancer(target_feature="A", ratio=0.5)
...
...     X_balanced = sampler.balance(X)
...     return value_counts(X_balanced)
...
>>> example()
0    60
1    30
2    10
Name A, dtype: int64
New in version 0.1.0.
Methods
balance(X): Apply the undersampling balance operation.
get_params([deep]): Get parameters for this estimator.
load(location): Loads a persisted state of an instance of BaseH2OFunctionWrapper from disk.
save(location[, warn_if_exists]): Saves the BaseH2OFunctionWrapper to disk.
set_params(**params): Set the parameters of this estimator.
-
balance(X)[source]¶ Apply the undersampling balance operation. Undersamples the majority class to the provided ratio of minority class(es) : majority class.
Parameters: X : H2OFrame, shape=(n_samples, n_features)
The imbalanced dataset.
Returns: Xb : H2OFrame, shape=(n_samples, n_features)
The balanced H2OFrame
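The size the majority class is reduced to can be worked out from the class counts alone. This sketch covers only that arithmetic, under the assumption that the majority class is cut until the most-represented minority class reaches the requested ratio; it says nothing about how the remaining classes are handled. `majority_target_count` is a hypothetical helper.

```python
from collections import Counter


def majority_target_count(labels, ratio=0.2):
    """Return how many majority rows to keep for the given ratio (sketch)."""
    ranked = Counter(labels).most_common()
    majority_n = ranked[0][1]
    minority_n = ranked[1][1]  # the most-represented minority class
    if minority_n / majority_n >= ratio:
        return majority_n  # already balanced enough, nothing to remove
    return int(minority_n / ratio)


labels = [0] * 100 + [1] * 30 + [2] * 25
target = majority_target_count(labels, ratio=0.5)
```

With 30 rows in the largest minority class and a ratio of 0.5, the majority class is cut from 100 rows to 60, matching the count shown for class 0 in the example above.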
-
-
exception
skutil.h2o.
NAWarning
[source]¶ Bases:
exceptions.UserWarning
Custom warning used to notify user that an NA exists within an h2o frame (h2o can handle NA values)
-
class
skutil.h2o.
VizMixin
[source]¶ This mixin class provides the interface to plot an H2OEstimator’s fit performance over a timestep. Any structure that wraps an H2OEstimator’s fitting functionality should derive from this mixin.
Methods
plot(timestep, metric): Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
-
plot(timestep, metric)[source]¶ Plot an H2OEstimator’s performance over a given timestep (x-axis) against a provided metric (y-axis).
Parameters: timestep : str
A timestep as defined in the H2O API. One of (“AUTO”, “duration”, “number_of_trees”).
metric : str
The performance metric to evaluate. One of (“log_likelihood”, “objective”, “MSE”, “AUTO”)
-
-
skutil.h2o.
as_series
(x)[source]¶ Make a 1d H2OFrame into a pd.Series.
Parameters: x : H2OFrame, shape=(n_samples, 1)
The H2OFrame
Returns: x : Pandas Series, shape=(n_samples,)
The pandas series
-
skutil.h2o.
check_cv
(cv=3)[source]¶ Checks the cv parameter to determine whether it’s a valid int or H2OBaseCrossValidator.
Parameters: cv : int or H2OBaseCrossValidator, optional (default=3)
The number of folds or the H2OBaseCrossValidator instance.
Returns: cv : H2OBaseCrossValidator
The instance of H2OBaseCrossValidator
-
skutil.h2o.
check_frame
(X, copy=False)[source]¶ Returns X if X is an H2OFrame, else raises a TypeError. If copy is True, will return a copy of X instead.
Parameters: X : H2OFrame, shape=(n_samples, n_features)
The frame to evaluate
copy : bool, optional (default=False)
Whether to return a copy of the H2OFrame.
Returns: X : H2OFrame, shape=(n_samples, n_features)
The frame or the copy
-
skutil.h2o.
check_version
(min_version, max_version)[source]¶ Ensures the currently installed/running version of h2o is compatible with the min_version and max_version the function in question calls for.
Parameters: min_version : str, float
The minimum version of h2o that is compatible with the transformer
max_version : str, float
The maximum version of h2o that is compatible with the transformer
-
skutil.h2o.
from_array
(X, column_names=None)[source]¶ A simple wrapper for H2OFrame.from_python. This takes a numpy array (or 2d array) and returns an H2OFrame with all the default args.
Parameters: X : ndarray
The array to convert.
column_names : list, tuple (default=None)
the names to use for your columns
Returns: H2OFrame :
-
skutil.h2o.
from_pandas
(X)[source]¶ A simple wrapper for H2OFrame.from_python. This takes a pandas dataframe and returns an H2OFrame with all the default args (generally enough) plus named columns.
Parameters: X : pd.DataFrame
The dataframe to convert.
Returns: H2OFrame :
-
skutil.h2o.
h2o_accuracy_score
(y_actual, y_predict, normalize=True, sample_weight=None, y_type=None)[source]¶ Accuracy classification score for H2O
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
normalize : bool, optional (default=True)
Whether to average the data
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: score : float
New in version 0.1.0.
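The effect of normalize and sample_weight can be sketched on plain lists. This is an illustration that follows the conventions of sklearn's accuracy score (weighted fraction of matches when normalize is True, weighted count otherwise), which is an assumption about the H2O version; the function name `accuracy` is hypothetical.

```python
def accuracy(y_actual, y_predict, normalize=True, sample_weight=None):
    """Weighted accuracy on plain sequences (illustrative sketch)."""
    if len(y_actual) != len(y_predict):
        raise ValueError('y_actual and y_predict must have the same length')
    weights = sample_weight if sample_weight is not None else [1.0] * len(y_actual)

    # Sum the weights of the rows where the prediction matches the truth.
    hits = sum(w for a, p, w in zip(y_actual, y_predict, weights) if a == p)
    return hits / sum(weights) if normalize else hits


y_actual = [0, 1, 1, 0]
y_predict = [0, 1, 0, 0]
acc = accuracy(y_actual, y_predict)
n_correct = accuracy(y_actual, y_predict, normalize=False)
weighted = accuracy(y_actual, y_predict, sample_weight=[1, 1, 2, 1])
```

Giving the misclassified row a weight of 2 lowers the score from 0.75 to 0.6, since that row now counts twice in the denominator.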
-
skutil.h2o.
h2o_auc_score
(y_actual, y_predict, average='macro', sample_weight=None, y_type=None)[source]¶ Compute Area Under the Curve (AUC) using the trapezoidal rule. This implementation is restricted to the binary classification task or multilabel classification task in label indicator format.
NOTE: using H2OFrames, this would require moving the predict vector locally for each task in the average binary score task. It’s more efficient simply to bring both vectors local, and then use the sklearn h2o score. That’s what we’ll do for now.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
average : string, optional (default=’macro’)
One of [None, ‘micro’, ‘macro’ (default), ‘samples’, ‘weighted’]. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'micro': Calculate metrics globally by considering each element of the label indicator matrix as a label.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
'samples': Calculate metrics for each instance, and find their average.
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: auc : float
New in version 0.1.6.
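The trapezoidal rule mentioned above can be sketched for the binary case. This is a standalone illustration on plain lists of 0/1 labels and scores, not the code path used by h2o_auc_score (which, per the note above, brings the vectors local and defers to sklearn); `roc_auc` is a hypothetical name.

```python
def roc_auc(y_true, scores):
    """Binary ROC AUC via the trapezoidal rule (illustrative sketch)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row tied at this score before emitting a ROC point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))

    # Sum the trapezoids between consecutive ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For these four rows the ROC curve passes through (0, 0.5), (0.5, 0.5), (0.5, 1) and (1, 1), giving an area of 0.75.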
-
skutil.h2o.
h2o_bincount
(bins, weights=None, minlength=None)[source]¶ Given a 1d column of non-negative ints, bins, return a np.ndarray of positional counts of each int.
Parameters: bins : H2OFrame
The values
weights : list or H2OFrame, optional (default=None)
The weights with which to weight the output
minlength : int, optional (default=None)
The min length of the output array
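The positional counting can be sketched on a plain list. This illustration follows the behavior of numpy's bincount (weights are summed per bin and minlength pads the result with zeros), which is assumed to be what the H2O version mirrors; it returns a list rather than an np.ndarray.

```python
def bincount(bins, weights=None, minlength=None):
    """Count occurrences of each non-negative int by position (sketch)."""
    if any(b < 0 for b in bins):
        raise ValueError('bins must contain non-negative ints')

    size = max(bins) + 1 if bins else 0
    if minlength is not None:
        size = max(size, minlength)

    counts = [0] * size
    for i, b in enumerate(bins):
        # Add 1 per occurrence, or the row's weight when weights are given.
        counts[b] += 1 if weights is None else weights[i]
    return counts


plain = bincount([0, 1, 1, 3])
padded = bincount([0, 1, 1, 3], minlength=6)
weighted = bincount([0, 1, 1], weights=[0.5, 1.0, 1.0])
```

The value at index k is the number (or total weight) of rows equal to k, so a value that never appears, such as 2 above, yields a zero.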
-
skutil.h2o.
h2o_col_to_numpy
(column)[source]¶ Return a 1d numpy array from a single H2OFrame column.
Parameters: column : H2OFrame column, shape=(n_samples, 1)
A column from an H2OFrame
Returns: np.ndarray, shape=(n_samples,) :
-
skutil.h2o.
h2o_corr_plot
(X, plot_type='cor', cmap='Blues_d', n_levels=5, figsize=(11, 9), cmap_a=220, cmap_b=10, vmax=0.3, xticklabels=5, yticklabels=5, linewidths=0.5, cbar_kws={'shrink': 0.5}, use='complete.obs', na_warn=True, na_rm=False)[source]¶ Create a simple correlation plot given a dataframe. Note that this requires all datatypes to be numeric and finite!
Parameters: X : H2OFrame, shape=(n_samples, n_features)
The H2OFrame
plot_type : str, optional (default=’cor’)
The type of plot, one of (‘cor’, ‘kde’, ‘pair’)
cmap : str, optional (default=’Blues_d’)
The color to use for the kernel density estimate plot if plot_type == ‘kde’
n_levels : int, optional (default=5)
The number of levels to use for the kde plot if plot_type == ‘kde’
figsize : tuple (int), optional (default=(11,9))
The size of the image
cmap_a : int, optional (default=220)
The colormap start point
cmap_b : int, optional (default=10)
The colormap end point
vmax : float, optional (default=0.3)
Arg for seaborn heatmap
xticklabels : int, optional (default=5)
The spacing for X ticks
yticklabels : int, optional (default=5)
The spacing for Y ticks
linewidths : float, optional (default=0.5)
The width of the lines
cbar_kws : dict, optional
Any KWs to pass to seaborn’s heatmap when plot_type = ‘cor’
use : str, optional (default=’complete.obs’)
The “use” to compute the correlation matrix
na_warn : bool, optional (default=True)
Whether to warn in the presence of NA values
na_rm : bool, optional (default=False)
Whether to remove NAs
-
skutil.h2o.
h2o_f1_score
(y_actual, y_predict, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]¶ Compute the F1 score, the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
labels : list, optional (default=None)
The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.
pos_label : str or int, optional (default=1)
The class to report if average == 'binary' and the data is binary. If the data are multiclass, this will be ignored.
average : str, optional (default=’binary’)
One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
sample_weight : H2OFrame or float, optional (default=None)
The sample weights
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: f : float
The F-1 score
New in version 0.1.0.
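As a rough illustration of the binary case, the formula F1 = 2 * (precision * recall) / (precision + recall) can be evaluated on plain Python lists. The helper names below are hypothetical, the lists stand in for one-dimensional H2OFrames, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than documented skutil behaviour.

```python
def binary_counts(y_actual, y_predict, pos_label=1):
    """Return (tp, fp, fn) for the class named by pos_label."""
    pairs = list(zip(y_actual, y_predict))
    tp = sum(1 for a, p in pairs if a == pos_label and p == pos_label)
    fp = sum(1 for a, p in pairs if a != pos_label and p == pos_label)
    fn = sum(1 for a, p in pairs if a == pos_label and p != pos_label)
    return tp, fp, fn


def f1_from_labels(y_actual, y_predict, pos_label=1):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    tp, fp, fn = binary_counts(y_actual, y_predict, pos_label)
    # Assumption: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# 3 true positives, 1 false positive, 1 false negative.
score = f1_from_labels([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```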
-
skutil.h2o.
h2o_f_classif
(X, feature_names, target_feature)[source]¶ Compute the ANOVA F-value for the provided sample. This method is adapted from sklearn.feature_selection.f_classif to function on H2OFrames.
Parameters: X : H2OFrame, shape=(n_samples, n_features)
The feature matrix. Each feature will be tested sequentially.
feature_names : array_like (str)
The names of the feature columns in X to test.
target_feature : str
The name of the target feature, which is excluded from the tested features.
Returns: f : float
The computed F-value of the test.
prob : float
The associated p-value from the F-distribution.
New in version 0.1.2.
-
skutil.h2o.
h2o_f_oneway
(*args)[source]¶ Performs a 1-way ANOVA. The one-way ANOVA tests the null hypothesis that 2 or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.
Parameters: sample1, sample2, ... : array_like, H2OFrames, shape=(n_classes,)
The sample measurements should be given as varargs (*args). A slice of the original input frame for each class in the target feature.
Returns: f : float
The computed F-value of the test.
prob : float
The associated p-value from the F-distribution.
Notes
The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid.
- The samples are independent
- Each sample is from a normally distributed population
- The population standard deviations of the groups are all equal. This property is known as homoscedasticity.
If these assumptions are not true for a given set of data, it may still be possible to use the Kruskal-Wallis H-test (scipy.stats.kruskal), although with some loss of power.
The algorithm is from Heiman [R6], pp. 394-7. See scipy.stats.f_oneway and sklearn.feature_selection.f_oneway.
References
[R5] Lowry, Richard. “Concepts and Applications of Inferential Statistics”. Chapter 14. http://faculty.vassar.edu/lowry/ch14pt1.html
[R6] Heiman, G.W. Research Methods in Statistics. 2002.
New in version 0.1.2.
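The F-value of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch of that textbook computation on plain Python lists; the name `f_oneway_statistic` is hypothetical, the lists stand in for per-class slices of an H2OFrame, and the p-value is left out because it requires the F-distribution (available, for example, through scipy.stats.f_oneway).

```python
def f_oneway_statistic(*samples):
    """Return the one-way ANOVA F statistic for two or more groups."""
    if len(samples) < 2:
        raise ValueError("at least two groups are required")
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]
    # Between-group sum of squares: group means against the grand mean.
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    # Within-group sum of squares: observations against their group mean.
    ss_within = sum((x - m) ** 2
                    for s, m in zip(samples, means) for x in s)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Group means are 2, 3 and 6; each group has a sum of squares of 2.
f_value = f_oneway_statistic([1, 2, 3], [2, 3, 4], [5, 6, 7])
```

For these three groups the between-group sum of squares is 26 (2 degrees of freedom) and the within-group sum of squares is 6 (6 degrees of freedom), giving an F-value of 13.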
-
skutil.h2o.
h2o_fbeta_score
(y_actual, y_predict, beta, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]¶ Compute the F-beta score. The F-beta score is the weighted harmonic mean of precision and recall.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
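beta : float
The weight of recall relative to precision in the F-score. beta > 1 favors recall, beta < 1 favors precision, and beta = 1 gives the F1 score.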
labels : list, optional (default=None)
The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.
pos_label : str or int, optional (default=1)
The class to report if average == 'binary' and the data is binary. If the data are multiclass, this will be ignored.
average : str, optional (default=’binary’)
One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
sample_weight : H2OFrame or float, optional (default=None)
The sample weights
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: f : float
The F-beta score
New in version 0.1.0.
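The weighted harmonic mean behind the F-beta score is commonly written as (1 + beta^2) * P * R / (beta^2 * P + R). A small sketch of that standard formula follows, taking precision and recall as plain floats; the function name is hypothetical, and returning 0.0 for a zero denominator is an assumption of this sketch.

```python
def fbeta_from_precision_recall(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Assumption: report 0.0 when both precision and recall are 0.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# With precision 0.5 and recall 1.0, a larger beta rewards the high recall.
f_half = fbeta_from_precision_recall(0.5, 1.0, beta=0.5)
f_one = fbeta_from_precision_recall(0.5, 1.0, beta=1.0)
f_two = fbeta_from_precision_recall(0.5, 1.0, beta=2.0)
```

The three values increase with beta (5/9, 2/3 and 5/6), which shows how beta shifts the score toward recall.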
-
skutil.h2o.
h2o_frame_memory_estimate
(X, bit_est=32, unit='MB')[source]¶ Estimate the memory footprint of an H2OFrame, for example to judge whether it can be held in memory.
Parameters: X : H2OFrame
The H2OFrame in question
bit_est : int, optional (default=32)
The estimated bit-size of each cell. The default assumes each cell is a 32-bit float
unit : str, optional (default=’MB’)
The units to report. One of (‘MB’, ‘KB’, ‘GB’, ‘TB’)
Returns: mb : str
The estimated size of the frame, expressed in the requested unit
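One plausible way to arrive at such an estimate is rows times columns times bits per cell, converted to the requested unit. The sketch below is an assumption about the arithmetic, not the skutil source: the function name is hypothetical, it takes the frame dimensions directly instead of an H2OFrame, it uses binary units (1 KB = 1024 bytes), and it returns a float where the documented function returns a string.

```python
# Assumption: binary units, so 1 KB = 1024 bytes.
UNIT_BYTES = {
    'KB': 1024,
    'MB': 1024 ** 2,
    'GB': 1024 ** 3,
    'TB': 1024 ** 4,
}


def frame_memory_estimate_sketch(n_rows, n_cols, bit_est=32, unit='MB'):
    """Estimate the footprint of an n_rows x n_cols frame in the given unit."""
    if unit not in UNIT_BYTES:
        raise ValueError("unit must be one of %s" % sorted(UNIT_BYTES))
    # Every cell is assumed to occupy bit_est bits (8 bits per byte).
    n_bytes = n_rows * n_cols * bit_est / 8
    return n_bytes / UNIT_BYTES[unit]


# 1,048,576 rows x 8 columns of 32-bit cells.
estimate_mb = frame_memory_estimate_sketch(1048576, 8, bit_est=32, unit='MB')
estimate_kb = frame_memory_estimate_sketch(1048576, 8, bit_est=32, unit='KB')
```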
-
skutil.h2o.
h2o_log_loss
(y_actual, y_predict, eps=1e-15, normalize=True, sample_weight=None, y_type=None)[source]¶ Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is
-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))
This method is adapted from the sklearn.metrics.classification.log_loss function for use with H2OFrames in skutil.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples, [n_classes])
The predicted labels. Can represent a matrix. If y_predict.shape = (n_samples,) the probabilities provided are assumed to be that of the positive class. The labels in y_predict are assumed to be ordered ordinally.
eps : float, optional (default=1e-15)
Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).
normalize : bool, optional (default=True)
If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: loss : float
Notes
The logarithm used is the natural logarithm (base-e).
New in version 0.1.6.
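For the binary case, the per-sample formula and the eps clipping described above can be sketched on plain Python lists. The function name is hypothetical, the lists stand in for H2OFrames, and sample weights and the multiclass case are not covered.

```python
import math


def binary_log_loss(y_actual, y_prob, eps=1e-15, normalize=True):
    """Log loss for 0/1 labels and positive-class probabilities."""
    losses = []
    for yt, yp in zip(y_actual, y_prob):
        # Clip so that log(0) is never evaluated.
        yp = max(eps, min(1 - eps, yp))
        losses.append(-(yt * math.log(yp) + (1 - yt) * math.log(1 - yp)))
    total = sum(losses)
    # normalize=True returns the mean loss, otherwise the sum.
    return total / len(losses) if normalize else total


mean_loss = binary_log_loss([1, 0], [0.9, 0.2])
summed_loss = binary_log_loss([1, 0], [0.9, 0.2], normalize=False)
# A confident but wrong prediction stays finite thanks to the clipping.
clipped_loss = binary_log_loss([0], [1.0])
```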
-
skutil.h2o.
h2o_mean_absolute_error
(y_actual, y_predict, sample_weight=None, y_type=None)[source]¶ Mean absolute error score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: score : float
The mean absolute error
New in version 0.1.0.
-
skutil.h2o.
h2o_mean_squared_error
(y_actual, y_predict, sample_weight=None, y_type=None)[source]¶ Mean squared error score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: score : float
New in version 0.1.0.
-
skutil.h2o.
h2o_median_absolute_error
(y_actual, y_predict, sample_weight=None, y_type=None)[source]¶ Median absolute error score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: score : float
The median absolute error score
New in version 0.1.0.
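The mean absolute, mean squared and median absolute errors differ only in how the per-sample residuals are aggregated. The sketch below shows the unweighted definitions on plain Python lists; the function names are hypothetical, and the distributed, weighted computation that skutil performs on H2OFrames is not reproduced.

```python
import statistics


def residuals(y_actual, y_predict):
    """Per-sample differences between truth and prediction."""
    return [a - p for a, p in zip(y_actual, y_predict)]


def mean_absolute_error(y_actual, y_predict):
    return statistics.mean(abs(r) for r in residuals(y_actual, y_predict))


def mean_squared_error(y_actual, y_predict):
    return statistics.mean(r ** 2 for r in residuals(y_actual, y_predict))


def median_absolute_error(y_actual, y_predict):
    # The median is robust to a few very large residuals.
    return statistics.median(abs(r) for r in residuals(y_actual, y_predict))


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
```

For these values the absolute residuals are 0.5, 0.5, 0 and 1, so the mean absolute error is 0.5, the mean squared error is 0.375 and the median absolute error is 0.5.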
-
skutil.h2o.
h2o_precision_score
(y_actual, y_predict, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]¶ Compute the precision. Precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
labels : list, optional (default=None)
The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.
pos_label : str or int, optional (default=1)
The class to report if average == 'binary' and the data is binary. If the data are multiclass, this will be ignored.
average : str, optional (default=’binary’)
One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
sample_weight : H2OFrame or float, optional (default=None)
The sample weights
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: p : float
The precision score
New in version 0.1.0.
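The role of pos_label in the binary case can be illustrated with string labels. This is a sketch on plain Python lists with a hypothetical function name; returning 0.0 when nothing is predicted positive is an assumption of the sketch.

```python
def binary_precision(y_actual, y_predict, pos_label=1):
    """Precision = tp / (tp + fp) for the class named by pos_label."""
    pairs = list(zip(y_actual, y_predict))
    tp = sum(1 for a, p in pairs if p == pos_label and a == pos_label)
    fp = sum(1 for a, p in pairs if p == pos_label and a != pos_label)
    if tp + fp == 0:
        # Assumption: no positive predictions is reported as 0.0.
        return 0.0
    return tp / (tp + fp)


actual = ['yes', 'no', 'yes', 'yes', 'no']
predicted = ['yes', 'yes', 'no', 'yes', 'no']
# Three rows are predicted 'yes'; two of them are truly 'yes'.
precision_yes = binary_precision(actual, predicted, pos_label='yes')
# Two rows are predicted 'no'; one of them is truly 'no'.
precision_no = binary_precision(actual, predicted, pos_label='no')
```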
-
skutil.h2o.
h2o_r2_score
(y_actual, y_predict, sample_weight=None, y_type=None)[source]¶ R^2 score for H2O frames. Provides fast computation in a distributed fashion without loading all of the data into memory.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
sample_weight : H2OFrame or float, optional (default=None)
A frame of sample weights of matching dims with y_actual and y_predict.
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: score : float
The R^2 score
New in version 0.1.0.
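The R^2 score is conventionally defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal unweighted sketch of that definition on plain Python lists follows; the function name is hypothetical and a constant y_actual (zero total sum of squares) is not handled.

```python
def r2_from_lists(y_actual, y_predict):
    """R^2 = 1 - SS_res / SS_tot (unweighted)."""
    mean_actual = sum(y_actual) / len(y_actual)
    # Residual sum of squares: truth against prediction.
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predict))
    # Total sum of squares: truth against its own mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in y_actual)
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
r2 = r2_from_lists(y_true, [2.5, 0.0, 2.0, 8.0])
r2_perfect = r2_from_lists(y_true, y_true)
# Always predicting the mean of y_true explains none of the variance.
r2_mean = r2_from_lists(y_true, [2.875] * 4)
```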
-
skutil.h2o.
h2o_recall_score
(y_actual, y_predict, labels=None, pos_label=1, average='binary', sample_weight=None, y_type=None)[source]¶ Compute the recall. Recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives.
Parameters: y_actual : H2OFrame, shape=(n_samples,)
The one-dimensional ground truth
y_predict : H2OFrame, shape=(n_samples,)
The one-dimensional predicted labels
labels : list, optional (default=None)
The set of labels to include when average != 'binary', and their order if average is None. By default all labels in y_actual and y_predict are used in sorted order.
pos_label : str or int, optional (default=1)
The class to report if average == 'binary' and the data is binary. If the data are multiclass, this will be ignored.
average : str, optional (default=’binary’)
One of (‘binary’, ‘micro’, ‘macro’, ‘weighted’). This parameter is required for multiclass targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
sample_weight : H2OFrame, optional (default=None)
The sample weights
y_type : string, optional (default=None)
The type of the column. If None, will be determined.
Returns: r : float
The recall score
New in version 0.1.0.
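The averaging modes are easiest to compare on a small multiclass example. The sketch below computes per-label recall on plain Python lists and then applies the 'macro', 'micro' and 'weighted' rules as described above; the function names are hypothetical, and only labels that occur in y_actual are scored (a simplification of the labels parameter).

```python
from collections import Counter


def recall_by_label(y_actual, y_predict):
    """Return ({label: recall}, {label: support}) for labels in y_actual."""
    support = Counter(y_actual)
    hits = Counter(a for a, p in zip(y_actual, y_predict) if a == p)
    per_label = {label: hits[label] / support[label]
                 for label in sorted(support)}
    return per_label, support


def averaged_recall(y_actual, y_predict, average='macro'):
    per_label, support = recall_by_label(y_actual, y_predict)
    if average is None:
        return per_label
    if average == 'macro':
        # Unweighted mean: every label counts equally.
        return sum(per_label.values()) / len(per_label)
    if average == 'weighted':
        # Mean weighted by the number of true instances per label.
        total = sum(support.values())
        return sum(per_label[l] * support[l] for l in per_label) / total
    if average == 'micro':
        # Global counts: all true positives over all true instances.
        hits = sum(1 for a, p in zip(y_actual, y_predict) if a == p)
        return hits / len(y_actual)
    raise ValueError("unsupported average: %r" % (average,))


actual = ['a'] * 6 + ['b', 'b', 'c', 'c']
predicted = ['a'] * 6 + ['b', 'c', 'c', 'a']
per_label = averaged_recall(actual, predicted, average=None)
macro = averaged_recall(actual, predicted, average='macro')
micro = averaged_recall(actual, predicted, average='micro')
weighted = averaged_recall(actual, predicted, average='weighted')
```

The dominant label 'a' is recalled perfectly while 'b' and 'c' are each recalled half of the time, so the macro average (2/3) is lower than the micro and weighted averages (both 0.8).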
-
skutil.h2o.
h2o_train_test_split
(frame, test_size=None, train_size=None, random_state=None, stratify=None)[source]¶ Splits an H2OFrame into random train and test subsets
Parameters: frame : H2OFrame
The h2o frame to split
test_size : float, int, or None (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25
train_size : float, int, or None (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
stratify : str or None (default=None)
The name of the target on which to stratify the sampling
Returns: out : tuple, shape=(2,)
- training_frame : H2OFrame
The training fold split
- testing_frame : H2OFrame
The testing fold split
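The rules for test_size and train_size can be sketched as a small size-resolution step followed by a seeded shuffle of row indices. This is an illustration under stated assumptions, not the skutil implementation: the function names are hypothetical, fractional sizes are rounded up for the test set and down for the train set, and stratification is left out.

```python
import math
import random


def resolve_split_sizes(n_samples, test_size=None, train_size=None):
    """Turn float/int/None sizes into absolute (n_train, n_test) counts."""
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when both are None
    n_test = n_train = None
    if test_size is not None:
        # Assumption: a fractional test size is rounded up.
        n_test = (math.ceil(test_size * n_samples)
                  if isinstance(test_size, float) else int(test_size))
    if train_size is not None:
        # Assumption: a fractional train size is rounded down.
        n_train = (math.floor(train_size * n_samples)
                   if isinstance(train_size, float) else int(train_size))
    # A missing size is the complement of the other one.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test
    if n_train + n_test > n_samples:
        raise ValueError("train and test sizes exceed the number of rows")
    return n_train, n_test


def split_indices(n_samples, test_size=None, train_size=None,
                  random_state=None):
    """Return shuffled (train_indices, test_indices) row positions."""
    n_train, n_test = resolve_split_sizes(n_samples, test_size, train_size)
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    return indices[:n_train], indices[n_train:n_train + n_test]


train_idx, test_idx = split_indices(20, random_state=42)
```

With both sizes left as None, 25% of the 20 rows go to the test split and the remaining 15 rows to the train split; the same random_state reproduces the same partition.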
-
skutil.h2o.
is_float
(x)[source]¶ Determine whether a 1d H2OFrame is made up of floats.
Parameters: x : H2OFrame, shape=(n_samples, 1)
The H2OFrame
Returns: bool : True if float, else False
-
skutil.h2o.
is_integer
(x)[source]¶ Determine whether a 1d H2OFrame is made up of integers.
Parameters: x : H2OFrame, shape=(n_samples, 1)
The H2OFrame
Returns: bool : True if integers, else False
-
skutil.h2o.
is_numeric
(x)[source]¶ Determine whether a 1d H2OFrame is numeric.
Parameters: x : H2OFrame, shape=(n_samples, 1)
The H2OFrame
Returns: bool : True if numeric, else False
-
skutil.h2o.
load_boston_h2o
(include_tgt=True, tgt_name='target', shuffle=False)[source]¶ Load the boston housing dataset into an H2OFrame
Parameters: include_tgt : bool, optional (default=True)
Whether or not to include the target
tgt_name : str, optional (default=”target”)
The name of the target column.
shuffle : bool, optional (default=False)
Whether or not to shuffle the data
-
skutil.h2o.
load_breast_cancer_h2o
(include_tgt=True, tgt_name='target', shuffle=False)[source]¶ Load the breast cancer dataset into an H2OFrame
Parameters: include_tgt : bool, optional (default=True)
Whether or not to include the target
tgt_name : str, optional (default=”target”)
The name of the target column.
shuffle : bool, optional (default=False)
Whether or not to shuffle the data
-
skutil.h2o.
load_iris_h2o
(include_tgt=True, tgt_name='Species', shuffle=False)[source]¶ Load the iris dataset into an H2OFrame
Parameters: include_tgt : bool, optional (default=True)
Whether or not to include the target
tgt_name : str, optional (default=”Species”)
The name of the target column.
shuffle : bool, optional (default=False)
Whether or not to shuffle the data
-
skutil.h2o.
make_h2o_scorer
(score_function, y_actual)[source]¶ Make a scoring function from a callable. The signature for the callable should resemble:
some_function(y_actual=y_actual, y_predict=y_pred, y_type=None, **kwargs)
Parameters: score_function : callable
The function
y_actual : H2OFrame, shape=(n_samples,)
A one-dimensional H2OFrame (the ground truth). This is used to determine beforehand whether the type is binary or multiclass.
Returns: score_class : _H2OScorer
An instance of _H2OScorer whose score method will be used for scoring in the skutil.h2o.grid_search module.
New in version 0.1.0.
-
skutil.h2o.
rbind_all
(*args)[source]¶ Given a variable set of H2OFrames, rbind all of them into a single H2OFrame.
Parameters: array1, array2, ... : H2OFrame, shape=(n_samples, n_features)
The H2OFrames to rbind. All should match in column dimensionality.
Returns: f : H2OFrame
The rbound H2OFrame
-
skutil.h2o.
reorder_h2o_frame
(X, idcs, from_chunks=False)[source]¶ Currently, H2O does not allow us to reorder frames. This is a hack to rbind rows together in the order prescribed.
Parameters: X : H2OFrame
The H2OFrame to reorder
idcs : iterable
The order of the H2OFrame rows to be returned.
from_chunks : bool, optional (default=False)
Whether the elements in idcs are optimized chunks generated by _gen_optimized_chunks.
Returns: new_frame : H2OFrame
The reordered H2OFrame
-
skutil.h2o.
shuffle_h2o_frame
(X)[source]¶ Currently, H2O does not allow us to shuffle frames. This is a hack to rbind rows together in a shuffled order.
Parameters: X : H2OFrame
The H2OFrame to shuffle
Returns: shuf : H2OFrame
The shuffled H2OFrame
-
skutil.h2o.
validate_x
(x)[source]¶ Given an iterable or None, x, validate that if it is an iterable, it only contains string types.
Parameters: x : None or iterable, shape=(n_features,)
The feature names
Returns: x : None or iterable, shape=(n_features,)
The feature names
-
skutil.h2o.
validate_x_y
(X, feature_names, target_feature, exclude_features=None)[source]¶ Validate the feature_names and target_feature arguments passed to an H2OTransformer.
Parameters: X : H2OFrame, shape=(n_samples, n_features)
The frame from which to drop
feature_names : iterable or None
The feature names to be used in a transformer. If feature_names is None, the transformer will use all of the frame’s column names. However, if the feature_names are an iterable, they must all be either strings or unicode names of columns in the frame.
target_feature : str, unicode or None
The target name to exclude from the transformer analysis. If None, unsupervised is assumed, otherwise must be string or unicode.
exclude_features : iterable or None, optional (default=None)
Any names that should be excluded from x
Returns: feature_names : list, str
A list of the feature_names as strings
target_feature : str or None
The target_feature as a string if it is not None, else None
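The validation rules described above can be sketched against a plain list of column names. This is only an approximation of the documented behaviour: the function name is hypothetical, `columns` stands in for the column names of the H2OFrame, and the choice of exception types and of dropping the target and excluded names from the returned list are assumptions.

```python
def validate_names_sketch(columns, feature_names=None, target_feature=None,
                          exclude_features=None):
    """Return (feature_names, target_feature) after basic checks."""
    if feature_names is None:
        # No names given: fall back to every column in the frame.
        feature_names = list(columns)
    else:
        feature_names = list(feature_names)
        if not all(isinstance(name, str) for name in feature_names):
            raise TypeError("feature_names must contain only strings")
        missing = [name for name in feature_names if name not in columns]
        if missing:
            raise ValueError("names not found in frame: %s" % missing)
    if target_feature is not None and not isinstance(target_feature, str):
        raise TypeError("target_feature must be a string or None")
    # Assumption: the target and any excluded names are removed.
    dropped = set(exclude_features or [])
    if target_feature is not None:
        dropped.add(target_feature)
    feature_names = [name for name in feature_names if name not in dropped]
    return feature_names, target_feature


columns = ['a', 'b', 'c', 'y']
names, target = validate_names_sketch(columns, None, 'y', ['b'])
unsupervised_names, no_target = validate_names_sketch(columns, ['a', 'c'])
```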