skutil.utils module
skutil.utils provides common utility functions for the skutil library. skutil.utils.fixes adds adaptations that bridge the behavior differences between scikit-learn 0.17 and 0.18. skutil.utils.metaestimators adapts scikit-learn's metaestimators for more specific use within skutil.
skutil.utils.corr_plot(X, plot_type='cor', cmap='Blues_d', n_levels=5, corr=None, method='pearson', figsize=(11, 9), cmap_a=220, cmap_b=10, vmax=0.3, xticklabels=5, yticklabels=5, linewidths=0.5, cbar_kws=None)
Create a simple correlation plot given a dataframe. Note that this requires all datatypes to be numeric and finite!
Parameters: X : pd.DataFrame, shape=(n_samples, n_features)
The pandas DataFrame on which to compute correlations, or if corr is 'precomputed', the correlation matrix. In the case that X is a correlation matrix, it must be square, i.e., shape=(n_features, n_features).
plot_type : str, optional (default='cor')
The type of plot, one of ('cor', 'kde', 'pair')
cmap : str, optional (default='Blues_d')
The color to use for the kernel density estimate plot if plot_type == 'kde'. Otherwise unused.
n_levels : int, optional (default=5)
The number of levels to use for the kde plot if plot_type == 'kde'. Otherwise unused.
corr : 'precomputed' or None, optional (default=None)
If None, the correlation matrix is computed, otherwise if 'precomputed', X is treated as a correlation matrix.
method : str, optional (default='pearson')
The method to use for correlation
figsize : tuple (int), shape=(w,h), optional (default=(11,9))
The size of the image
cmap_a : int, optional (default=220)
The colormap start point
cmap_b : int, optional (default=10)
The colormap end point
vmax : float, optional (default=0.3)
The vmax argument passed to seaborn's heatmap
xticklabels : int, optional (default=5)
The spacing for X ticks
yticklabels : int, optional (default=5)
The spacing for Y ticks
linewidths : float, optional (default=0.5)
The width of the lines
cbar_kws : dict, optional (default=None)
Any keyword arguments to pass to seaborn's heatmap when plot_type == 'cor'. If None, defaults to {'shrink': 0.5}.
skutil.utils.df_memory_estimate(X, unit='MB', index=False)
Estimate the memory footprint of a frame to determine whether it can be held in memory.
Parameters: X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)
The DataFrame in question
unit : str, optional (default=’MB’)
The units to report. One of (‘MB’, ‘KB’, ‘GB’, ‘TB’)
index : bool, optional (default=False)
Whether to also estimate the memory footprint of the index.
Returns: mb : str
The estimated memory footprint of the frame, reported in the requested unit
skutil.utils.dict_keys(d)
In Python 3, the d.keys() method returns a view and not an actual list. This function returns the keys as a list.
Parameters: d : dict
The dictionary
Returns: list :
skutil.utils.dict_values(d)
In Python 3, the d.values() method returns a view and not an actual list. This function returns the values as a list.
Parameters: d : dict
The dictionary
Returns: list :
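The view behavior that motivates dict_keys and dict_values can be reproduced with plain Python 3. The helper names below are hypothetical stand-ins, not the skutil implementation; they simply wrap the views in list().

```python
# Hypothetical stand-ins for the helpers documented above (not skutil's code).
def keys_as_list(d):
    """Return the keys of ``d`` as a real list rather than a view."""
    return list(d.keys())


def values_as_list(d):
    """Return the values of ``d`` as a real list rather than a view."""
    return list(d.values())


d = {"a": 1, "b": 2}
view = d.keys()           # a dict view: not indexable, tracks later changes
keys = keys_as_list(d)    # a snapshot list: indexable, independent of d
d["c"] = 3

print(len(view), len(keys))   # the view sees the new key, the list does not
print(keys[0], values_as_list(d))
```

A list is needed whenever the result must be indexed, sliced, or kept stable while the dictionary changes.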
skutil.utils.exp(x)
A safe mechanism for computing the exponential function while avoiding overflows.
Parameters: x : float, number
The number for which to compute the exp
Returns: exp(x) :
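One common way to avoid overflow is to clamp the argument before exponentiating. The sketch below assumes that approach; the bound of 700.0 and the name safe_exp are illustrative and may not match what skutil.utils.exp does internally.

```python
import math

# Conservative bound: math.exp overflows a double slightly above 709.78.
_MAX_EXP_ARG = 700.0


def safe_exp(x):
    """Compute exp(x), clamping x so the result never overflows a float."""
    return math.exp(min(x, _MAX_EXP_ARG))


print(safe_exp(1.0))      # ordinary inputs are unaffected
print(safe_exp(1000.0))   # math.exp(1000.0) alone would raise OverflowError
```

Clamping trades exactness at the extreme end for a finite result, which is usually acceptable when the value feeds a ratio or a comparison.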
skutil.utils.flatten_all(container)
Recursively flattens an arbitrarily nested iterable. WARNING: this function may produce a list of mixed types.
Parameters: container : array_like, shape=(n_items,)
The iterable to flatten. If the container is not iterable, it will be returned in a list as [container]
Returns: l : list, shape=(n_items,)
The flattened list
Examples
The example below produces a list of mixed results:
>>> a = [[[],3,4],['1','a'],[[[1]]],1,2]
>>> flatten_all(a)
[3, 4, '1', 'a', 1, 1, 2]
skutil.utils.flatten_all_generator(container)
Recursively flattens an arbitrarily nested iterable, yielding the items lazily. WARNING: this function may yield items of mixed types.
Parameters: container : array_like, shape=(n_items,)
The iterable to flatten. If the container is not iterable, it will be returned in a list as [container]
Returns: generator object :
Examples
The example below yields items of mixed types:
>>> a = [[[],3,4],['1','a'],[[[1]]],1,2]
>>> list(flatten_all_generator(a))  # consume the generator into a list
[3, 4, '1', 'a', 1, 1, 2]
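A recursive generator is one way to flatten lazily. The sketch below is an assumption about the approach, not the skutil source; it treats strings as atoms so that '1' and 'a' come out whole, matching the example output. The name flatten_lazily is hypothetical.

```python
def flatten_lazily(container):
    """Yield the leaves of an arbitrarily nested iterable, one at a time."""
    # Strings are iterable in Python 3 but are treated as single items here.
    if isinstance(container, (str, bytes)) or not hasattr(container, "__iter__"):
        yield container
        return
    for item in container:
        # Delegate to the recursive call so nesting depth does not matter.
        yield from flatten_lazily(item)


a = [[[], 3, 4], ["1", "a"], [[[1]]], 1, 2]
gen = flatten_lazily(a)
first = next(gen)    # only the work needed to reach the first leaf is done
rest = list(gen)     # consuming the generator produces the remaining leaves
print(first, rest)
```

Wrapping the generator in list() gives the same result as an eager flatten, while next() lets a caller stop early without walking the whole structure.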
skutil.utils.get_numeric(X)
Return a list of the indices of the columns with numeric dtypes.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The dataframe
Returns: list, int :
The list of indices which are numeric.
skutil.utils.human_bytes(b, unit='MB')
Get bytes in a human readable form
Parameters: b : int
The number of bytes
unit : str, optional (default=’MB’)
The units to report. One of (‘MB’, ‘KB’, ‘GB’, ‘TB’)
Returns: mb : str
The number of bytes expressed in the requested unit
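The conversion can be sketched as a division by a power of 1024. Both the binary (1024-based) units and the output format below are assumptions, since the source only states that a string is returned. format_bytes is a hypothetical name.

```python
# Hypothetical sketch; skutil.utils.human_bytes may format its output differently.
_UNIT_POWERS = {"KB": 1, "MB": 2, "GB": 3, "TB": 4}


def format_bytes(b, unit="MB"):
    """Express ``b`` bytes in ``unit`` as a string, assuming 1 KB = 1024 bytes."""
    key = unit.upper()
    if key not in _UNIT_POWERS:
        raise ValueError("unit must be one of %s" % sorted(_UNIT_POWERS))
    return "%.3f %s" % (b / 1024.0 ** _UNIT_POWERS[key], key)


print(format_bytes(5 * 1024 ** 2))       # five binary megabytes
print(format_bytes(1536, unit="KB"))
```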
skutil.utils.is_entirely_numeric(X)
Determines whether an entire pandas frame is numeric in dtypes.
Parameters: X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)
The dataframe to test
Returns: bool :
True if the entire pd.DataFrame is numeric else False
skutil.utils.is_float(x)
Determine whether some object x is a float type (float, np.float, etc).
Parameters: x : object
The item to assess
Returns: bool :
True if x is a float type
skutil.utils.is_integer(x)
Determine whether some object x is an integer type (int, long, etc).
Parameters: x : object
The item to assess
Returns: bool :
True if x is an integer type
skutil.utils.is_iterable(x)
Python 3.x adds the __iter__ attribute to strings. Thus, our previous tests for iterable will fail when using hasattr.
Parameters: x : object
The object or primitive to test for iterability.
Returns: bool :
True if x is an iterable
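The pitfall is easy to reproduce: in Python 3 a str has __iter__, so a bare hasattr check reports strings as iterable. A minimal sketch of a check that excludes them follows. Treating strings as non-iterable is inferred from the description, and the function name is hypothetical.

```python
def iterable_but_not_string(x):
    """Return True for containers and generators, False for strings and scalars."""
    if isinstance(x, (str, bytes)):
        return False
    return hasattr(x, "__iter__")


print(hasattr("abc", "__iter__"))          # True in Python 3
print(iterable_but_not_string("abc"))      # False
print(iterable_but_not_string([1, 2, 3]))  # True
print(iterable_but_not_string(42))         # False
```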
skutil.utils.is_numeric(x)
Determine whether some object x is a numeric type (float, int, etc).
Parameters: x : object
The item to assess
Returns: bool :
True if x is a float or integer type
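Type checks like is_float, is_integer, and is_numeric can be sketched with the abstract base classes in the standard numbers module. The function names are hypothetical, and excluding bool is a choice made for this sketch, not something the source states.

```python
import numbers


def looks_like_integer(x):
    """True for int-like objects; bool is excluded by choice in this sketch."""
    return isinstance(x, numbers.Integral) and not isinstance(x, bool)


def looks_like_float(x):
    """True for real, non-integer number types such as float."""
    return isinstance(x, numbers.Real) and not isinstance(x, numbers.Integral)


def looks_like_numeric(x):
    """True if x is an integer or a float type."""
    return looks_like_integer(x) or looks_like_float(x)


for value in (3, 3.0, "3", True, None):
    print(repr(value), looks_like_integer(value), looks_like_float(value),
          looks_like_numeric(value))
```

Checking against abstract base classes rather than concrete types means a type only needs to register with the numbers hierarchy to be recognized.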
skutil.utils.load_boston_df(include_tgt=True, tgt_name='target', shuffle=False)
Loads the Boston housing dataset into a dataframe with the target set as the "target" feature or whatever name is specified in tgt_name.
Parameters: include_tgt : bool, optional (default=True)
Whether to include the target
tgt_name : str, optional (default="target")
The name of the target feature
shuffle : bool, optional (default=False)
Whether to shuffle the rows
Returns: X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)
The loaded dataset
skutil.utils.load_breast_cancer_df(include_tgt=True, tgt_name='target', shuffle=False)
Loads the breast cancer dataset into a dataframe with the target set as the "target" feature or whatever name is specified in tgt_name.
Parameters: include_tgt : bool, optional (default=True)
Whether to include the target
tgt_name : str, optional (default=”target”)
The name of the target feature
shuffle : bool, optional (default=False)
Whether to shuffle the rows
Returns: X : pd.DataFrame, shape=(n_samples, n_features)
The loaded dataset
skutil.utils.load_iris_df(include_tgt=True, tgt_name='Species', shuffle=False)
Loads the iris dataset into a dataframe with the target set as the "Species" feature or whatever name is specified in tgt_name.
Parameters: include_tgt : bool, optional (default=True)
Whether to include the target
tgt_name : str, optional (default=”Species”)
The name of the target feature
shuffle : bool, optional (default=False)
Whether to shuffle the rows on return
Returns: X : pd.DataFrame, shape=(n_samples, n_features)
The loaded dataset
skutil.utils.log(x)
A safe mechanism for computing a log while avoiding NaNs or exceptions.
Parameters: x : float, number
The number for which to compute the log
Returns: log(x) :
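One way to keep a logarithm from raising on non-positive input is to clamp the argument to a tiny positive floor first. The sketch below assumes that strategy. The floor value and the name safe_log are illustrative, and skutil.utils.log may handle such inputs differently.

```python
import math
import sys

# Smallest positive normal double, used here as an assumed floor for the argument.
_TINY = sys.float_info.min


def safe_log(x):
    """Compute log(x), clamping x to a tiny positive value first."""
    return math.log(max(x, _TINY))


print(safe_log(math.e))   # ordinary inputs are unaffected
print(safe_log(0.0))      # a large negative number instead of a ValueError
print(safe_log(-5.0))     # the same floor applies to negative inputs
```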
skutil.utils.pd_stats(X, col_type='all', na_str='--', hi_skew_thresh=1.0, mod_skew_thresh=0.5)
Get a descriptive report of the elements in the data frame. Builds on the existing pandas describe method by adding counts of factor-level features, a skewness rating and several other helpful statistics.
Parameters: X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)
The DataFrame on which to compute stats.
col_type : str, optional (default='all')
The types of columns to analyze. One of ('all', 'numeric', 'object'). If not 'all', will only return corresponding typed columns.
na_str : str, optional (default='--')
The string to display in a cell that is not applicable for the column's datatype.
hi_skew_thresh : float, optional (default=1.0)
The threshold above which a skewness rating will be deemed "high."
mod_skew_thresh : float, optional (default=0.5)
The threshold above which a skewness rating will be deemed "moderate," so long as it does not exceed hi_skew_thresh
Returns: s : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)
The resulting stats dataframe
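The two skewness thresholds split values into three bands. A sketch of that rule follows. Taking the absolute value and the "low" label are assumptions, since the source only names the "high" and "moderate" ratings, and skew_rating is a hypothetical helper.

```python
def skew_rating(skewness, hi_skew_thresh=1.0, mod_skew_thresh=0.5):
    """Classify a skewness value using the two thresholds described above."""
    magnitude = abs(skewness)  # assumption: left and right skew are rated alike
    if magnitude > hi_skew_thresh:
        return "high"
    if magnitude > mod_skew_thresh:
        return "moderate"
    return "low"  # hypothetical label for values under both thresholds


for s in (0.1, -0.7, 2.3):
    print(s, skew_rating(s))
```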
skutil.utils.report_confusion_matrix(actual, pred, return_metrics=True)
Return a dataframe with the confusion matrix, and a series with the classification performance metrics.
Parameters: actual : np.ndarray, shape=(n_samples,)
The array of actual values
pred : np.ndarray, shape=(n_samples,)
The array of predicted values
return_metrics : bool, optional (default=True)
Whether to return the metrics in a pd.Series. If False, index 1 of the returned tuple will be None.
Returns: conf : pd.DataFrame, shape=(2, 2)
The confusion matrix
ser : pd.Series or None
The metrics if return_metrics is True, else None
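To make the two returned pieces concrete, the sketch below builds a 2x2 confusion matrix and a few common metrics in plain Python. It is not the skutil implementation, which returns a pd.DataFrame and a pd.Series. The layout (rows are actual, columns are predicted), the choice of metrics, and the name binary_confusion are assumptions.

```python
def binary_confusion(actual, pred, positive=1):
    """Return ([[tn, fp], [fn, tp]], metrics) for two equal-length label sequences."""
    n = len(actual)
    if n != len(pred):
        raise ValueError("actual and pred must have the same length")
    pairs = list(zip(actual, pred))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    metrics = {
        "accuracy": (tp + tn) / n if n else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
    return [[tn, fp], [fn, tp]], metrics


actual = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
conf, metrics = binary_confusion(actual, pred)
print(conf)
print(metrics)
```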
skutil.utils.report_grid_score_detail(random_search, charts=True, sort_results=True, ascending=True, percentile=0.975, y_axis='mean_test_score', sort_by='mean_test_score', highlight_best=True, highlight_col='red', def_color='blue', return_drops=False)
Return plots and a dataframe of results, given a fitted grid search. Note that if Matplotlib is not installed, a warning will be issued and no plots will be generated.
Parameters: random_search : BaseSearchCV or BaseH2OSearchCV
The fitted grid search
charts : bool, optional (default=True)
Whether to plot the charts
sort_results : bool, optional (default=True)
Whether to sort the results based on score
ascending : bool, optional (default=True)
If sort_results is True, whether to sort in ascending or descending order.
percentile : float, optional (default=0.975)
The percentile point (0 < percentile < 1.0). The corresponding z-score will be multiplied by the cross validation score standard deviations.
y_axis : str, optional (default=’mean_test_score’)
The y-axis of the charts. One of (‘score’,’std’)
sort_by : str, optional (default='mean_test_score')
The column to sort by. This is not validated, in case the user wants to sort by a parameter column. If sort_results is False, this is unused.
highlight_best : bool, optional (default=True)
If True, and charts and sort_results are also True, highlights the point in the top position of the model DF.
highlight_col : str, optional (default='red')
What color to use for highlight_best if both charts and highlight_best are True. If either is False, this is unused.
def_color : str, optional (default='blue')
What color to use for the points if charts is True. This should differ from highlight_col, but no validation is performed.
return_drops : bool, optional (default=False)
If True, will return the list of names that can be dropped out (i.e., were generated by sklearn and are not parameters of interest).
Returns: result_df : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)
The grid search results
drops : list
List of sklearn-generated names. Only returned if return_drops is True.
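The percentile parameter is described as being converted to a z-score that scales the cross-validation standard deviations. A minimal sketch of that arithmetic follows, using the standard library's normal distribution. The sample scores are made up for illustration, and how skutil draws the resulting band in its charts is not stated in the source.

```python
from statistics import NormalDist

percentile = 0.975
# Inverse CDF of the standard normal: about 1.96 for the 97.5th percentile.
z = NormalDist().inv_cdf(percentile)

# Illustrative (made-up) cross-validation results for three parameter settings.
mean_test_score = [0.81, 0.84, 0.79]
std_test_score = [0.02, 0.05, 0.01]

# Half-width of the band around each mean score: z * std.
half_widths = [z * s for s in std_test_score]
bands = [(m - h, m + h) for m, h in zip(mean_test_score, half_widths)]

for m, (lo, hi) in zip(mean_test_score, bands):
    print("%.2f -> [%.4f, %.4f]" % (m, lo, hi))
```

With the default of 0.975 the multiplier is about 1.96, so each band spans roughly two standard deviations on either side of the mean score.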
skutil.utils.shuffle_dataframe(X)
Shuffle the rows in a data frame without replacement. The random state used for shuffling is controlled by numpy's random state.
Parameters: X : pd.DataFrame, shape=(n_samples, n_features)
The dataframe to shuffle
skutil.utils.validate_is_pd(X, cols, assert_all_finite=False)
Used within each SelectiveMixin fit method to determine whether the passed X is a dataframe, and whether cols is appropriate. There are four scenarios (in the order in which they're checked):
- cols is not None, but X is not a DataFrame.
Resolution: the method will attempt to return a DataFrame from the args provided (with default names), but catches any exception and raises a ValueError. A common case where this would work may be a numpy.ndarray as X, and a list as cols (where the list is either int indices or default names that the dataframe will take on).
- X is a DataFrame, but cols is None.
Resolution: return a copy of the dataframe, and use all column names.
- X is a DataFrame and cols is not None.
Resolution: return a copy of the dataframe, and use only the names provided. This is the typical use case.
- X is not a DataFrame, and cols is None.
Resolution: this case will only work if the X can be built into a DataFrame. Otherwise, there will be a ValueError thrown.
Parameters: X : array_like, shape=(n_samples, n_features)
The dataframe to validate. If X is not a DataFrame, but it can be made into one, no exceptions will be raised. However, if X cannot naturally be made into a DataFrame, a TypeError will be raised.
cols : array_like (str), shape=(n_features,)
The list of column names. Used particularly in SelectiveMixin transformers that validate column names.
assert_all_finite : bool, optional (default=False)
If True, will raise an AssertionError if any np.nan or np.inf values reside in X.
Returns: X : pd.DataFrame, shape=(n_samples, n_features)
A copy of the original input X
cols : list or None, shape=(n_features,)
If cols was not None and did not raise a TypeError, it is converted into a list of strings and returned as a copy. Else None.