skutil.utils module

skutil.utils provides common utility functionality for the skutil library. skutil.utils.fixes adds adaptations that bridge the behavioral differences between scikit-learn 0.17 and 0.18. skutil.utils.metaestimators adapts scikit-learn’s metaestimator for more specific use within skutil.

skutil.utils.corr_plot(X, plot_type='cor', cmap='Blues_d', n_levels=5, corr=None, method='pearson', figsize=(11, 9), cmap_a=220, cmap_b=10, vmax=0.3, xticklabels=5, yticklabels=5, linewidths=0.5, cbar_kws=None)[source]

Create a simple correlation plot given a dataframe. Note that this requires all datatypes to be numeric and finite!

Parameters:

X : pd.DataFrame, shape=(n_samples, n_features)

The pandas DataFrame on which to compute correlations, or if corr is ‘precomputed’, the correlation matrix. In the case that X is a correlation matrix, it must be square, i.e., shape=(n_features, n_features).

plot_type : str, optional (default=’cor’)

The type of plot, one of (‘cor’, ‘kde’, ‘pair’)

cmap : str, optional (default=’Blues_d’)

The colormap to use for the kernel density estimate plot if plot_type == ‘kde’. Otherwise unused.

n_levels : int, optional (default=5)

The number of levels to use for the kde plot if plot_type == ‘kde’. Otherwise unused.

corr : ‘precomputed’ or None, optional (default=None)

If None, the correlation matrix is computed, otherwise if ‘precomputed’, X is treated as a correlation matrix.

method : str, optional (default=’pearson’)

The method to use for correlation

figsize : tuple (int), shape=(w,h), optional (default=(11,9))

The size of the image

cmap_a : int, optional (default=220)

The colormap start point

cmap_b : int, optional (default=10)

The colormap end point

vmax : float, optional (default=0.3)

The vmax argument passed to seaborn’s heatmap.

xticklabels : int, optional (default=5)

The spacing for X ticks

yticklabels : int, optional (default=5)

The spacing for Y ticks

linewidths : float, optional (default=0.5)

The width of the lines

cbar_kws : dict, optional (default=None)

Any KWs to pass to seaborn’s heatmap when plot_type = ‘cor’. If None, will default to {‘shrink’: 0.5}

skutil.utils.df_memory_estimate(X, unit='MB', index=False)[source]

Estimate the memory footprint of a frame to determine whether it can be held in memory.

Parameters:

X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)

The DataFrame in question

unit : str, optional (default=’MB’)

The units to report. One of (‘MB’, ‘KB’, ‘GB’, ‘TB’)

index : bool, optional (default=False)

Whether to also estimate the memory footprint of the index.

Returns:

mb : str

The estimated size of the frame, expressed in the requested unit.

skutil.utils.dict_keys(d)[source]

In Python 3, the d.keys() method returns a view and not an actual list. This function returns the keys of d as a list.

Parameters:

d : dict

The dictionary

Returns:

list :

skutil.utils.dict_values(d)[source]

In Python 3, the d.values() method returns a view and not an actual list. This function returns the values of d as a list.

Parameters:

d : dict

The dictionary

Returns:

list :
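The behavior described for dict_keys and dict_values can be sketched as follows. This is a minimal illustration, assuming both functions simply materialize the view with list(); the names keys_as_list and values_as_list are hypothetical stand-ins, not skutil's source.

```python
def keys_as_list(d):
    """Return the keys of ``d`` as a real list (hypothetical stand-in)."""
    return list(d.keys())


def values_as_list(d):
    """Return the values of ``d`` as a real list (hypothetical stand-in)."""
    return list(d.values())


d = {'a': 1, 'b': 2}
view = d.keys()             # a dynamic view in Python 3, not a list
keys = keys_as_list(d)      # a snapshot that supports indexing
values = values_as_list(d)
d['c'] = 3                  # the view sees the new key; the snapshots do not
```

The practical difference is that a list can be indexed and is unaffected by later changes to the dictionary, whereas a view cannot be indexed and tracks the dictionary as it changes.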

skutil.utils.exp(x)[source]

A safe mechanism for computing the exponential function while avoiding overflows.

Parameters:

x : float, number

The number for which to compute the exp

Returns:

exp(x) :
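One way an overflow-safe exponential might be written is sketched below. The documentation does not say which value is returned when the result would overflow, so the fallback to the largest finite float is an assumption, and safe_exp is a hypothetical name.

```python
import math
import sys


def safe_exp(x):
    """Hypothetical stand-in: exponential that never raises OverflowError."""
    try:
        return math.exp(x)
    except OverflowError:
        # Assumption: fall back to the largest representable finite float.
        return sys.float_info.max


small = safe_exp(1.0)       # ordinary inputs are unaffected
large = safe_exp(10000.0)   # math.exp(10000.0) alone raises OverflowError
```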

skutil.utils.flatten_all(container)[source]

Recursively flattens an arbitrarily nested iterable. WARNING: this function may produce a list of mixed types.

Parameters:

container : array_like, shape=(n_items,)

The iterable to flatten. If the container is not iterable, it will be returned in a list as [container]

Returns:

l : list, shape=(n_items,)

The flattened list

Examples

The example below produces a list of mixed results:

>>> a = [[[],3,4],['1','a'],[[[1]]],1,2]
>>> flatten_all(a)
[3, 4, '1', 'a', 1, 1, 2]

skutil.utils.flatten_all_generator(container)[source]

Recursively flattens an arbitrarily nested iterable, yielding its items lazily. WARNING: this function may yield items of mixed types.

Parameters:

container : array_like, shape=(n_items,)

The iterable to flatten. If the container is not iterable, it will be returned in a list as [container]

Returns:

generator object :

Examples

The example below produces a list of mixed results:

>>> a = [[[],3,4],['1','a'],[[[1]]],1,2]
>>> list(flatten_all_generator(a)) # consume the generator into a list
[3, 4, '1', 'a', 1, 1, 2]
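
The recursive flattening described above can be sketched with a generator. This is an illustrative sketch rather than skutil's implementation; flatten_gen is a hypothetical name, and treating strings as leaves is an assumption made so that the result matches the documented example.

```python
def flatten_gen(container):
    """Yield the leaves of an arbitrarily nested iterable (hypothetical sketch)."""
    # Strings are iterable in Python 3, but they are treated as leaves here so
    # that '1' and 'a' are kept whole and recursion terminates.
    if isinstance(container, str) or not hasattr(container, '__iter__'):
        yield container
        return
    for item in container:
        # Recurse into each element and re-yield its leaves.
        yield from flatten_gen(item)


a = [[[], 3, 4], ['1', 'a'], [[[1]]], 1, 2]
flat = list(flatten_gen(a))     # materialize the generator into a list
single = list(flatten_gen(5))   # a non-iterable input yields just itself
```

Wrapping the generator in list() gives the eager behavior of flatten_all, while iterating over it directly avoids building the whole list in memory.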

skutil.utils.get_numeric(X)[source]

Return a list of the indices of the variables (columns) that have numeric dtypes.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The dataframe

Returns:

list, int :

The list of indices which are numeric.

skutil.utils.human_bytes(b, unit='MB')[source]

Express a number of bytes in a human-readable form.

Parameters:

b : int

The number of bytes

unit : str, optional (default=’MB’)

The units to report. One of (‘MB’, ‘KB’, ‘GB’, ‘TB’)

Returns:

mb : str

The number of bytes, expressed in the requested unit.
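
A unit conversion of this kind can be sketched as below. The sketch assumes binary multiples (1 KB = 1024 bytes) and a three-decimal output format; neither is stated in the documentation, and bytes_to_unit is a hypothetical name.

```python
# Assumption: binary multiples, so 1 KB = 1024 bytes. skutil may differ.
_UNIT_POWERS = {'KB': 1, 'MB': 2, 'GB': 3, 'TB': 4}


def bytes_to_unit(b, unit='MB'):
    """Hypothetical stand-in: express ``b`` bytes in ``unit`` as a string."""
    unit = unit.upper()
    if unit not in _UNIT_POWERS:
        raise ValueError('unit must be one of %s' % sorted(_UNIT_POWERS))
    value = b / float(1024 ** _UNIT_POWERS[unit])
    # Assumption: three decimal places followed by the unit label.
    return '%.3f %s' % (value, unit)


report = bytes_to_unit(5 * 1024 ** 2, unit='MB')
```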

skutil.utils.is_entirely_numeric(X)[source]

Determines whether an entire pandas frame is numeric in dtypes.

Parameters:

X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)

The dataframe to test

Returns:

bool :

True if the entire pd.DataFrame is numeric else False

skutil.utils.is_float(x)[source]

Determine whether some object x is a float type (float, np.float, etc).

Parameters:

x : object

The item to assess

Returns:

bool :

True if x is a float type

skutil.utils.is_integer(x)[source]

Determine whether some object x is an integer type (int, long, etc).

Parameters:

x : object

The item to assess

Returns:

bool :

True if x is an integer type

skutil.utils.is_iterable(x)[source]

Determine whether some object x is an iterable. Python 3.x adds the __iter__ attribute to strings, so a test for iterability that relies only on hasattr gives a misleading result for them.

Parameters:

x : object

The object or primitive to test whether or not is an iterable.

Returns:

bool :

True if x is an iterable
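
The issue with hasattr can be seen in a short sketch. It assumes that strings are meant to be reported as non-iterable, which is what the description implies; looks_iterable is a hypothetical name.

```python
def looks_iterable(x):
    """Hypothetical stand-in: True for iterables, but False for strings."""
    # Assumption: strings are excluded explicitly, because in Python 3 they
    # expose __iter__ and would otherwise pass the hasattr check.
    if isinstance(x, str):
        return False
    return hasattr(x, '__iter__')


naive = hasattr('abc', '__iter__')      # True in Python 3
checked = looks_iterable('abc')         # False: strings are excluded
```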

skutil.utils.is_numeric(x)[source]

Determine whether some object x is a numeric type (float, int, etc).

Parameters:

x : object

The item to assess

Returns:

bool :

True if x is a float or integer type
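
The relationship between is_integer, is_float and is_numeric can be sketched with the standard library alone. The skutil versions also accept NumPy types, which this sketch does not attempt to cover; the function names are hypothetical.

```python
import numbers


def check_integer(x):
    """Hypothetical stand-in: True for integer types."""
    # Note: bool is a subclass of int, so True and False also pass this check.
    return isinstance(x, numbers.Integral)


def check_float(x):
    """Hypothetical stand-in: True for the built-in float and its subclasses."""
    return isinstance(x, float)


def check_numeric(x):
    """Numeric means either an integer type or a float type."""
    return check_integer(x) or check_float(x)
```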

skutil.utils.load_boston_df(include_tgt=True, tgt_name='target', shuffle=False)[source]

Loads the Boston housing dataset into a dataframe with the target set as the “target” feature or whatever name is specified in tgt_name.

Parameters:

include_tgt : bool, optional (default=True)

Whether to include the target

tgt_name : str, optional (default=”target”)

The name of the target feature

shuffle : bool, optional (default=False)

Whether to shuffle the rows

Returns:

X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)

The loaded dataset

skutil.utils.load_breast_cancer_df(include_tgt=True, tgt_name='target', shuffle=False)[source]

Loads the breast cancer dataset into a dataframe with the target set as the “target” feature or whatever name is specified in tgt_name.

Parameters:

include_tgt : bool, optional (default=True)

Whether to include the target

tgt_name : str, optional (default=”target”)

The name of the target feature

shuffle : bool, optional (default=False)

Whether to shuffle the rows

Returns:

X : pd.DataFrame, shape=(n_samples, n_features)

The loaded dataset

skutil.utils.load_iris_df(include_tgt=True, tgt_name='Species', shuffle=False)[source]

Loads the iris dataset into a dataframe with the target set as the “Species” feature or whatever name is specified in tgt_name.

Parameters:

include_tgt : bool, optional (default=True)

Whether to include the target

tgt_name : str, optional (default=”Species”)

The name of the target feature

shuffle : bool, optional (default=False)

Whether to shuffle the rows on return

Returns:

X : pd.DataFrame, shape=(n_samples, n_features)

The loaded dataset

skutil.utils.log(x)[source]

A safe mechanism for computing a log while avoiding NaNs or exceptions.

Parameters:

x : float, number

The number for which to compute the log

Returns:

log(x) :
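
One possible reading of a "safe" log is sketched below. The documentation does not state what is returned for zero or negative inputs, so clamping to the smallest positive normal float is an assumption, and safe_log is a hypothetical name.

```python
import math
import sys


def safe_log(x):
    """Hypothetical stand-in: natural log that never raises on x <= 0."""
    # math.log(0.0) and math.log(-1.0) raise ValueError. Assumption: clamp
    # such inputs to the smallest positive normal float instead.
    return math.log(max(x, sys.float_info.min))


ordinary = safe_log(1.0)    # unaffected by the clamp
clamped = safe_log(0.0)     # a large negative but finite number
```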

skutil.utils.pd_stats(X, col_type='all', na_str='--', hi_skew_thresh=1.0, mod_skew_thresh=0.5)[source]

Get a descriptive report of the elements in the data frame. Builds on existing pandas describe method by adding counts of factor-level features, a skewness rating and several other helpful statistics.

Parameters:

X : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)

The DataFrame on which to compute stats.

col_type : str, optional (default=’all’)

The types of columns to analyze. One of (‘all’, ‘numeric’, ‘object’). If not all, will only return corresponding typed columns.

na_str : str, optional (default=’--’)

The string to display in a cell that is not applicable for the column’s datatype.

hi_skew_thresh : float, optional (default=1.0)

The threshold above which a skewness rating will be deemed “high.”

mod_skew_thresh : float, optional (default=0.5)

The threshold above which a skewness rating will be deemed “moderate,” so long as it does not exceed hi_skew_thresh
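
The two thresholds can be read as a simple rating rule, sketched below. The use of the absolute value and the 'low' label for everything else are assumptions; only the "high" and "moderate" ratings are named in the parameter descriptions, and skew_rating is a hypothetical name.

```python
def skew_rating(skewness, hi_skew_thresh=1.0, mod_skew_thresh=0.5):
    """Hypothetical stand-in: map a skewness value to a rating."""
    magnitude = abs(skewness)   # assumption: the sign of the skew is ignored
    if magnitude > hi_skew_thresh:
        return 'high'
    if magnitude > mod_skew_thresh:
        # Above the moderate threshold but not above the high threshold.
        return 'moderate'
    return 'low'                # assumption: label for everything else
```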

Returns:

s : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)

The resulting stats dataframe

skutil.utils.report_confusion_matrix(actual, pred, return_metrics=True)[source]

Return a dataframe with the confusion matrix, and a series with the classification performance metrics.

Parameters:

actual : np.ndarray, shape=(n_samples,)

The array of actual values

pred : np.ndarray, shape=(n_samples,)

The array of predicted values

return_metrics : bool, optional (default=True)

Whether to return the metrics in a pd.Series. If False, index 1 of the returned tuple will be None.

Returns:

conf : pd.DataFrame, shape=(2, 2)

The confusion matrix

ser : pd.Series or None

The metrics if return_metrics else None
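
For a binary problem, the contents of a (2, 2) confusion matrix and a few common metrics can be computed as in the sketch below. Which metrics skutil actually reports is not stated here, so accuracy, precision and recall are illustrative choices, and both function names are hypothetical.

```python
def confusion_counts(actual, pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, pred):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def basic_metrics(tp, fp, fn, tn):
    """Illustrative metrics; the set reported by skutil may differ."""
    total = tp + fp + fn + tn
    return {
        'accuracy': (tp + tn) / float(total),
        'precision': tp / float(tp + fp) if (tp + fp) else 0.0,
        'recall': tp / float(tp + fn) if (tp + fn) else 0.0,
    }


actual = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(actual, pred)
metrics = basic_metrics(*counts)
```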

skutil.utils.report_grid_score_detail(random_search, charts=True, sort_results=True, ascending=True, percentile=0.975, y_axis='mean_test_score', sort_by='mean_test_score', highlight_best=True, highlight_col='red', def_color='blue', return_drops=False)[source]

Return plots and a dataframe of results, given a fitted grid search. Note that if Matplotlib is not installed, a warning will be issued and no plots will be generated.

Parameters:

random_search : BaseSearchCV or BaseH2OSearchCV

The fitted grid search

charts : bool, optional (default=True)

Whether to plot the charts

sort_results : bool, optional (default=True)

Whether to sort the results based on score

ascending : bool, optional (default=True)

If sort_results is True, whether to sort in ascending (True) or descending (False) order.

percentile : float, optional (default=0.975)

The percentile point (0 < percentile < 1.0). The corresponding z-score will be multiplied by the cross validation score standard deviations.
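
The percentile-to-z-score step can be sketched with the standard library. This assumes a standard normal distribution is used for the conversion, which the description suggests but does not state outright; error_band is a hypothetical name.

```python
from statistics import NormalDist


def error_band(std_scores, percentile=0.975):
    """Hypothetical stand-in: scale each CV score std by the z-score."""
    # Inverse CDF of the standard normal gives the z-score for the percentile.
    z = NormalDist().inv_cdf(percentile)
    return [z * s for s in std_scores]


z = NormalDist().inv_cdf(0.975)     # roughly 1.96
band = error_band([0.01, 0.02])
```

With the default of 0.975, the z-score is about 1.96, so the resulting band corresponds to a two-sided 95% interval around each mean score under a normality assumption.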

y_axis : str, optional (default=’mean_test_score’)

The y-axis of the charts. One of (‘score’,’std’)

sort_by : str, optional (default=’mean_test_score’)

The column to sort by. This is not validated, in case the user wants to sort by a parameter column. If not sort_results, this is unused.

highlight_best : bool, optional (default=True)

If set to True, charts is True, and sort_results is also True, then highlights the point in the top position of the model DF.

highlight_col : str, optional (default=’red’)

What color to use for highlight_best if both charts and highlight_best. If either is False, this is unused.

def_color : str, optional (default=’blue’)

What color to use for the points if charts is True. This should differ from highlight_col, but no validation is performed.

return_drops : bool, optional (default=False)

If True, will return the list of names that can be dropped out (i.e., were generated by sklearn and are not parameters of interest).

Returns:

result_df : Pandas DataFrame or H2OFrame, shape=(n_samples, n_features)

The grid search results

drops : list

List of sklearn-generated names. Only returned if return_drops is True.

skutil.utils.shuffle_dataframe(X)[source]

Shuffle the rows in a data frame without replacement. The random state used for shuffling is controlled by numpy’s random state.

Parameters:

X : pd.DataFrame, shape=(n_samples, n_features)

The dataframe to shuffle

skutil.utils.validate_is_pd(X, cols, assert_all_finite=False)[source]

Used within each SelectiveMixin fit method to determine whether the passed X is a dataframe, and whether the cols is appropriate. There are four scenarios (in the order in which they’re checked):

  1. cols is not None, but X is not a DataFrame.

    Resolution: the method will attempt to return a DataFrame from the args provided (with default names), but catches any exception and raises a ValueError. A common case where this would work may be a numpy.ndarray as X, and a list as cols (where the list is either int indices or default names that the dataframe will take on).

  2. X is a DataFrame, but cols is None.

    Resolution: return a copy of the dataframe, and use all column names.

  3. X is a DataFrame and cols is not None.

    Resolution: return a copy of the dataframe, and use only the names provided. This is the typical use case.

  4. X is not a DataFrame, and cols is None.

    Resolution: this case will only work if the X can be built into a DataFrame. Otherwise, there will be a ValueError thrown.

Parameters:

X : array_like, shape=(n_samples, n_features)

The dataframe to validate. If X is not a DataFrame, but it can be made into one, no exceptions will be raised. However, if X cannot naturally be made into a DataFrame, a TypeError will be raised.

cols : array_like (str), shape=(n_features,)

The list of column names. Used particularly in SelectiveMixin transformers that validate column names.

assert_all_finite : bool, optional (default=False)

If True, will raise an AssertionError if any np.nan or np.inf values reside in X.

Returns:

X : pd.DataFrame, shape=(n_samples, n_features)

A copy of the original input X

cols : list or None, shape=(n_features,)

If cols was not None and did not raise a TypeError, it is converted into a list of strings and returned as a copy. Else None.