skutil.preprocessing module

Provides sklearn-esque transformer classes including the Box-Cox transformation and the Yeo-Johnson transformation. Also includes selective scalers and other transformers.

class skutil.preprocessing.BaggedCategoricalImputer(cols=None, base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, as_df=True, fill=None)[source]

Bases: skutil.preprocessing.impute._BaseBaggedImputer

Performs imputation on select columns by using BaggingClassifiers on the provided columns.

cols
: array_like, optional (default=None)
The columns on which the transformer will be fit. In the case that cols is None, the transformer will be fit on all columns. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
base_estimator
: object or None, optional (default=None)
The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
n_estimators
: int, optional (default=10)
The number of base estimators in the ensemble.
max_samples
: int or float, optional (default=1.0)
The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
max_features
: int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
bootstrap
: boolean, optional (default=True)
Whether samples are drawn with replacement.
bootstrap_features
: boolean, optional (default=True)
Whether features are drawn with replacement.
oob_score
: bool, optional (default=False)
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs
: int, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
random_state
: int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
verbose
: int, optional (default=0)
Controls the verbosity of the building process.
as_df
: bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
fill
: int, optional (default=None)
The fill to use for missing values in the training matrix when fitting a BaggingClassifier. If None, will default to -999999.
Attributes:

models_ : dict

A dictionary mapping column names to the fit bagged estimator.
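
The int-or-float convention described above for max_samples and max_features can be sketched as follows. This is only an illustration, not the library's code: the helper name resolve_count is hypothetical, and truncating the float case toward zero is an assumption.

```python
def resolve_count(value, total):
    """Turn an int-or-float setting into an absolute count.

    Hypothetical helper: an int is used as-is, a float is read as a
    fraction of `total` (truncation toward zero is an assumption).
    """
    if isinstance(value, bool):
        raise TypeError("value must be an int or a float, not a bool")
    if isinstance(value, int):
        count = value
    elif isinstance(value, float):
        count = int(value * total)
    else:
        raise TypeError("value must be an int or a float")
    if not 0 < count <= total:
        raise ValueError("resolved count must be in (0, total]")
    return count


n_rows, n_cols = 150, 4
print(resolve_count(1.0, n_rows))  # max_samples=1.0: every row
print(resolve_count(0.5, n_cols))  # max_features=0.5: half of the columns
print(resolve_count(25, n_rows))   # an int is used directly
```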

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.preprocessing import BaggedCategoricalImputer
>>>
>>> nan = np.nan
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1.0,  nan,  4.0],
...                                 [nan,  1.0,  nan],
...                                 [2.0,  2.0,  3.0]]), 
...                               columns=['a','b','c'])
>>> imputer = BaggedCategoricalImputer(random_state=42)
>>> imputer.fit_transform(X)
     a    b    c
0  1.0  2.0  4.0
1  2.0  1.0  4.0
2  2.0  2.0  3.0

Methods

fit(X[, y]) Fit the bagged imputer.
fit_transform(X[, y]) Fit the bagged imputer and return the transformed (imputed) matrix.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Impute the test data after fit.
class skutil.preprocessing.BaggedImputer(cols=None, base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, as_df=True, fill=None)[source]

Bases: skutil.preprocessing.impute._BaseBaggedImputer

Performs imputation on select columns by using BaggingRegressors on the provided columns.

cols
: array_like, optional (default=None)
The columns on which the transformer will be fit. In the case that cols is None, the transformer will be fit on all columns. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
base_estimator
: object or None, optional (default=None)
The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
n_estimators
: int, optional (default=10)
The number of base estimators in the ensemble.
max_samples
: int or float, optional (default=1.0)
The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
max_features
: int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
bootstrap
: boolean, optional (default=True)
Whether samples are drawn with replacement.
bootstrap_features
: boolean, optional (default=True)
Whether features are drawn with replacement.
oob_score
: bool, optional (default=False)
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs
: int, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
random_state
: int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
verbose
: int, optional (default=0)
Controls the verbosity of the building process.
as_df
: bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
fill
: int, optional (default=None)
The fill to use for missing values in the training matrix when fitting a BaggingRegressor. If None, will default to -999999.
Attributes:

models_ : dict

A dictionary mapping column names to the fit bagged estimator.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.preprocessing import BaggedImputer
>>>
>>> nan = np.nan
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1.0,  nan,  3.1],
...                                 [nan,  2.3,  nan],
...                                 [2.1,  2.1,  3.1]]), 
...                               columns=['a','b','c'])
>>> imputer = BaggedImputer(random_state=42)
>>> imputer.fit_transform(X)
       a     b    c
0  1.000  2.16  3.1
1  1.715  2.30  3.1
2  2.100  2.10  3.1

Methods

fit(X[, y]) Fit the bagged imputer.
fit_transform(X[, y]) Fit the bagged imputer and return the transformed (imputed) matrix.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Impute the test data after fit.
class skutil.preprocessing.BoxCoxTransformer(cols=None, n_jobs=1, as_df=True, shift_amt=1e-06)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

Estimate a lambda parameter for each feature, and transform it to a distribution more closely resembling a Gaussian bell curve using the Box-Cox transformation.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

n_jobs : int, optional (default=1)

The number of jobs to use for the computation. This works by estimating each of the feature lambdas in parallel.

If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

shift_amt : float, optional (default=1e-6)

Since the Box-Cox transformation requires that all values be positive (above zero), any features that contain sub-zero elements will be shifted up by the absolute value of the minimum element plus this amount in the fit method. In the transform method, if any of the test data is less than zero after shifting, it will be truncated at the shift_amt value.

Attributes:

shift_ : dict

The amount by which each feature is shifted so that its minimum value lies above 0.0, since the Box-Cox transformation requires every element to be positive.

lambda_ : dict

The lambda values corresponding to each feature
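
The n_jobs rule quoted above (-1 means all CPUs, -2 means all but one) can be sketched as a small function. The name resolve_n_jobs is hypothetical and this is not the library's implementation; it only restates the (n_cpus + 1 + n_jobs) arithmetic.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Resolve an n_jobs setting to a concrete worker count.

    Hypothetical helper; `n_cpus` defaults to os.cpu_count().
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for setting in (1, -1, -2):
    print(setting, "->", resolve_n_jobs(setting, n_cpus=8))
```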

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.
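
The shifting and truncation behaviour described for shift_amt can be sketched in plain Python. The function names below are hypothetical, lambda is passed in rather than estimated, and treating a minimum of exactly zero as needing a shift is an assumption; this is an illustration of the idea rather than the transformer's actual code.

```python
import math


def fit_shift(values, shift_amt=1e-6):
    """Shift needed to make every training value strictly positive."""
    lowest = min(values)
    # Assumption: a feature is shifted whenever its minimum is <= 0.
    return abs(lowest) + shift_amt if lowest <= 0 else 0.0


def box_cox(values, lam, shift=0.0, shift_amt=1e-6):
    """Apply the Box-Cox formula after shifting and truncating."""
    out = []
    for v in values:
        v = v + shift
        if v <= 0:
            # Unseen values that are still non-positive are truncated.
            v = shift_amt
        out.append(math.log(v) if lam == 0 else (v ** lam - 1.0) / lam)
    return out


train = [-2.0, 0.0, 3.0]
shift = fit_shift(train)  # about 2.000001
print(box_cox(train, lam=0.5, shift=shift))
print(box_cox([-5.0], lam=0.5, shift=shift))  # truncated at shift_amt
```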

class skutil.preprocessing.FunctionMapper(cols=None, fun=None, **kwargs)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

Apply a function to a column or set of columns.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.

fun : function, optional (default=None)

The function to apply to the feature(s). This function will be applied via lambda expression to each column (independent of one another). Therefore, the callable should accept an array-like argument.

Attributes:

is_fit_ : bool

The FunctionMapper callable is set in the constructor, but to remain true to the sklearn API, we need to ensure fit is called prior to transform. Thus, we set this attribute in the fit method, which performs some validation, to ensure the fun parameter has been validated.

Examples

The following example applies an approximate cube-root transformation (an exponent of 0.333) to the first two columns in the iris dataset.

>>> from skutil.preprocessing import FunctionMapper
>>> from skutil.utils import load_iris_df
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> X = load_iris_df(include_tgt=False)
>>> 
>>> # define the function
>>> def cube_root(x):
...     return np.power(x, 0.333)
>>>
>>> # make our transformer
>>> trans = FunctionMapper(cols=X.columns[:2], fun=cube_root)
>>> trans.fit_transform(X).head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           1.720366          1.517661                1.4               0.2
1           1.697600          1.441722                1.4               0.2
2           1.674205          1.473041                1.3               0.2
3           1.662258          1.457550                1.5               0.2
4           1.709059          1.531965                1.4               0.2

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.
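
The fit-then-transform contract described for is_fit_ can be sketched with a toy class. ColumnFunctionMapper is a hypothetical stand-in that works on a dict of column lists instead of a DataFrame, so it shows the pattern (validate in fit, apply the callable per column on a copy) and not the library's implementation.

```python
class ColumnFunctionMapper:
    """Toy stand-in (hypothetical) operating on a dict of column lists."""

    def __init__(self, cols=None, fun=None):
        self.cols = cols
        self.fun = fun

    def fit(self, X, y=None):
        # Validation happens here, which is why transform requires fit.
        if not callable(self.fun):
            raise ValueError("fun must be callable")
        missing = [c for c in (self.cols or []) if c not in X]
        if missing:
            raise ValueError("unknown columns: %r" % missing)
        self.is_fit_ = True
        return self

    def transform(self, X):
        if not getattr(self, "is_fit_", False):
            raise RuntimeError("call fit before transform")
        cols = list(X) if self.cols is None else self.cols
        out = {name: list(values) for name, values in X.items()}  # copy
        for name in cols:
            out[name] = self.fun(out[name])  # one call per column
        return out


def halve(column):
    return [v / 2.0 for v in column]


X = {"a": [2.0, 4.0], "b": [6.0, 8.0], "c": [1.0, 1.0]}
mapper = ColumnFunctionMapper(cols=["a", "b"], fun=halve)
print(mapper.fit(X).transform(X))
# {'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [1.0, 1.0]}
```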

class skutil.preprocessing.ImputerMixin[source]

A mixin for all imputer classes. Contains the default fill value. This mixin is used for the H2O imputer, as well.

Attributes:

_def_fill : int (default=-999999)

The default fill value for NaN values

class skutil.preprocessing.InteractionTermTransformer(cols=None, as_df=True, interaction_function=None, name_suffix='I', only_return_interactions=False)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

A class that will generate interaction terms between selected columns. An interaction captures some relationship between two independent variables in the form of I_n = (x_i * x_j).

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

interaction_function : callable, optional (default=None)

A callable for interactions. The default, None, results in the multiplication of two Series objects.

name_suffix : str, optional (default='I')

The suffix to add to the new feature name in the form of <feature_x>_<feature_y>_<suffix>

only_return_interactions : bool, optional (default=False)

If set to True, will only return the features in cols and their respective generated interaction terms.

Attributes:

fun_ : callable

The interaction term function

Examples

The following example interacts the first two columns of the iris dataset using the default _mul function (product).

>>> from skutil.preprocessing import InteractionTermTransformer
>>> from skutil.utils import load_iris_df
>>> import pandas as pd
>>> 
>>> X = load_iris_df(include_tgt=False)
>>>
>>> trans = InteractionTermTransformer(cols=X.columns[:2])
>>> X_transform = trans.fit_transform(X)
>>>
>>> assert X_transform.shape[1] == X.shape[1] + 1 # only added one column
>>> X_transform[X_transform.columns[-1]].head()
0    17.85
1    14.70
2    15.04
3    14.26
4    18.00
Name: sepal length (cm)_sepal width (cm)_I, dtype: float64

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.
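
The naming scheme and the default product interaction can be sketched without any dependencies. The function interaction_terms is hypothetical, uses a dict of column lists in place of a DataFrame, and assumes that every pairwise combination of cols produces one term; it illustrates the idea and is not the transformer's code.

```python
from itertools import combinations


def _mul(a, b):
    """Default interaction: the product of the two values."""
    return a * b


def interaction_terms(X, cols, fun=None, name_suffix="I"):
    """Append pairwise interaction columns to a dict of column lists."""
    fun = _mul if fun is None else fun
    out = {name: list(values) for name, values in X.items()}  # copy
    for left, right in combinations(cols, 2):
        # New feature name: <feature_x>_<feature_y>_<suffix>
        name = "%s_%s_%s" % (left, right, name_suffix)
        out[name] = [fun(a, b) for a, b in zip(X[left], X[right])]
    return out


X = {"sepal length (cm)": [5.1, 4.9], "sepal width (cm)": [3.5, 3.0]}
for name, values in interaction_terms(X, cols=list(X)).items():
    print(name, values)
```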

class skutil.preprocessing.OneHotCategoricalEncoder(fill='Missing', as_df=True)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

This class achieves three things: first, it fills in any NaN values with a provided surrogate (if desired). Second, it dummies out any categorical features using one-hot encoding, with a safety feature that can handle previously unseen values, and in the transform method re-appends the dummified features to the dataframe. Finally, it returns a Pandas DataFrame or, if as_df is False, a NumPy ndarray.

Parameters:

fill : str, optional (default='Missing')

The value that will fill the missing values in the column

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

obj_cols_ : array_like

The list of object-type (categorical) features

lab_encoders_ : array_like

The label encoders

one_hot_ : an instance of a OneHotEncoder

trans_nms_ : the dummified names

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from skutil.preprocessing import OneHotCategoricalEncoder
>>>
>>> X = pd.DataFrame.from_records(data=np.array([
...                                  ['USA','RED','a'],
...                                  ['MEX','GRN','b'],
...                                  ['FRA','RED','b']]), 
...                               columns=['A','B','C'])
>>>
>>> o = OneHotCategoricalEncoder(as_df=True)
>>> o.fit_transform(X)
   A.FRA  A.MEX  A.USA  A.NA  B.GRN  B.RED  B.NA  C.a  C.b  C.NA
0    0.0    0.0    1.0   0.0    0.0    1.0   0.0  1.0  0.0   0.0
1    0.0    1.0    0.0   0.0    1.0    0.0   0.0  0.0  1.0   0.0
2    1.0    0.0    0.0   0.0    0.0    1.0   0.0  0.0  1.0   0.0

Methods

fit(X[, y]) Fit the encoder.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform X, a DataFrame, by stripping out the object columns, dummifying them, and re-appending them to the end.
fit(X, y=None)[source]

Fit the encoder.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the object columns of the dataframe.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform X, a DataFrame, by stripping out the object columns, dummifying them, and re-appending them to the end.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to transform.

Returns:

x : Pandas DataFrame or np.ndarray, shape=(n_samples, n_features)

The encoded dataframe or array
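
The handling of previously unseen values can be sketched as follows. The two functions are hypothetical and work on a dict of column lists; the "<column>.<level>" and "<column>.NA" names mirror the example output above, while the treatment of missing values (None replaced by fill before encoding) is an assumption rather than the encoder's documented behaviour.

```python
def fit_levels(X, fill="Missing"):
    """Record the sorted levels of every column (None counts as missing)."""
    return {
        name: sorted({fill if v is None else v for v in values})
        for name, values in X.items()
    }


def one_hot(X, levels, fill="Missing"):
    """Encode each column as level indicators plus a trailing NA indicator."""
    out = {}
    for name, known in levels.items():
        values = [fill if v is None else v for v in X[name]]
        for level in known:
            out["%s.%s" % (name, level)] = [float(v == level) for v in values]
        # Anything not seen during fit lands in the NA column.
        out["%s.NA" % name] = [float(v not in known) for v in values]
    return out


train = {"A": ["USA", "MEX", "FRA"]}
test = {"A": ["USA", "CAN"]}  # 'CAN' was never seen during fit
for name, column in one_hot(test, fit_levels(train)).items():
    print(name, column)
```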

class skutil.preprocessing.OversamplingClassBalancer(y, ratio=0.2, shuffle=True, as_df=True)[source]

Bases: skutil.preprocessing.balance._BaseBalancer

Oversample all of the minority classes until they are represented at the target proportion to the majority class.

Parameters:

y : str

The name of the response column. The response column must be biclass, no more or less.

ratio : float, optional (default=0.2)

The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input matrix

shuffle : bool, optional (default=True)

Whether or not to shuffle rows on return

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Examples

Consider the following example: with a ratio of 0.5, the minority classes (1, 2) will be oversampled until they are represented at a ratio of at least 0.5 * the prevalence of the majority class (0)

>>> import pandas as pd
>>> import numpy as np
>>> from skutil.preprocessing import OversamplingClassBalancer
>>> 
>>> # 100 zeros, 30 ones and 25 twos
>>> X = pd.DataFrame(np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2]), columns=['A'])
>>> sampler = OversamplingClassBalancer(y="A", ratio=0.5)
>>>
>>> X_balanced = sampler.balance(X)
>>> X_balanced['A'].value_counts().sort_index()
0.0    100
1.0     50
2.0     50
Name: A, dtype: int64

Methods

balance(X) Apply the oversampling balance operation.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
balance(X)[source]

Apply the oversampling balance operation. Oversamples the minority class to the provided ratio of minority class : majority class

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The data to balance.

Returns:

blnc : pandas DataFrame, shape=(n_samples, n_features)

The balanced dataframe. The dataframe will be explicitly shuffled if self.shuffle is True; however, if self.shuffle is False, preservation of the original, natural ordering is not guaranteed.
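
The ratio arithmetic behind the example above (100, 30 and 25 records becoming 100, 50 and 50 at ratio=0.5) can be sketched in plain Python. The oversample function is hypothetical, returns row indices rather than a DataFrame, and assumes the target count is int(ratio * majority count); it is a sketch of random oversampling with replacement, not the balancer's code.

```python
import random
from collections import Counter


def oversample(labels, ratio=0.2, seed=None):
    """Return row indices after oversampling the minority classes.

    Assumption: the target count is int(ratio * majority count).
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts.values())
    target = int(ratio * majority)
    indices = list(range(len(labels)))
    for cls, n in counts.items():
        if n < target:
            pool = [i for i, lab in enumerate(labels) if lab == cls]
            # Draw the shortfall with replacement from the existing rows.
            indices.extend(rng.choice(pool) for _ in range(target - n))
    return indices


labels = [0] * 100 + [1] * 30 + [2] * 25
balanced = [labels[i] for i in oversample(labels, ratio=0.5, seed=42)]
print(sorted(Counter(balanced).items()))  # [(0, 100), (1, 50), (2, 50)]
```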

class skutil.preprocessing.SMOTEClassBalancer(y, ratio=0.2, shuffle=True, k=3, as_df=True)[source]

Bases: skutil.preprocessing.balance._BaseBalancer

Balance a matrix with the SMOTE (Synthetic Minority Oversampling TEchnique) method. This will generate synthetic samples for the minority class(es) using K-nearest neighbors

Parameters:

y : str

The name of the response column. The response column must be biclass, no more or less.

ratio : float, optional (default=0.2)

The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input matrix, otherwise SMOTE will impute records until the target ratio is reached.

shuffle : bool, optional (default=True)

Whether or not to shuffle rows on return

k : int, optional (default=3)

The number of neighbors to use in the nearest neighbors model

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Examples

Consider the following example: with a ratio of 0.5, the minority classes (1, 2) will be oversampled until they are represented at a ratio of at least 0.5 * the prevalence of the majority class (0)

>>> import pandas as pd
>>> import numpy as np
>>> from numpy.random import RandomState
>>> from skutil.preprocessing import SMOTEClassBalancer
>>>
>>> # establish a random state
>>> prng = RandomState(42)
>>>
>>> # 100 zeros, 30 ones and 25 twos
>>> X = pd.DataFrame(np.asarray([prng.rand(155), 
...                              np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2])]).T,
...                              columns=['x', 'y'])
>>> sampler = SMOTEClassBalancer(y="y", ratio=0.5)
>>>
>>> X_balanced = sampler.balance(X)
>>> X_balanced['y'].value_counts().sort_index()
0.0    100
1.0     50
2.0     50
Name: y, dtype: int64

Methods

balance(X) Apply the SMOTE balancing operation.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
balance(X)[source]

Apply the SMOTE balancing operation. Oversamples the minority class to the provided ratio of minority class : majority class by interpolating points between each sampled point’s k-nearest neighbors.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The data to balance.

Returns:

X : pandas DataFrame, shape=(n_samples, n_features)

The balanced dataframe. The dataframe will be explicitly shuffled if self.shuffle is True; however, if self.shuffle is False, preservation of the original, natural ordering is not guaranteed.
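
The interpolation step at the heart of SMOTE can be sketched as below. Both function names are hypothetical and the neighbor search is a brute-force Euclidean sort; this shows how one synthetic point is placed on the segment between a minority sample and one of its k nearest neighbors, and is not the balancer's implementation.

```python
import random


def k_nearest(sample, others, k=3):
    """Return the k points of `others` closest to `sample` (Euclidean)."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(sample, p))
    return sorted(others, key=sq_dist)[:k]


def smote_point(sample, neighbors, rng):
    """Interpolate between `sample` and one randomly chosen neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(42)
base = minority[0]
neighbors = k_nearest(base, minority[1:], k=2)
print(smote_point(base, neighbors, rng))
```

The far-away point [5.0, 5.0] is excluded by the neighbor search, so the synthetic point always lies on one of the two short segments leaving the origin.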

class skutil.preprocessing.SafeLabelEncoder[source]

Bases: sklearn.preprocessing.label.LabelEncoder

An extension of LabelEncoder that will not throw an exception for unseen data, but will instead return a default value of 99999

Attributes:

classes_ : the classes that are encoded

Methods

fit(y) Fit label encoder
fit_transform(y) Fit label encoder and return encoded labels
get_params([deep]) Get parameters for this estimator.
inverse_transform(y) Transform labels back to original encoding.
set_params(**params) Set the parameters of this estimator.
transform(y) Perform encoding if already fit.
transform(y)[source]

Perform encoding if already fit.

Parameters:

y : array_like, shape=(n_samples,)

The array to encode

Returns:

e : array_like, shape=(n_samples,)

The encoded array
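
The fallback for unseen labels can be sketched with a dictionary lookup. SafeEncoder is a hypothetical toy class, not the library's implementation; it only shows how a default of 99999 replaces the exception a strict encoder would raise.

```python
class SafeEncoder:
    """Toy label encoder (hypothetical) with a fallback for unseen labels."""

    def __init__(self, unseen=99999):
        self.unseen = unseen

    def fit(self, y):
        self.classes_ = sorted(set(y))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, y):
        # dict.get supplies the default instead of raising KeyError
        return [self._index.get(label, self.unseen) for label in y]


enc = SafeEncoder().fit(["RED", "GRN", "RED"])
print(enc.transform(["GRN", "RED", "BLU"]))  # [0, 1, 99999]
```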

exception skutil.preprocessing.SamplingWarning[source]

Bases: exceptions.UserWarning

Custom warning used to notify the user that sub-optimal sampling behavior has occurred. For instance, performing oversampling on a minority class with only one instance will cause this warning to be issued.

class skutil.preprocessing.SelectiveImputer(cols=None, as_df=True, fill='mean')[source]

Bases: skutil.preprocessing.impute._BaseImputer

A more customizable form of sklearn's Imputer class. This class can handle more than mean, median or most common; it will also take numeric values. Moreover, it will take a vector of strategies or values with which to impute corresponding columns.

Parameters:

cols : array_like, optional (default=None)

The columns on which the transformer will be fit. In the case that cols is None, the transformer will be fit on all columns. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

fill : int, float, string or array_like, optional (default='mean')

The fill to use for missing values in the training matrix when fitting a SelectiveImputer. If None, will default to 'mean'.

Attributes:

fills_ : iterable, int or float

The imputer fill-values

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.preprocessing import SelectiveImputer
>>>
>>> nan = np.nan
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1.0,  nan,  3.1],
...                                 [nan,  2.3,  nan],
...                                 [2.1,  2.1,  3.1]]), 
...                               columns=['a','b','c'])
>>> imputer = SelectiveImputer(fill=['mean', -999, 'mode'])
>>> imputer.fit_transform(X)
      a      b    c
0  1.00 -999.0  3.1
1  1.55    2.3  3.1
2  2.10    2.1  3.1

Methods

fit(X[, y]) Fit the imputer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a dataframe given the fit imputer.
fit(X, y=None)[source]

Fit the imputer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a dataframe given the fit imputer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to transform.

Returns:

X : pd.DataFrame or np.ndarray

The imputed matrix
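
The per-column mix of strategies and constants can be sketched with the standard library. The functions fit_fills and impute are hypothetical and operate on a dict of column lists; the strategy names follow the class description above, and the sketch is not the imputer's actual code.

```python
import math
import statistics


def fit_fills(X, fill="mean"):
    """Compute one fill value per column from a strategy or a constant."""
    strategies = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    names = list(X)
    fills = fill if isinstance(fill, (list, tuple)) else [fill] * len(names)
    if len(fills) != len(names):
        raise ValueError("need exactly one fill per column")
    out = {}
    for name, f in zip(names, fills):
        observed = [v for v in X[name] if not math.isnan(v)]
        # A string selects a strategy; anything else is used as a constant.
        out[name] = strategies[f](observed) if isinstance(f, str) else f
    return out


def impute(X, fills):
    """Replace NaN entries with the fitted fill values (returns a copy)."""
    return {
        name: [fills[name] if math.isnan(v) else v for v in values]
        for name, values in X.items()
    }


nan = float("nan")
X = {"a": [1.0, nan, 2.1], "b": [nan, 2.3, 2.1], "c": [3.1, nan, 3.1]}
fills = fit_fills(X, fill=["mean", -999, "mode"])
print(impute(X, fills))
```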

class skutil.preprocessing.SelectiveScaler(cols=None, scaler=StandardScaler(copy=True, with_mean=True, with_std=True), as_df=True)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

A class that will apply scaling only to a select group of columns. Useful for data that may contain features that should not be scaled, such as those that have been dummied, or for any already-in-scale features. It also lets you scale some features differently from others, for example by placing two back-to-back SelectiveScaler instances with different columns and strategies in a pipeline object.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

scaler : instance of a sklearn Scaler, optional (default=StandardScaler)

The scaler to fit against cols. Must be an instance of an sklearn.preprocessing scaler, such as StandardScaler.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

is_fit_ : bool

The SelectiveScaler parameter scaler is set in the constructor, but to remain true to the sklearn API, we need to ensure fit is called prior to transform. Thus, we set this attribute in the fit method, which performs some validation, to ensure the scaler parameter has been validated.

Examples

The following example will scale only the first two features in the iris dataset:

>>> from skutil.preprocessing import SelectiveScaler
>>> from skutil.utils import load_iris_df
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> X = load_iris_df(include_tgt=False)
>>>
>>> trans = SelectiveScaler(cols=X.columns[:2])
>>> X_transform = trans.fit_transform(X)
>>>
>>> X_transform.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.032057                1.4               0.2
1          -1.143017         -0.124958                1.4               0.2
2          -1.385353          0.337848                1.3               0.2
3          -1.506521          0.106445                1.5               0.2
4          -1.021849          1.263460                1.4               0.2

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.
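
The fit/transform contract described above can be sketched in plain Python, independent of skutil. This is a minimal illustration and not the library's implementation: ColumnStandardizer is a hypothetical name, columns are held in a dict of lists rather than a DataFrame, and standardization is assumed to use the population standard deviation, as sklearn's StandardScaler does.

```python
import math


class ColumnStandardizer:
    """Hypothetical stand-in for a selective scaler (not the skutil class)."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Fit only on the requested columns, or on all of them if cols is None.
        cols = list(X) if self.cols is None else list(self.cols)
        self.stats_ = {}
        for c in cols:
            mean = sum(X[c]) / len(X[c])
            var = sum((v - mean) ** 2 for v in X[c]) / len(X[c])
            # Guard against constant columns to avoid dividing by zero.
            self.stats_[c] = (mean, math.sqrt(var) or 1.0)
        self.is_fit_ = True
        return self

    def transform(self, X):
        if not getattr(self, "is_fit_", False):
            raise RuntimeError("call fit before transform")
        # Work on a copy so the caller's data is left untouched.
        out = {c: list(vals) for c, vals in X.items()}
        for c, (mean, std) in self.stats_.items():
            out[c] = [(v - mean) / std for v in X[c]]
        return out


X = {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": [5.0, 5.0, 5.0]}
scaled = ColumnStandardizer(cols=["a", "b"]).fit(X).transform(X)
```

Only "a" and "b" are rescaled here; "c" is still present in the result with its original values, and X itself is not modified.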

class skutil.preprocessing.SpatialSignTransformer(cols=None, n_jobs=1, as_df=True)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

Project the feature space of a matrix into a multi-dimensional sphere by dividing each feature by its squared norm.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

n_jobs : int, optional (default=1)

The number of jobs to use for the computation. This works by computing each feature's squared norm in parallel.

If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
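
The n_jobs rule above reduces to simple arithmetic. The helper below is a hypothetical illustration (resolve_n_jobs is not part of skutil) that maps an n_jobs value to a worker count on a machine with n_cpus cores. Flooring the result at one worker is an assumption.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Hypothetical helper: worker count implied by n_jobs on n_cpus cores."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on; never fewer than one.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


# Worker counts for n_jobs = 1, -1 and -2 on an 8-core machine.
workers = [resolve_n_jobs(n, n_cpus=8) for n in (1, -1, -2)]
```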

Attributes:

sq_nms_ : dict

The squared norms for each feature.

Methods

fit(X[, y])             Fit the transformer.
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.

fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.
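
As a rough illustration of the arithmetic described for this class, the sketch below divides each selected column by its squared norm, in plain Python and independent of skutil. The function names and the dict-of-lists layout are assumptions made for illustration. Treating the norm of an all-zero column as infinity is also an assumption.

```python
def squared_norms(X, cols=None):
    """Squared L2 norm (sum of squares) of each selected column."""
    cols = list(X) if cols is None else list(cols)
    norms = {}
    for c in cols:
        sq = sum(v * v for v in X[c])
        # Assumption: an all-zero column gets an infinite norm, so it maps to zeros.
        norms[c] = sq if sq != 0 else float("inf")
    return norms


def apply_norms(X, norms):
    """Return a copy of X with each fitted column divided by its squared norm."""
    out = {c: list(vals) for c, vals in X.items()}
    for c, sq in norms.items():
        out[c] = [v / sq for v in X[c]]
    return out


X = {"a": [3.0, 4.0], "b": [1.0, 1.0], "label": [0, 1]}
sq_nms = squared_norms(X, cols=["a", "b"])  # fit step: one squared norm per column
X_t = apply_norms(X, sq_nms)                # transform step: "label" passes through
```

The squared norms are computed once at fit time and reused at transform time, so new data is divided by the norms of the training columns rather than by its own.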

class skutil.preprocessing.UndersamplingClassBalancer(y, ratio=0.2, shuffle=True, as_df=True)[source]

Bases: skutil.preprocessing.balance._BaseBalancer

Undersample the majority class until it is represented at the target proportion to the most-represented minority class (i.e., the second-most populous class).

Parameters:

y : str

The name of the response column. The response column must be biclass, no more or less.

ratio : float, optional (default=0.2)

The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input matrix.

shuffle : bool, optional (default=True)

Whether or not to shuffle rows on return.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Examples

Consider the following example: with a ratio of 0.5, the majority class (0) will be undersampled until the second most-populous class (1) is represented at a ratio of 0.5.

>>> from skutil.preprocessing import UndersamplingClassBalancer
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> # 150 zeros, 30 ones and 10 twos
>>> X = pd.DataFrame(np.concatenate([np.zeros(150), np.ones(30), np.ones(10)*2]), columns=['A'])
>>> sampler = UndersamplingClassBalancer(y="A", ratio=0.5)
>>>
>>> X_balanced = sampler.balance(X)
>>> X_balanced['A'].value_counts().sort_index()
0.0    60
1.0    30
2.0    10
Name: A, dtype: int64
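
The arithmetic behind this result can be sketched without skutil. With 30 records in the second-most-populous class and a ratio of 0.5, the majority class is cut to 30 / 0.5 = 60 records. The function below is a simplified illustration: undersample_majority is a hypothetical name, and the rounding and sampling choices are assumptions, not the library's behavior.

```python
import random
from collections import Counter


def undersample_majority(labels, ratio, seed=0):
    """Hypothetical sketch: indices kept after undersampling the majority class."""
    ranked = Counter(labels).most_common()
    (majority, n_major), (_, n_second) = ranked[0], ranked[1]
    # Already balanced enough: keep every row.
    if n_second / n_major >= ratio:
        return list(range(len(labels)))
    target = int(n_second / ratio)  # e.g. 30 / 0.5 -> 60 majority rows
    rng = random.Random(seed)
    major_idx = [i for i, lab in enumerate(labels) if lab == majority]
    keep_major = set(rng.sample(major_idx, target))
    return [i for i, lab in enumerate(labels) if lab != majority or i in keep_major]


labels = [0] * 150 + [1] * 30 + [2] * 10
kept = undersample_majority(labels, ratio=0.5)
class_counts = Counter(labels[i] for i in kept)
```

Only the majority class loses rows; the two smaller classes keep all of their records, which matches the counts shown above.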

Methods

balance(X)              Apply the undersampling balance operation.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.

balance(X)[source]

Apply the undersampling balance operation. Undersamples the majority class to the provided ratio over the second-most-populous class label.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The data to balance.

Returns:

blnc : pandas DataFrame, shape=(n_samples, n_features)

The balanced dataframe. If self.shuffle is True, the dataframe is explicitly shuffled. If self.shuffle is False, preservation of the original, natural ordering is still not guaranteed.

class skutil.preprocessing.YeoJohnsonTransformer(cols=None, n_jobs=1, as_df=True)[source]

Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin

Estimate a lambda parameter for each feature, and transform it to a distribution more closely resembling a Gaussian bell using the Yeo-Johnson transformation.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

n_jobs : int, optional (default=1)

The number of jobs to use for the computation. This works by estimating each of the feature lambdas in parallel.

If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

lambda_ : dict

The lambda values corresponding to each feature.
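
For reference, the Yeo-Johnson transformation of a single value for a given lambda can be sketched in plain Python as below. This illustrates the standard published formula only: yeo_johnson is a hypothetical helper name, and how skutil estimates each lambda is not shown here.

```python
import math


def yeo_johnson(y, lam):
    """Standard Yeo-Johnson transform of a single value y for a given lambda."""
    if y >= 0:
        if lam == 0:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


values = [-2.0, -0.5, 0.0, 0.5, 2.0]
# With lambda = 1 the transform is the identity, for negative values too.
unchanged = [yeo_johnson(v, 1.0) for v in values]
# With lambda = 0 non-negative values are mapped through log(y + 1).
logged = [yeo_johnson(v, 0.0) for v in values]
```

Unlike the Box-Cox transformation, the formula is defined for zero and negative inputs, so no shifting of the data is needed beforehand.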

Methods

fit(X[, y])             Fit the transformer.
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.

fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.