skutil.preprocessing module
Provides sklearn-esque transformer classes including the Box-Cox transformation and the Yeo-Johnson transformation. Also includes selective scalers and other transformers.
-
class skutil.preprocessing.BaggedCategoricalImputer(cols=None, base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, as_df=True, fill=None)
Bases: skutil.preprocessing.impute._BaseBaggedImputer
Performs imputation on select columns by using BaggingClassifiers on the provided columns.
- cols : array_like, optional (default=None)
- The columns on which the transformer will be fit. In the case that cols is None, the transformer will be fit on all columns. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
- base_estimator : object or None, optional (default=None)
- The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
- n_estimators : int, optional (default=10)
- The number of base estimators in the ensemble.
- max_samples : int or float, optional (default=1.0)
- The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
- max_features : int or float, optional (default=1.0)
- The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
- bootstrap : boolean, optional (default=True)
- Whether samples are drawn with replacement.
- bootstrap_features : boolean, optional (default=True)
- Whether features are drawn with replacement.
- oob_score : bool, optional (default=False)
- Whether to use out-of-bag samples to estimate the generalization error.
- n_jobs : int, optional (default=1)
- The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
- random_state : int, RandomState instance or None, optional (default=None)
- If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- verbose : int, optional (default=0)
- Controls the verbosity of the building process.
- as_df : bool, optional (default=True)
- Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
- fill : int, optional (default=None)
- The fill to use for missing values in the training matrix when fitting a BaggingClassifier. If None, will default to -999999.
Attributes: models_ : dict, (string
A dictionary mapping column names to the fit bagged estimator.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skutil.preprocessing import BaggedCategoricalImputer
>>>
>>> nan = np.nan
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1.0, nan, 4.0],
...                                 [nan, 1.0, nan],
...                                 [2.0, 2.0, 3.0]]),
...                               columns=['a','b','c'])
>>> imputer = BaggedCategoricalImputer(random_state=42)
>>> imputer.fit_transform(X)
     a    b    c
0  1.0  2.0  4.0
1  2.0  1.0  4.0
2  2.0  2.0  3.0
Methods
fit(X[, y]): Fit the bagged imputer.
fit_transform(X[, y]): Fit the bagged imputer and return the transformed (imputed) matrix.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Impute the test data after fit.
-
class skutil.preprocessing.BaggedImputer(cols=None, base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, as_df=True, fill=None)
Bases: skutil.preprocessing.impute._BaseBaggedImputer
Performs imputation on select columns by using BaggingRegressors on the provided columns.
- cols : array_like, optional (default=None)
- The columns on which the transformer will be fit. In the case that cols is None, the transformer will be fit on all columns. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
- base_estimator : object or None, optional (default=None)
- The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
- n_estimators : int, optional (default=10)
- The number of base estimators in the ensemble.
- max_samples : int or float, optional (default=1.0)
- The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
- max_features : int or float, optional (default=1.0)
- The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
- bootstrap : boolean, optional (default=True)
- Whether samples are drawn with replacement.
- bootstrap_features : boolean, optional (default=True)
- Whether features are drawn with replacement.
- oob_score : bool, optional (default=False)
- Whether to use out-of-bag samples to estimate the generalization error.
- n_jobs : int, optional (default=1)
- The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
- random_state : int, RandomState instance or None, optional (default=None)
- If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- verbose : int, optional (default=0)
- Controls the verbosity of the building process.
- as_df : bool, optional (default=True)
- Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
- fill : int, optional (default=None)
- The fill to use for missing values in the training matrix when fitting a BaggingRegressor. If None, will default to -999999.
Attributes: models_ : dict, (string
A dictionary mapping column names to the fit bagged estimator.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skutil.preprocessing import BaggedImputer
>>>
>>> nan = np.nan
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1.0, nan, 3.1],
...                                 [nan, 2.3, nan],
...                                 [2.1, 2.1, 3.1]]),
...                               columns=['a','b','c'])
>>> imputer = BaggedImputer(random_state=42)
>>> imputer.fit_transform(X)
       a     b    c
0  1.000  2.16  3.1
1  1.715  2.30  3.1
2  2.100  2.10  3.1
Methods
fit(X[, y]): Fit the bagged imputer.
fit_transform(X[, y]): Fit the bagged imputer and return the transformed (imputed) matrix.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Impute the test data after fit.
-
class skutil.preprocessing.BoxCoxTransformer(cols=None, n_jobs=1, as_df=True, shift_amt=1e-06)
Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin
Estimate a lambda parameter for each feature, and transform it to a distribution more closely resembling a Gaussian bell using the Box-Cox transformation.
Parameters: cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
n_jobs : int, optional (default=1)
The number of jobs to use for the computation. This works by estimating each of the feature lambdas in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
shift_amt : float, optional (default=1e-6)
Since the Box-Cox transformation requires that all values be positive (above zero), any features that contain sub-zero elements will be shifted up by the absolute value of the minimum element plus this amount in the fit method. In the transform method, if any of the test data is less than zero after shifting, it will be truncated at the shift_amt value.
Attributes: shift_ : dict
The shifts for each feature needed to shift the min value in the feature up to at least 0.0, as every element must be positive
lambda_ : dict
The lambda values corresponding to each feature
Methods
fit(X[, y]): Fit the transformer.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)
Fit the transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
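The shifting and transformation steps described for BoxCoxTransformer can be sketched in plain NumPy. This is only an illustration under stated assumptions, not skutil's implementation: the helper names compute_shift and box_cox are hypothetical, features whose minimum is exactly zero are assumed to be shifted as well, and lambda is supplied by hand rather than estimated.

```python
import numpy as np

def compute_shift(x, shift_amt=1e-6):
    """Hypothetical helper: amount to add so every value is strictly positive."""
    min_val = np.min(x)
    # Assumption: a minimum of exactly zero is shifted too, since Box-Cox needs x > 0
    return abs(min_val) + shift_amt if min_val <= 0 else 0.0

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)                      # limiting case as lambda -> 0
    return (np.power(x, lam) - 1.0) / lam

feature = np.array([-2.0, 0.0, 1.0, 5.0])
shift = compute_shift(feature)                # |min| + shift_amt
shifted = feature + shift                     # the minimum is now about shift_amt
transformed = box_cox(shifted, lam=0.5)
```

In practice lambda would be estimated per feature, for example by maximizing the Box-Cox log-likelihood, which is the part the n_jobs parameter parallelizes.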
-
class skutil.preprocessing.FunctionMapper(cols=None, fun=None, **kwargs)
Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin
Apply a function to a column or set of columns.
Parameters: cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
fun : function, (default=None)
The function to apply to the feature(s). This function will be applied via lambda expression to each column (independent of one another). Therefore, the callable should accept an array-like argument.
Attributes: is_fit_ : bool
The FunctionMapper callable is set in the constructor, but to remain true to the sklearn API, we need to ensure fit is called prior to transform. Thus, we set this attribute in the fit method, which performs some validation, to ensure the fun parameter has been validated.
Examples
The following example will apply a cube-root transformation to the first two columns in the iris dataset.
>>> from skutil.preprocessing import FunctionMapper
>>> from skutil.utils import load_iris_df
>>> import pandas as pd
>>> import numpy as np
>>>
>>> X = load_iris_df(include_tgt=False)
>>>
>>> # define the function
>>> def cube_root(x):
...     return np.power(x, 0.333)
>>>
>>> # make our transformer
>>> trans = FunctionMapper(cols=X.columns[:2], fun=cube_root)
>>> trans.fit_transform(X).head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           1.720366          1.517661                1.4               0.2
1           1.697600          1.441722                1.4               0.2
2           1.674205          1.473041                1.3               0.2
3           1.662258          1.457550                1.5               0.2
4           1.709059          1.531965                1.4               0.2
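As a side note on the function used in this example, an exponent of 0.333 only approximates a cube root. The snippet below, which works on plain arrays and does not involve skutil, compares it against NumPy's dedicated np.cbrt; either callable accepts an array-like argument and so could plausibly be passed as fun.

```python
import numpy as np

x = np.array([8.0, 27.0, 64.0])

approx = np.power(x, 0.333)   # the approximation used in the example
exact = np.cbrt(x)            # NumPy's dedicated cube-root function

# np.cbrt also accepts negative values, for which np.power with a
# fractional exponent would produce NaN
negative_root = np.cbrt(-8.0)
```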
Methods
fit(X[, y]): Fit the transformer.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)
Fit the transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
-
-
class skutil.preprocessing.ImputerMixin
A mixin for all imputer classes. Contains the default fill value. This mixin is used for the H2O imputer, as well.
Attributes: _def_fill : int (default=-999999)
The default fill value for NaN values
-
class skutil.preprocessing.InteractionTermTransformer(cols=None, as_df=True, interaction_function=None, name_suffix='I', only_return_interactions=False)
Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin
A class that will generate interaction terms between selected columns. An interaction captures some relationship between two independent variables in the form of In = (xi * xj).
Parameters: cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
interaction_function : callable, optional (default=None)
A callable for interactions. Default None will result in multiplication of two Series objects
name_suffix : str, optional (default=’I’)
The suffix to add to the new feature name in the form of <feature_x>_<feature_y>_<suffix>
only_return_interactions : bool, optional (default=False)
If set to True, will only return features in feature_names and their respective generated interaction terms.
Attributes: fun_ : callable
The interaction term function
Examples
The following example interacts the first two columns of the iris dataset using the default _mul function (product).
>>> from skutil.preprocessing import InteractionTermTransformer
>>> from skutil.utils import load_iris_df
>>> import pandas as pd
>>>
>>> X = load_iris_df(include_tgt=False)
>>>
>>> trans = InteractionTermTransformer(cols=X.columns[:2])
>>> X_transform = trans.fit_transform(X)
>>>
>>> assert X_transform.shape[1] == X.shape[1] + 1 # only added one column
>>> X_transform[X_transform.columns[-1]].head()
0    17.85
1    14.70
2    15.04
3    14.26
4    18.00
Name: sepal length (cm)_sepal width (cm)_I, dtype: float64
Methods
fit(X[, y]): Fit the transformer.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)
Fit the transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
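The naming and multiplication scheme for interaction terms can be sketched without skutil. The helper interaction_terms below is hypothetical, and it assumes that one term is generated for every unordered pair of the selected columns, using the default product and the <feature_x>_<feature_y>_<suffix> naming described above.

```python
from itertools import combinations
import numpy as np

def interaction_terms(columns, suffix="I"):
    """Hypothetical helper: multiply every pair of columns.

    `columns` maps a feature name to a 1-D array; the result maps
    '<feature_x>_<feature_y>_<suffix>' to the element-wise product.
    """
    terms = {}
    for (name_x, x), (name_y, y) in combinations(columns.items(), 2):
        terms["%s_%s_%s" % (name_x, name_y, suffix)] = np.asarray(x) * np.asarray(y)
    return terms

# First three rows of the two iris columns used in the example
cols = {
    "sepal length (cm)": np.array([5.1, 4.9, 4.7]),
    "sepal width (cm)": np.array([3.5, 3.0, 3.2]),
}
terms = interaction_terms(cols)
```

With two selected columns this yields a single new feature, which matches the one added column asserted in the example.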
-
-
class skutil.preprocessing.OneHotCategoricalEncoder(fill='Missing', as_df=True)
Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin
This class achieves three things: first, it will fill in any NaN values with a provided surrogate (if desired). Second, it will dummy out any categorical features using OneHotEncoding with a safety feature that can handle previously unseen values, and in the transform method will re-append the dummified features to the dataframe. Finally, it will return either a Pandas DataFrame or a NumPy ndarray, depending on as_df.
Parameters: fill : str, optional (default = ‘Missing’)
The value that will fill the missing values in the column
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes: obj_cols_ : array_like
The list of object-type (categorical) features
lab_encoders_ : array_like
The label encoders
one_hot_ : an instance of a OneHotEncoder
trans_nms_ : the dummified names
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from skutil.preprocessing import OneHotCategoricalEncoder
>>>
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 ['USA','RED','a'],
...                                 ['MEX','GRN','b'],
...                                 ['FRA','RED','b']]),
...                               columns=['A','B','C'])
>>>
>>> o = OneHotCategoricalEncoder(as_df=True)
>>> o.fit_transform(X)
   A.FRA  A.MEX  A.USA  A.NA  B.GRN  B.RED  B.NA  C.a  C.b  C.NA
0    0.0    0.0    1.0   0.0    0.0    1.0   0.0  1.0  0.0   0.0
1    0.0    1.0    0.0   0.0    1.0    0.0   0.0  0.0  1.0   0.0
2    1.0    0.0    0.0   0.0    0.0    1.0   0.0  0.0  1.0   0.0
Methods
fit(X[, y]): Fit the encoder.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X, a DataFrame, by stripping out the object columns, dummifying them, and re-appending them to the end.
-
fit(X, y=None)
Fit the encoder.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to fit. The frame will only be fit on the object columns of the dataframe.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
transform(X)
Transform X, a DataFrame, by stripping out the object columns, dummifying them, and re-appending them to the end.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to transform.
Returns: x : Pandas DataFrame or np.ndarray, shape=(n_samples, n_features)
The encoded dataframe or array
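The "safety feature" for previously unseen values can be illustrated with a small stand-alone sketch. The functions fit_levels and one_hot are hypothetical, and the reading of the trailing .NA column as a flag for unseen levels is an assumption inferred from the column names in the example output, not taken from skutil's source.

```python
import numpy as np

def fit_levels(values):
    """Record the sorted distinct levels seen during fitting."""
    return sorted(set(values))

def one_hot(values, levels, prefix):
    """Encode values against known levels, plus a trailing '<prefix>.NA'
    column that flags any level not seen during fitting (assumed behavior)."""
    names = ["%s.%s" % (prefix, lvl) for lvl in levels] + ["%s.NA" % prefix]
    index = {lvl: i for i, lvl in enumerate(levels)}
    out = np.zeros((len(values), len(names)))
    for row, val in enumerate(values):
        out[row, index.get(val, len(levels))] = 1.0  # unseen -> NA column
    return names, out

levels = fit_levels(["USA", "MEX", "FRA"])
names, encoded = one_hot(["USA", "CAN"], levels, prefix="A")
```

Here "CAN" was never seen during fitting, so its row is marked only in the A.NA column instead of raising an error.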
-
-
class skutil.preprocessing.OversamplingClassBalancer(y, ratio=0.2, shuffle=True, as_df=True)
Bases: skutil.preprocessing.balance._BaseBalancer
Oversample all of the minority classes until they are represented at the target proportion to the majority class.
Parameters: y : str
The name of the response column. The response column must be biclass, no more or less.
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input matrix
shuffle : bool, optional (default=True)
Whether or not to shuffle rows on return
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Examples
Consider the following example: with a ratio of 0.5, the minority classes (1, 2) will be oversampled until they are represented at a ratio of at least 0.5 * the prevalence of the majority class (0).
>>> import pandas as pd
>>> import numpy as np
>>> from skutil.preprocessing import OversamplingClassBalancer
>>>
>>> # 100 zeros, 30 ones and 25 twos
>>> X = pd.DataFrame(np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2]), columns=['A'])
>>> sampler = OversamplingClassBalancer(y="A", ratio=0.5)
>>>
>>> X_balanced = sampler.balance(X)
>>> X_balanced['A'].value_counts().sort_index()
0.0    100
1.0     50
2.0     50
Name: A, dtype: int64
Methods
balance(X): Apply the oversampling balance operation.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
-
balance(X)
Apply the oversampling balance operation. Oversamples the minority class to the provided ratio of minority class : majority class
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The data to balance.
Returns: blnc : Pandas DataFrame, shape=(n_samples, n_features)
The balanced dataframe. The dataframe will be explicitly shuffled if self.shuffle is True; however, if self.shuffle is False, preservation of the original, natural ordering is not guaranteed.
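The arithmetic behind the target ratio can be sketched as follows. The function oversample is a hypothetical stand-in, not skutil's code: it assumes the target count per minority class is int(ratio * majority count) and that extra rows are drawn with replacement from the existing rows of that class.

```python
import numpy as np

def oversample(y, ratio=0.2, random_state=42):
    """Hypothetical helper: return row indices after oversampling.

    Each class with fewer than ratio * (majority count) rows is sampled
    with replacement until it reaches that target.
    """
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = int(ratio * counts.max())
    indices = [np.arange(len(y))]                 # keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            members = np.where(y == cls)[0]
            indices.append(rng.choice(members, size=target - count, replace=True))
    return np.concatenate(indices)

# 100 zeros, 30 ones and 25 twos, as in the example
y = np.concatenate([np.zeros(100), np.ones(30), np.ones(25) * 2])
idx = oversample(y, ratio=0.5)
balanced_counts = np.unique(y[idx], return_counts=True)[1]
```

With ratio=0.5 and 100 majority rows, both minority classes are topped up to 50 rows, which agrees with the value counts shown in the example.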
-
-
class skutil.preprocessing.SMOTEClassBalancer(y, ratio=0.2, shuffle=True, k=3, as_df=True)
Bases: skutil.preprocessing.balance._BaseBalancer
Balance a matrix with the SMOTE (Synthetic Minority Oversampling TEchnique) method. This will generate synthetic samples for the minority class(es) using K-nearest neighbors.
Parameters: y : str
The name of the response column. The response column must be biclass, no more or less.
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input matrix, otherwise SMOTE will impute records until the target ratio is reached.
shuffle : bool, optional (default=True)
Whether or not to shuffle rows on return
k : int, optional (default=3)
The number of neighbors to use in the nearest neighbors model
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Examples
Consider the following example: with a ratio of 0.5, the minority classes (1, 2) will be oversampled until they are represented at a ratio of at least 0.5 * the prevalence of the majority class (0).
>>> import pandas as pd
>>> import numpy as np
>>> from numpy.random import RandomState
>>> from skutil.preprocessing import SMOTEClassBalancer
>>>
>>> # establish a random state
>>> prng = RandomState(42)
>>>
>>> # 100 zeros, 30 ones and 25 twos
>>> X = pd.DataFrame(np.asarray([prng.rand(155),
...                              np.concatenate([np.zeros(100), np.ones(30), np.ones(25)*2])]).T,
...                  columns=['x', 'y'])
>>> sampler = SMOTEClassBalancer(y="y", ratio=0.5)
>>>
>>> X_balanced = sampler.balance(X)
>>> X_balanced['y'].value_counts().sort_index()
0.0    100
1.0     50
2.0     50
Name: y, dtype: int64
Methods
balance(X): Apply the SMOTE balancing operation.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
-
balance(X)
Apply the SMOTE balancing operation. Oversamples the minority class to the provided ratio of minority class : majority class by interpolating points between each sampled point’s k-nearest neighbors.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The data to balance.
Returns: X : Pandas DataFrame, shape=(n_samples, n_features)
The balanced dataframe. The dataframe will be explicitly shuffled if self.shuffle is True; however, if self.shuffle is False, preservation of the original, natural ordering is not guaranteed.
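The interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a generic illustration of the technique, not skutil's implementation; smote_samples is a hypothetical name, and the brute-force distance matrix stands in for whatever nearest-neighbors model the library actually uses.

```python
import numpy as np

def smote_samples(minority, n_new, k=3, random_state=42):
    """Hypothetical helper: create n_new synthetic rows from minority rows."""
    rng = np.random.RandomState(random_state)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a row is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest rows
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.randint(len(minority))
        neighbor = minority[rng.choice(neighbors[base])]
        gap = rng.rand()                          # position along the segment
        synthetic[i] = minority[base] + gap * (neighbor - minority[base])
    return synthetic

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_samples(minority, n_new=5)
```

Every synthetic row lies on the line segment between a sampled minority row and one of its k nearest minority neighbors, so new rows stay inside the region already occupied by that class.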
-
-
class skutil.preprocessing.SafeLabelEncoder
Bases: sklearn.preprocessing.label.LabelEncoder
An extension of LabelEncoder that will not throw an exception for unseen data, but will instead return a default value of 99999.
Attributes: classes_ : the classes that are encoded
Methods
fit(y): Fit label encoder.
fit_transform(y): Fit label encoder and return encoded labels.
get_params([deep]): Get parameters for this estimator.
inverse_transform(y): Transform labels back to original encoding.
set_params(**params): Set the parameters of this estimator.
transform(y): Perform encoding if already fit.
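The fallback behavior can be pictured with a dictionary lookup. The sketch below is not the class's actual code; fit_labels and safe_transform are hypothetical helpers that mimic the documented behavior of returning 99999 for labels not seen during fitting.

```python
import numpy as np

UNSEEN = 99999  # default value stated in the class description

def fit_labels(y):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(y)))}

def safe_transform(y, mapping):
    """Encode labels, substituting UNSEEN for anything not in the mapping."""
    return np.array([mapping.get(label, UNSEEN) for label in y])

mapping = fit_labels(["b", "a", "c", "a"])
codes = safe_transform(["a", "c", "z"], mapping)
```

The label "z" was not present when the mapping was built, so it is encoded as 99999 rather than raising an error.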
-
exception skutil.preprocessing.SamplingWarning
Bases: exceptions.UserWarning
Custom warning used to notify the user that sub-optimal sampling behavior has occurred. For instance, performing oversampling on a minority class with only one instance will cause this warning to be issued.
-
class skutil.preprocessing.SelectiveImputer(cols=None, as_df=True, fill='mean')
Bases: skutil.preprocessing.impute._BaseImputer
A more customizable form of sklearn’s Imputer class. This class can handle more than mean, median or most common; it will also take numeric values. Moreover, it will take a vector of strategies or values with which to impute corresponding columns.
Parameters: cols : array_like, optional (default=None)
The columns on which the transformer will be fit. In the case that cols is None, the transformer will be fit on all columns. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
fill : int, float, string or array_like, optional (default=’mean’)
The fill to use for missing values in the training matrix when fitting a SelectiveImputer. If None, will default to ‘mean’
Attributes: fills_ : iterable, int or float
The imputer fill-values
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skutil.preprocessing import SelectiveImputer
>>>
>>> nan = np.nan
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1.0, nan, 3.1],
...                                 [nan, 2.3, nan],
...                                 [2.1, 2.1, 3.1]]),
...                               columns=['a','b','c'])
>>> imputer = SelectiveImputer(fill=['mean', -999, 'mode'])
>>> imputer.fit_transform(X)
      a      b    c
0  1.00 -999.0  3.1
1  1.55    2.3  3.1
2  2.10    2.1  3.1
Methods
fit(X[, y]): Fit the imputer.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a dataframe given the fit imputer.
-
fit(X, y=None)
Fit the imputer.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
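How a vector of strategies and literal values maps onto columns can be sketched with NumPy alone. The helper resolve_fill is hypothetical and only mirrors the strategies named in the class description; it reuses the data from the example so the result can be compared by eye.

```python
import numpy as np

def resolve_fill(column, fill):
    """Turn a strategy name or a literal number into a concrete fill value."""
    observed = column[~np.isnan(column)]          # ignore missing entries
    if fill == "mean":
        return observed.mean()
    if fill == "median":
        return np.median(observed)
    if fill == "mode":
        values, counts = np.unique(observed, return_counts=True)
        return values[np.argmax(counts)]          # most common value
    return float(fill)                            # a literal numeric fill

X = np.array([[1.0, np.nan, 3.1],
              [np.nan, 2.3, np.nan],
              [2.1, 2.1, 3.1]])
fills = [resolve_fill(X[:, j], f) for j, f in enumerate(["mean", -999, "mode"])]

imputed = X.copy()
for j, value in enumerate(fills):
    imputed[np.isnan(imputed[:, j]), j] = value   # fill only the missing cells
```

The fill values are computed once from the fitting data, which is why a fitted imputer can later apply the same values to new frames.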
-
-
class skutil.preprocessing.SelectiveScaler(cols=None, scaler=StandardScaler(copy=True, with_mean=True, with_std=True), as_df=True)
Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin
A class that will apply scaling only to a select group of columns. Useful for data that may contain features that should not be scaled, such as those that have been dummied, or for any already-in-scale features. Perhaps, even, there are some features you’d like to scale in a different manner than others. This, then, allows two back-to-back SelectiveScaler instances with different columns & strategies in a pipeline object.
Parameters: cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
scaler : instance of a sklearn Scaler, optional (default=StandardScaler)
The scaler to fit against cols. Must be an instance of sklearn.preprocessing.BaseScaler.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes: is_fit_ : bool
The SelectiveScaler parameter scaler is set in the constructor, but to remain true to the sklearn API, we need to ensure fit is called prior to transform. Thus, we set this attribute in the fit method, which performs some validation, to ensure the scaler parameter has been validated.
Examples
The following example will scale only the first two features in the iris dataset:
>>> from skutil.preprocessing import SelectiveScaler
>>> from skutil.utils import load_iris_df
>>> import pandas as pd
>>> import numpy as np
>>>
>>> X = load_iris_df(include_tgt=False)
>>>
>>> trans = SelectiveScaler(cols=X.columns[:2])
>>> X_transform = trans.fit_transform(X)
>>>
>>> X_transform.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.032057                1.4               0.2
1          -1.143017         -0.124958                1.4               0.2
2          -1.385353          0.337848                1.3               0.2
3          -1.506521          0.106445                1.5               0.2
4          -1.021849          1.263460                1.4               0.2
Methods
fit(X[, y]): Fit the transformer.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)
Fit the transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters: X : Pandas DataFrame
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
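What "scaling only a select group of columns" amounts to can be shown with a small NumPy sketch. The helper scale_selected is hypothetical and hard-codes standardization (zero mean, unit variance with the population standard deviation, as StandardScaler does) instead of accepting an arbitrary scaler.

```python
import numpy as np

def scale_selected(X, col_idx):
    """Standardize only the columns in col_idx; leave the rest untouched."""
    out = np.array(X, dtype=float)                # work on a copy of the input
    selected = out[:, col_idx]
    # Population standard deviation (ddof=0), matching StandardScaler
    out[:, col_idx] = (selected - selected.mean(axis=0)) / selected.std(axis=0)
    return out

# First four rows of the iris measurements
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              [4.6, 3.1, 1.5, 0.2]])
scaled = scale_selected(X, col_idx=[0, 1])
```

The last two columns pass through unchanged and the input array itself is not modified, mirroring the copy semantics described for transform.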
-
-
class
skutil.preprocessing.
SpatialSignTransformer
(cols=None, n_jobs=1, as_df=True)[source]¶ Bases:
skutil.base.BaseSkutil
,sklearn.base.TransformerMixin
Project the feature space of a matrix into a multi-dimensional sphere by dividing each feature by its squared norm.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. The transformation applies only to the specified columns; any other columns will still be present after transformation. Because this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
n_jobs : int, optional (default=1)
The number of jobs to use for the computation. This works by computing the squared norm of each feature in parallel.
If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
sq_nms_ : dict
The squared norms for each feature.
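The n_jobs rule in the parameter description can be worked through with a small helper. This is only an illustrative sketch: the name effective_n_jobs_sketch and its n_cpus argument are hypothetical and are not part of skutil.

```python
import os


def effective_n_jobs_sketch(n_jobs, n_cpus=None):
    """Hypothetical helper: map an n_jobs value to a worker count.

    Follows the rule quoted in the parameter description:
    -1 means all CPUs, and values below -1 mean n_cpus + 1 + n_jobs.
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # e.g. n_jobs=-2 on 8 CPUs gives 8 + 1 - 2 = 7 workers
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(effective_n_jobs_sketch(-1, n_cpus=8))  # 8
print(effective_n_jobs_sketch(-2, n_cpus=8))  # 7
print(effective_n_jobs_sketch(1, n_cpus=8))   # 1
```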
Methods
fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)[source]¶
Fit the transformer.
Parameters:
X : Pandas DataFrame
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
-
transform(X)[source]¶
Transform a test matrix given the already-fit transformer.
Parameters:
X : Pandas DataFrame
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns:
X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
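As a rough illustration of the behavior documented for this class, the sketch below computes a squared norm per selected column and then divides each column by it on a copy of the frame. It is not the skutil implementation: the function names are hypothetical, and replacing a zero norm with 1.0 is an assumption made here only to avoid division by zero.

```python
import pandas as pd


def fit_squared_norms(X, cols=None):
    """Return {column: squared L2 norm} for the selected columns.

    Assumption: a zero norm is replaced with 1.0 to avoid dividing by zero.
    """
    cols = list(X.columns) if cols is None else list(cols)
    norms = {}
    for name in cols:
        sq = float((X[name] ** 2).sum())
        norms[name] = sq if sq != 0.0 else 1.0
    return norms


def apply_squared_norms(X, sq_norms):
    """Divide each fitted column by its squared norm, working on a copy."""
    out = X.copy()
    for name, sq in sq_norms.items():
        out[name] = out[name] / sq
    return out


X = pd.DataFrame({"a": [3.0, 4.0], "b": [1.0, 1.0], "label": ["x", "y"]})
norms = fit_squared_norms(X, cols=["a", "b"])
X_t = apply_squared_norms(X, norms)

print(norms)  # {'a': 25.0, 'b': 2.0}
print(X_t)
```

The non-selected column ("label") passes through untouched, and the input frame X is left unmodified, mirroring the copy semantics described for transform.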
-
-
class skutil.preprocessing.UndersamplingClassBalancer(y, ratio=0.2, shuffle=True, as_df=True)[source]¶
Bases: skutil.preprocessing.balance._BaseBalancer
Undersample the majority class until it is represented at the target proportion to the most-represented minority class (i.e., the second-most populous class).
Parameters:
y : str
The name of the response column. The response column must be biclass, no more or less.
ratio : float, optional (default=0.2)
The target ratio of the minority records to the majority records. If the existing ratio is >= the provided ratio, the return value will merely be a copy of the input matrix.
shuffle : bool, optional (default=True)
Whether or not to shuffle rows on return.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the balance method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Examples
Consider the following example: with a ratio of 0.5, the majority class (0) will be undersampled until the second most-populous class (1) is represented at a ratio of 0.5.
>>> import pandas as pd
>>> import numpy as np
>>> from skutil.preprocessing import UndersamplingClassBalancer
>>>
>>> # 150 zeros, 30 ones and 10 twos
>>> X = pd.DataFrame(np.concatenate([np.zeros(150), np.ones(30), np.ones(10)*2]), columns=['A'])
>>> sampler = UndersamplingClassBalancer(y="A", ratio=0.5)
>>>
>>> X_balanced = sampler.balance(X)
>>> X_balanced['A'].value_counts().sort_index()
0.0    60
1.0    30
2.0    10
Name: A, dtype: int64
Methods
balance(X) Apply the undersampling balance operation.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
-
balance(X)[source]¶
Apply the undersampling balance operation. Undersamples the majority class to the provided ratio over the second-most-populous class label.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
The data to balance.
Returns:
blnc : Pandas DataFrame, shape=(n_samples, n_features)
The balanced dataframe. The dataframe will be explicitly shuffled if self.shuffle is True; however, if self.shuffle is False, preservation of the original, natural ordering is not guaranteed.
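The arithmetic behind the ratio parameter can be sketched as follows. This is an assumption-laden illustration rather than the library's code: the helper name majority_target_count is hypothetical, and rounding to the nearest integer is a choice made here that merely agrees with the documented example (150 majority rows, 30 minority rows, ratio 0.5, leaving 60).

```python
def majority_target_count(n_majority, n_minority, ratio):
    """Hypothetical helper: number of majority-class rows to keep.

    n_minority is the count of the second-most-populous class.
    If the existing minority/majority ratio already meets the target,
    nothing is removed.
    """
    if n_minority / n_majority >= ratio:
        return n_majority
    # Keep enough majority rows so that n_minority / kept is about ratio.
    return int(round(n_minority / ratio))


print(majority_target_count(150, 30, 0.5))  # 60
print(majority_target_count(150, 30, 0.2))  # 150 (already at the target ratio)
```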
-
-
class skutil.preprocessing.YeoJohnsonTransformer(cols=None, n_jobs=1, as_df=True)[source]¶
Bases: skutil.base.BaseSkutil, sklearn.base.TransformerMixin
Estimate a lambda parameter for each feature, and transform it to a distribution more closely resembling a Gaussian bell using the Yeo-Johnson transformation.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. The transformation applies only to the specified columns; any other columns will still be present after transformation. Because this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
n_jobs : int, optional (default=1)
The number of jobs to use for the computation. This works by estimating each of the feature lambdas in parallel.
If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
lambda_ : dict
The lambda values corresponding to each feature.
Methods
fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)[source]¶
Fit the transformer.
Parameters:
X : Pandas DataFrame
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
-
transform(X)[source]¶
Transform a test matrix given the already-fit transformer.
Parameters:
X : Pandas DataFrame
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns:
X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
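For reference, the standard Yeo-Johnson transformation of a single value for a given lambda can be sketched in plain Python as below. This shows only the published piecewise definition; how skutil estimates each entry of lambda_ is not covered here, and the function name yeo_johnson and the tolerance eps are hypothetical choices for this sketch.

```python
import math


def yeo_johnson(y, lam, eps=1e-12):
    """Apply the Yeo-Johnson transformation to one value for a fixed lambda."""
    if y >= 0:
        if abs(lam) < eps:
            # lambda == 0 and y >= 0: log(y + 1)
            return math.log1p(y)
        # lambda != 0 and y >= 0: ((y + 1)^lambda - 1) / lambda
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < eps:
        # lambda == 2 and y < 0: -log(-y + 1)
        return -math.log1p(-y)
    # lambda != 2 and y < 0: -((-y + 1)^(2 - lambda) - 1) / (2 - lambda)
    return -(((-y + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


values = [-3.0, -0.5, 0.0, 0.5, 3.0]
print([yeo_johnson(v, 0.5) for v in values])
```

Unlike Box-Cox, the transformation is defined for negative inputs, and lam = 1 leaves every value unchanged.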