skutil.feature_selection module

skutil.feature_selection provides transformers that accept an array of column names and drop the columns that the fit method flags for removal. This shared behavior is defined in the _BaseFeatureSelector class, which all public classes within select.py extend. For example, the LinearCombinationFilterer class removes features that are linear combinations of other features.

class skutil.feature_selection.FeatureDropper(cols=None, as_df=True)[source]

Bases: skutil.feature_selection.base._BaseFeatureSelector

A very simple class, usable at the beginning or at any later stage of a Pipeline, that drops the given features from the remainder of the pipe.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The features to drop. Note that FeatureDropper behaves slightly differently from all other _BaseFeatureSelector classes in the sense that it will drop all of the features prescribed in this parameter. However, if cols is None, it will not drop any (which is counter to other classes, which will operate on all columns in the absence of an explicit cols parameter).

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

drop_ : array_like, shape=(n_features,)

Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import FeatureDropper
>>>
>>> X = pd.DataFrame.from_records(data=np.random.rand(3,3), columns=['a','b','c'])
>>> dropper = FeatureDropper(cols=['a','b'])
>>> X_transform = dropper.fit_transform(X)
>>> assert X_transform.shape[1] == 1 # drop out first two columns

Methods

fit(X[, y])
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]
class skutil.feature_selection.FeatureRetainer(cols=None, as_df=True)[source]

Bases: skutil.feature_selection.base._BaseFeatureSelector

A very simple class, intended for the beginning of a Pipeline, that propagates only the given features through the remainder of the pipe.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns to retain. If no column names are provided, the transformer will be fit on the entire frame, so every column is retained. Note that, unlike the other _BaseFeatureSelector classes, any column not named in cols is dropped during transformation.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

drop_ : array_like, shape=(n_features,)

Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import FeatureRetainer
>>>
>>> X = pd.DataFrame.from_records(data=np.random.rand(3,3), columns=['a','b','c'])
>>> retainer = FeatureRetainer(cols=['a','b'])
>>> X_transform = retainer.fit_transform(X)
>>> assert X_transform.shape[1] == 2 # retain first two columns
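
The difference between dropping and retaining can be sketched with plain pandas, without skutil. This only illustrates the column semantics described for FeatureDropper and FeatureRetainer; it is not the library's implementation, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
cols = ['a', 'b']

# FeatureDropper-like semantics: remove the named columns
dropped = X.drop(columns=cols)

# FeatureRetainer-like semantics: keep only the named columns
retained = X[cols]

print(list(dropped.columns))   # ['c']
print(list(retained.columns))  # ['a', 'b']
```

In both cases the original frame X is left unchanged, since each operation returns a new frame.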

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

transform(X)[source]

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to transform. The prescribed drop_ columns will be dropped and a copy of X will be returned.

Returns:

dropped : Pandas DataFrame or np.ndarray, shape=(n_samples, n_features)

The test data with the prescribed drop_ columns removed.
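
The fit/transform contract described above (fit records drop_ without altering X, transform returns a copy with those columns removed, as_df controls the return type) can be sketched with a small stand-in class. SketchSelector and its single-value rule are hypothetical and exist only for illustration; they are not part of skutil.

```python
import numpy as np
import pandas as pd


class SketchSelector:
    """Hypothetical stand-in for a feature selector; not part of skutil."""

    def __init__(self, cols=None, as_df=True):
        self.cols = cols
        self.as_df = as_df

    def fit(self, X, y=None):
        # y is accepted only so the object matches the Pipeline calling convention
        cols = list(X.columns) if self.cols is None else list(self.cols)
        # Toy rule (assumption): flag columns that hold a single distinct value
        self.drop_ = [c for c in cols if X[c].nunique() <= 1]
        return self

    def transform(self, X):
        # DataFrame.drop returns a new frame, so the input is left untouched
        dropped = X.drop(columns=self.drop_)
        return dropped if self.as_df else dropped.to_numpy()


X = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7], 'c': [0.5, 0.1, 0.9]})
selector = SketchSelector().fit(X)
out = selector.transform(X)
as_array = SketchSelector(as_df=False).fit(X).transform(X)
```

Here selector.drop_ holds only 'b', out keeps columns 'a' and 'c', and as_array is a NumPy ndarray with the same two columns.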

class skutil.feature_selection.LinearCombinationFilterer(cols=None, as_df=True)[source]

Bases: skutil.feature_selection.base._BaseFeatureSelector

The LinearCombinationFilterer will resolve linear combinations in a numeric matrix. The QR decomposition is used to determine whether the matrix is full rank, and then to identify the sets of columns that are involved in the dependencies. This class is adapted from the implementation in the R package caret.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

drop_ : array_like, shape=(n_features,)

Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> from skutil.utils import load_iris_df
>>> from skutil.feature_selection import LinearCombinationFilterer
>>>
>>> X = load_iris_df(include_tgt=False)
>>> filterer = LinearCombinationFilterer()
>>> X_transform = filterer.fit_transform(X)
>>> assert X_transform.shape[1] == 4 # no combos in iris...
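
To make the idea of QR-based dependency detection concrete, here is a minimal sketch using only NumPy and pandas. It is a simplified greedy variant, not the caret-derived algorithm that skutil implements: it walks the columns in order and flags a column when the last diagonal entry of R (for the kept columns plus the candidate) is numerically zero. The function name find_linear_combos and the absolute tolerance tol are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def find_linear_combos(X, tol=1e-8):
    """Return names of columns that are linear combinations of earlier kept columns."""
    kept, drops = [], []
    for name in X.columns:
        candidate = X[kept + [name]].to_numpy(dtype=float)
        n_rows, n_cols = candidate.shape
        if n_cols > n_rows:
            # More columns than rows: the new column cannot be independent
            drops.append(name)
            continue
        # mode='r' returns only the upper-triangular factor R, shape (n_cols, n_cols)
        r = np.linalg.qr(candidate, mode='r')
        # The kept columns are independent, so a zero last diagonal entry means
        # the candidate lies in their span
        if abs(r[-1, -1]) <= tol:
            drops.append(name)
        else:
            kept.append(name)
    return drops


X = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [0.0, 1.0, 0.0, 1.0],
    'c': [1.0, 3.0, 3.0, 5.0],  # c = a + b
    'd': [1.0, 0.0, 0.0, 2.0],
})
drops = find_linear_combos(X)
print(drops)  # ['c']
```

Because the tolerance is absolute, this sketch is sensitive to the scale of the data; a production implementation would likely use a relative tolerance or a pivoted decomposition.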

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit the transformer and return the transformed training array.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

fit_transform(X, y=None)[source]

Fit the transformer and return the transformed training array.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

dropped : Pandas DataFrame or np.ndarray

The training data with the prescribed drop_ columns removed.

class skutil.feature_selection.MulticollinearityFilterer(cols=None, threshold=0.85, method='pearson', as_df=True)[source]

Bases: skutil.feature_selection.base._BaseFeatureSelector

Filter out features with a correlation greater than the provided threshold. When a pair of correlated features is identified, the mean absolute correlation (MAC) of each feature is considered, and the feature with the highest MAC is discarded.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.

threshold : float, optional (default=0.85)

The threshold above which to filter correlated features.

method : str, optional (default='pearson')

The method used to compute the correlation, one of ['pearson', 'kendall', 'spearman'].

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

drop_ : array_like, shape=(n_features,)

Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

mean_abs_correlations_ : list of float

The mean absolute correlation corresponding to each name in drop_.

correlations_ : list of _MCFTuple instances

Contains detailed information on the multicollinear columns.

Examples

The following demonstrates a simple multicollinearity filterer applied to the iris dataset.

>>> import pandas as pd
>>> from skutil.utils import load_iris_df
>>> from skutil.feature_selection import MulticollinearityFilterer
>>>
>>> X = load_iris_df(include_tgt=False)
>>> mcf = MulticollinearityFilterer(threshold=0.85)
>>> mcf.fit_transform(X).head()
   sepal length (cm)  sepal width (cm)  petal width (cm)
0                5.1               3.5               0.2
1                4.9               3.0               0.2
2                4.7               3.2               0.2
3                4.6               3.1               0.2
4                5.0               3.6               0.2
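
The first step of this kind of filtering, finding feature pairs whose absolute correlation exceeds the threshold, can be sketched with pandas alone. This is an illustration on made-up data, not the code skutil runs; the frame X and the variable names are assumptions for the example.

```python
import pandas as pd

X = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.5],  # almost exactly 2 * x
    'z': [5.0, 1.0, 4.0, 2.0, 3.0],
})
threshold = 0.85

# Absolute pairwise correlations between all features
corr = X.corr(method='pearson').abs()
names = list(corr.columns)

# Each unordered pair above the threshold is a candidate for filtering
pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if corr.loc[a, b] > threshold
]
print(pairs)  # [('x', 'y')]
```

Only one feature of each flagged pair needs to be removed; the class description above explains that the mean absolute correlation decides which one.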

Methods

fit(X[, y]) Fit the multicollinearity filterer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the multicollinearity filterer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

class skutil.feature_selection.NearZeroVarianceFilterer(cols=None, threshold=1e-06, as_df=True, strategy='variance')[source]

Bases: skutil.feature_selection.base._BaseFeatureSelector

Identify and remove any features that have a variance below a certain threshold. There are two possible strategies for near-zero variance feature selection:

  1. Select features on the basis of the actual variance they exhibit. This is only relevant when the features are real numbers.
  2. Remove features where the ratio of the frequency of the most prevalent value to that of the second-most frequent value is large, say 20 or above (Kuhn & Johnson [R3]).

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.

threshold : float, optional (default=1e-6)

The threshold below which to declare “zero variance”.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

strategy : str, optional (default='variance')

The strategy by which feature selection should be performed, one of ('variance', 'ratio'). If strategy is 'variance', features will be selected based on the amount of variance they exhibit; those that are low-variance (below threshold) will be removed. If strategy is 'ratio', features are dropped if the ratio of the frequency of the most prevalent value to that of the second-most frequent value is greater than or equal to threshold. Note that if strategy is 'ratio', threshold must be greater than 1.

Attributes:

drop_ : array_like, shape=(n_features,)

Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

var_ : dict

The dropped columns mapped to their corresponding variances or ratios, depending on the strategy.

References

[R3] Kuhn, M. & Johnson, K. “Applied Predictive Modeling” (2013). New York, NY: Springer.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from skutil.feature_selection import NearZeroVarianceFilterer
>>> 
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1,2,3],
...                                 [4,5,3],
...                                 [6,7,3],
...                                 [8,9,3]]), 
...                               columns=['a','b','c'])
>>> filterer = NearZeroVarianceFilterer(threshold=0.05)
>>> filterer.fit_transform(X)
   a  b
0  1  2
1  4  5
2  6  7
3  8  9
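
The two strategies can be sketched with plain pandas as follows. This is not the skutil implementation: the helper names are hypothetical, the variance uses the pandas default (sample variance, ddof=1), and treating a column with a single distinct value as "high ratio" is an assumption made for this example.

```python
import pandas as pd


def low_variance_columns(X, threshold=1e-6):
    """'variance' strategy sketch: flag columns whose variance is below threshold."""
    variances = X.var()  # pandas default: sample variance (ddof=1)
    return [c for c in X.columns if variances[c] < threshold]


def high_ratio_columns(X, threshold=20.0):
    """'ratio' strategy sketch: most frequent count / second most frequent count."""
    drops = []
    for c in X.columns:
        counts = X[c].value_counts()  # sorted by count, descending
        if len(counts) < 2:
            # Assumption: a single distinct value counts as an unbounded ratio
            drops.append(c)
            continue
        ratio = counts.iloc[0] / counts.iloc[1]
        if ratio >= threshold:
            drops.append(c)
    return drops


X = pd.DataFrame({'a': [1, 4, 6, 8], 'b': [2, 5, 7, 9], 'c': [3, 3, 3, 3]})
low_var = low_variance_columns(X, threshold=0.05)
print(low_var)  # ['c']

Z = pd.DataFrame({'d': [0] * 20 + [1], 'e': [0] * 10 + [1] * 11})
high_ratio = high_ratio_columns(Z, threshold=20.0)
print(high_ratio)  # ['d']
```

Column 'd' has a 20:1 frequency ratio and is flagged, while 'e' has an 11:10 ratio and is kept.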

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

class skutil.feature_selection.SparseFeatureDropper(cols=None, threshold=0.5, as_df=True)[source]

Bases: skutil.feature_selection.base._BaseFeatureSelector

Retains features that are less sparse (NaN) than the provided threshold. Useful in situations where matrices are too sparse to impute reliably.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.

threshold : float, optional (default=0.5)

The threshold of sparsity above which features will be deemed “too sparse” and will be dropped.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:

sparsity_ : array_like, shape=(n_features,)

The array of sparsity values, one per feature.

drop_ : array_like, shape=(n_features,)

Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import SparseFeatureDropper
>>>
>>> nan = np.nan
>>> X = np.array([
...     [1.0, 2.0, nan],
...     [2.0, 3.0, nan],
...     [3.0, nan, 1.0],
...     [4.0, 5.0, nan]
... ])
>>>
>>> X = pd.DataFrame.from_records(data=X, columns=['a','b','c'])
>>> dropper = SparseFeatureDropper(threshold=0.5)
>>> X_transform = dropper.fit_transform(X)
>>> assert X_transform.shape[1] == 2 # drop out last column
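
A rough pandas-only sketch of the sparsity check is shown below, assuming that sparsity means the fraction of NaN values in a column and that a column is dropped when that fraction is strictly above threshold. Both points are assumptions for illustration and may differ from the skutil implementation.

```python
import numpy as np
import pandas as pd

nan = np.nan
X = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 3.0, nan, 5.0],
    'c': [nan, nan, 1.0, nan],
})
threshold = 0.5

# Fraction of missing values per column
sparsity = X.isnull().mean()

# Columns whose sparsity exceeds the threshold
drops = sparsity[sparsity > threshold].index.tolist()
print(drops)  # ['c']
```

Column 'b' is 25% missing and is kept, while 'c' is 75% missing and is flagged.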

Methods

fit(X[, y]) Fit the transformer.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform a test matrix given the already-fit transformer.
fit(X, y=None)[source]

Fit the transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

skutil.feature_selection.filter_collinearity(c, threshold)[source]

Performs the collinearity filtration for both the MulticollinearityFilterer and the H2OMulticollinearityFilterer.

Parameters:

c : pandas DataFrame

The pre-computed correlation matrix. This is expected to be a square matrix, and will raise a ValueError if it’s not.

threshold : float

The threshold above which to filter features which are multicollinear in nature.

Returns:

drops : list (string), shape=(n_features,)

The features that should be dropped.

macor : list (float), shape=(n_features,)

The mean absolute correlations between the features.

crrz : list (_MCFTuple), shape=(n_features,)

The tuples containing the collinearity metrics for each pair of correlated features.
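
The drop decision can be sketched on a pre-computed correlation matrix as follows. This is a simplified illustration of the mean-absolute-correlation rule, not the function's actual code: the name sketch_filter, the exclusion of the diagonal from the mean, and the one-pair-at-a-time loop are all assumptions, and the sketch returns only two of the three documented outputs.

```python
import numpy as np
import pandas as pd


def sketch_filter(c, threshold):
    """Greedy sketch: for the most correlated pair, drop the feature with higher MAC."""
    if c.shape[0] != c.shape[1]:
        raise ValueError('c must be a square correlation matrix')
    names = list(c.columns)
    drops, macor = [], []
    while len(names) > 1:
        vals = c.loc[names, names].abs().to_numpy().copy()
        np.fill_diagonal(vals, 0.0)  # ignore self-correlation
        if vals.max() <= threshold:
            break
        i, j = np.unravel_index(vals.argmax(), vals.shape)
        # Mean absolute correlation of each remaining feature with the others
        mac = vals.sum(axis=1) / (len(names) - 1)
        worst = i if mac[i] >= mac[j] else j
        drops.append(names[worst])
        macor.append(float(mac[worst]))
        del names[worst]
    return drops, macor


labels = ['a', 'b', 'c']
c = pd.DataFrame(
    [[1.00, 0.95, 0.50],
     [0.95, 1.00, 0.10],
     [0.50, 0.10, 1.00]],
    index=labels, columns=labels,
)
drops, macor = sketch_filter(c, threshold=0.85)
print(drops)  # ['a']
```

Features 'a' and 'b' exceed the threshold; 'a' has the higher mean absolute correlation (0.725 versus 0.525), so it is the one dropped.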