skutil.feature_selection module
skutil.feature_selection provides transformers that accept an array of column names and drop the columns that are flagged for removal by the fit method of the _BaseFeatureSelector class. For example, the LinearCombinationFilterer class is used to remove linear combinations of features. All public classes within select.py extend the _BaseFeatureSelector class.
class skutil.feature_selection.FeatureDropper(cols=None, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector
A very simple class to be used at the beginning or at any other stage of a Pipeline that will drop the given features from the remainder of the pipe.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
    The features to drop. Note that FeatureDropper behaves slightly differently from all other _BaseFeatureSelector classes in the sense that it will drop all of the features prescribed in this parameter. However, if cols is None, it will not drop any (which is counter to other classes, which will operate on all columns in the absence of an explicit cols parameter).
as_df : bool, optional (default=True)
    Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
drop_ : array_like, shape=(n_features,)
    Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import FeatureDropper
>>>
>>> X = pd.DataFrame.from_records(data=np.random.rand(3,3), columns=['a','b','c'])
>>> dropper = FeatureDropper(cols=['a','b'])
>>> X_transform = dropper.fit_transform(X)
>>> assert X_transform.shape[1] == 1 # drop out first two columns
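The cols=None behaviour and the as_df switch described above can be illustrated without skutil itself. The following is a minimal sketch in plain pandas, not the library's implementation; the class name DropperSketch is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd


class DropperSketch(object):
    """Hypothetical stand-in mimicking the documented FeatureDropper contract."""

    def __init__(self, cols=None, as_df=True):
        self.cols = cols
        self.as_df = as_df

    def fit(self, X, y=None):
        # cols=None means "drop nothing", unlike the other selectors
        self.drop_ = list(self.cols) if self.cols is not None else []
        return self

    def transform(self, X):
        # DataFrame.drop returns a new frame, so X itself is left untouched
        dropped = X.drop(self.drop_, axis=1)
        return dropped if self.as_df else dropped.values

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
dropped = DropperSketch(cols=['a', 'b']).fit_transform(X)
untouched = DropperSketch().fit_transform(X)
as_array = DropperSketch(cols=['a'], as_df=False).fit_transform(X)
```

Here dropped keeps only column c, untouched keeps all three columns because cols was None, and as_array is a NumPy ndarray rather than a DataFrame.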
Methods
fit(X[, y])
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.
class skutil.feature_selection.FeatureRetainer(cols=None, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector
A very simple class to be used at the beginning of a Pipeline that will only propagate the given features throughout the remainder of the pipe.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
    The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
as_df : bool, optional (default=True)
    Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
drop_ : array_like, shape=(n_features,)
    Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import FeatureRetainer
>>>
>>> X = pd.DataFrame.from_records(data=np.random.rand(3,3), columns=['a','b','c'])
>>> retainer = FeatureRetainer(cols=['a','b'])
>>> X_transform = retainer.fit_transform(X)
>>> assert X_transform.shape[1] == 2 # retain first two columns
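Retaining is the mirror image of dropping. One plausible reading of the documented attributes is that drop_ holds every column not named in cols; the sketch below assumes that reading, and the helper retain_columns is hypothetical rather than part of skutil.

```python
import pandas as pd


def retain_columns(X, cols=None):
    """Return (retained_frame, drop_) for a hypothetical retainer."""
    # Assumption: with cols=None every column is kept
    keep = list(X.columns) if cols is None else list(cols)
    # Everything that is not retained ends up in drop_
    drop_ = [name for name in X.columns if name not in keep]
    # Copy so that later changes to the result never touch X
    return X[keep].copy(), drop_


X = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
                 columns=['a', 'b', 'c'])
retained, drop_ = retain_columns(X, cols=['a', 'b'])
```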
Methods
fit(X[, y])             Fit the transformer.
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.
fit(X, y=None)
Fit the transformer.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
    Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to transform. The prescribed drop_ columns will be dropped and a copy of X will be returned.
Returns:
dropped : Pandas DataFrame or np.ndarray, shape=(n_samples, n_features)
    The test data with the prescribed drop_ columns removed.
class skutil.feature_selection.LinearCombinationFilterer(cols=None, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector
The LinearCombinationFilterer will resolve linear combinations in a numeric matrix. The QR decomposition is used to determine whether the matrix is full rank, and then to identify the sets of columns that are involved in the dependencies. This class is adapted from the implementation in the R package caret.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
    The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Note that since this transformer can only operate on numeric columns, not explicitly setting the cols parameter may result in errors for categorical data.
as_df : bool, optional (default=True)
    Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
drop_ : array_like, shape=(n_features,)
    Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
Examples
>>> from skutil.utils import load_iris_df
>>> from skutil.feature_selection import LinearCombinationFilterer
>>>
>>> X = load_iris_df(include_tgt=False)
>>> filterer = LinearCombinationFilterer()
>>> X_transform = filterer.fit_transform(X)
>>> assert X_transform.shape[1] == 4 # no combos in iris...
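To see what a linear combination looks like in practice, the sketch below flags any column that fails to raise the rank of the columns kept so far. Note that this swaps in a simple rank-increment check built on numpy.linalg.matrix_rank instead of the QR decomposition the class is documented to use, so it illustrates the idea rather than the library's algorithm. The function name find_linear_combos and the tol default are assumptions.

```python
import numpy as np
import pandas as pd


def find_linear_combos(frame, tol=1e-8):
    """Return names of columns that are linear combinations of earlier ones."""
    kept, dropped = [], []
    rank = 0
    for name in frame.columns:
        candidate = frame[kept + [name]].values.astype(float)
        new_rank = np.linalg.matrix_rank(candidate, tol=tol)
        if new_rank > rank:
            # The column adds new information, so keep it
            kept.append(name)
            rank = new_rank
        else:
            # The column lies in the span of the kept columns
            dropped.append(name)
    return dropped


rng = np.random.RandomState(42)
X = pd.DataFrame(rng.rand(10, 2), columns=['a', 'b'])
X['c'] = X['a'] + 2 * X['b']  # exact linear combination of a and b
drops = find_linear_combos(X)
```

Because columns are scanned left to right, this sketch always blames the later column in a dependency; a pivoted QR approach may select a different member of the dependent set.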
Methods
fit(X[, y])             Fit the transformer.
fit_transform(X[, y])   Fit the transformer and return the transformed training array.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.
fit(X, y=None)
Fit the transformer.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
    Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
fit_transform(X, y=None)
Fit the transformer and return the transformed training array.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
    Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
The transformed training array (a Pandas DataFrame, or a NumPy ndarray if as_df is False).
class skutil.feature_selection.MulticollinearityFilterer(cols=None, threshold=0.85, method='pearson', as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector
Filter out features with a correlation greater than the provided threshold. When a pair of correlated features is identified, the mean absolute correlation (MAC) of each feature is considered, and the feature with the highest MAC is discarded.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
    The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
threshold : float, optional (default=0.85)
    The threshold above which to filter correlated features.
method : str, optional (default='pearson')
    The method used to compute the correlation, one of ['pearson', 'kendall', 'spearman'].
as_df : bool, optional (default=True)
    Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
drop_ : array_like, shape=(n_features,)
    Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
mean_abs_correlations_ : list, float
    The corresponding mean absolute correlations of each drop_ name.
correlations_ : list of _MCFTuple instances
    Contains detailed info on multicollinear columns.
Examples
The following demonstrates a simple multicollinearity filterer applied to the iris dataset.
>>> import pandas as pd
>>> from skutil.utils import load_iris_df
>>> from skutil.feature_selection import MulticollinearityFilterer
>>>
>>> X = load_iris_df(include_tgt=False)
>>> mcf = MulticollinearityFilterer(threshold=0.85)
>>> mcf.fit_transform(X).head()
   sepal length (cm)  sepal width (cm)  petal width (cm)
0                5.1               3.5               0.2
1                4.9               3.0               0.2
2                4.7               3.2               0.2
3                4.6               3.1               0.2
4                5.0               3.6               0.2
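The method names match those accepted by pandas' DataFrame.corr, so the effect of the choice can be previewed on a plain frame before picking a threshold. The sketch below uses synthetic data and makes no claim about how skutil computes correlations internally; only two of the three methods are shown.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = rng.rand(100)
X = pd.DataFrame({
    'x': x,
    'x_cubed': x ** 3,       # monotonic but non-linear in x
    'noise': rng.rand(100),  # unrelated to x
}, columns=['x', 'x_cubed', 'noise'])

# Absolute correlation matrices under two of the supported methods
abs_corr = dict((m, X.corr(method=m).abs()) for m in ('pearson', 'spearman'))

# A rank-based method scores a monotonic relationship higher than Pearson does
pearson_xy = abs_corr['pearson'].loc['x', 'x_cubed']
spearman_xy = abs_corr['spearman'].loc['x', 'x_cubed']
```

A pair of features can therefore fall on different sides of the same threshold depending on the method chosen.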
Methods
fit(X[, y])             Fit the multicollinearity filterer.
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.
fit(X, y=None)
Fit the multicollinearity filterer.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
    Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
class skutil.feature_selection.NearZeroVarianceFilterer(cols=None, threshold=1e-06, as_df=True, strategy='variance')
Bases: skutil.feature_selection.base._BaseFeatureSelector
Identify and remove any features that have a variance below a certain threshold. There are two possible strategies for near-zero variance feature selection:
- Select features on the basis of the actual variance they exhibit. This is only relevant when the features are real numbers.
- Remove features where the ratio of the frequency of the most prevalent value to that of the second-most frequent value is large, say 20 or above (Kuhn & Johnson [R3]).
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
    The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
threshold : float, optional (default=1e-6)
    The threshold below which to declare “zero variance”.
as_df : bool, optional (default=True)
    Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
strategy : str, optional (default='variance')
    The strategy by which feature selection should be performed, one of ('variance', 'ratio'). If strategy is 'variance', features will be selected based on the amount of variance they exhibit; those that are low-variance (below threshold) will be removed. If strategy is 'ratio', features are dropped if the most prevalent value is represented at a ratio greater than or equal to threshold to the second-most frequent value. Note that if strategy is 'ratio', threshold must be greater than 1.
Attributes:
drop_ : array_like, shape=(n_features,)
    Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
var_ : dict
    The dropped columns mapped to their corresponding variances or ratios, depending on the strategy.
References
[R3] Kuhn, M. & Johnson, K. “Applied Predictive Modeling” (2013). New York, NY: Springer.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from skutil.feature_selection import NearZeroVarianceFilterer
>>>
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1,2,3],
...                                 [4,5,3],
...                                 [6,7,3],
...                                 [8,9,3]]),
...                               columns=['a','b','c'])
>>> filterer = NearZeroVarianceFilterer(threshold=0.05)
>>> filterer.fit_transform(X)
   a  b
0  1  2
1  4  5
2  6  7
3  8  9
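The two strategies can be approximated in a few lines of pandas, which may help when choosing a threshold. This is a sketch under stated assumptions rather than skutil's implementation: the helper name near_zero_variance_columns is hypothetical, the variance is pandas' default sample variance, and a column with a single distinct value is treated as droppable under the 'ratio' strategy.

```python
import pandas as pd


def near_zero_variance_columns(frame, threshold, strategy='variance'):
    """Return the names of columns a near-zero variance filter would drop."""
    if strategy == 'variance':
        variances = frame.var()
        return variances[variances < threshold].index.tolist()

    if strategy == 'ratio':
        if threshold <= 1:
            raise ValueError('threshold must be greater than 1 for ratio')
        drops = []
        for name in frame.columns:
            # value_counts sorts by frequency, most prevalent value first
            counts = frame[name].value_counts()
            if len(counts) < 2:
                # Assumption: a constant column has no runner-up value
                drops.append(name)
                continue
            ratio = counts.iloc[0] / float(counts.iloc[1])
            if ratio >= threshold:
                drops.append(name)
        return drops

    raise ValueError("strategy must be 'variance' or 'ratio'")


X = pd.DataFrame([[1, 2, 3], [4, 5, 3], [6, 7, 3], [8, 9, 3]],
                 columns=['a', 'b', 'c'])
by_variance = near_zero_variance_columns(X, threshold=0.05)

Y = pd.DataFrame({'d': [0] * 21 + [1], 'e': list(range(22))},
                 columns=['d', 'e'])
by_ratio = near_zero_variance_columns(Y, threshold=20, strategy='ratio')
```

Column c is constant and so falls below the variance threshold, while column d has a 21-to-1 frequency ratio, which meets the ratio threshold of 20.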
Methods
fit(X[, y])             Fit the transformer.
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.
fit(X, y=None)
Fit the transformer.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
    Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
class skutil.feature_selection.SparseFeatureDropper(cols=None, threshold=0.5, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector
Retains features that are less sparse (NaN) than the provided threshold. Useful in situations where matrices are too sparse to impute reliably.
Parameters:
cols : array_like, shape=(n_features,), optional (default=None)
    The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
threshold : float, optional (default=0.5)
    The threshold of sparsity above which features will be deemed “too sparse” and will be dropped.
as_df : bool, optional (default=True)
    Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes:
sparsity_ : array_like, shape=(n_features,)
    The array of sparsity values.
drop_ : array_like, shape=(n_features,)
    Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import SparseFeatureDropper
>>>
>>> nan = np.nan
>>> X = np.array([
...     [1.0, 2.0, nan],
...     [2.0, 3.0, nan],
...     [3.0, nan, 1.0],
...     [4.0, 5.0, nan]
... ])
>>>
>>> X = pd.DataFrame.from_records(data=X, columns=['a','b','c'])
>>> dropper = SparseFeatureDropper(threshold=0.5)
>>> X_transform = dropper.fit_transform(X)
>>> assert X_transform.shape[1] == 2 # drop out last column
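The sparsity that drives this decision is simply the fraction of missing values per column, which can be checked by hand in pandas. The following sketch reproduces the example's outcome under the assumption that a column is dropped when its NaN fraction is strictly above the threshold; it does not use skutil.

```python
import numpy as np
import pandas as pd

nan = np.nan
X = pd.DataFrame([[1.0, 2.0, nan],
                  [2.0, 3.0, nan],
                  [3.0, nan, 1.0],
                  [4.0, 5.0, nan]], columns=['a', 'b', 'c'])

# Fraction of missing values in each column
sparsity = X.isnull().mean()

threshold = 0.5
# Assumption: strictly greater than the threshold means "too sparse"
drop_ = sparsity[sparsity > threshold].index.tolist()
X_transform = X.drop(drop_, axis=1)
```

Column c is missing in three of four rows (0.75), so it is the only column removed; column b is missing in one of four rows (0.25) and is kept.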
Methods
fit(X[, y])             Fit the transformer.
fit_transform(X[, y])   Fit to data, then transform it.
get_params([deep])      Get parameters for this estimator.
set_params(**params)    Set the parameters of this estimator.
transform(X)            Transform a test matrix given the already-fit transformer.
fit(X, y=None)
Fit the transformer.
Parameters:
X : Pandas DataFrame, shape=(n_samples, n_features)
    The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
    Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns:
self
skutil.feature_selection.filter_collinearity(c, threshold)
Performs the collinearity filtration for both the MulticollinearityFilterer and the H2OMulticollinearityFilterer.
Parameters:
c : pandas DataFrame
    The pre-computed correlation matrix. This is expected to be a square matrix, and will raise a ValueError if it’s not.
threshold : float
    The threshold above which to filter features which are multicollinear in nature.
Returns:
drops : list (string), shape=(n_features,)
    The features that should be dropped.
macor : list (float), shape=(n_features,)
    The mean absolute correlations between the features.
crrz : list (_MCFTuple), shape=(n_features,)
    The tuple containing all information on the collinearity metrics between each pairwise correlation.
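The mean-absolute-correlation rule can be sketched on a pre-computed correlation matrix as follows. This is an illustrative approximation, not the library's code: the name filter_collinearity_sketch is hypothetical, the diagonal is excluded from each feature's mean (an assumption), and only drops and their mean absolute correlations are returned, without the _MCFTuple details.

```python
import numpy as np
import pandas as pd


def filter_collinearity_sketch(c, threshold):
    """Return (drops, macor) for a square correlation matrix c."""
    if c.shape[0] != c.shape[1]:
        raise ValueError('c must be a square correlation matrix')

    names = list(c.columns)
    a = c.abs().values.astype(float)
    # Assumption: ignore each feature's correlation with itself
    np.fill_diagonal(a, np.nan)
    mac = np.nanmean(a, axis=1)

    drops, macor = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if names[i] in drops or names[j] in drops:
                continue
            if a[i, j] > threshold:
                # Discard the member of the pair with the higher MAC
                worse = i if mac[i] >= mac[j] else j
                drops.append(names[worse])
                macor.append(mac[worse])
    return drops, macor


rng = np.random.RandomState(1)
x = rng.rand(200)
frame = pd.DataFrame({
    'x': x,
    'y': x + 0.01 * rng.rand(200),  # nearly identical to x
    'z': rng.rand(200),             # independent of x
}, columns=['x', 'y', 'z'])
drops, macor = filter_collinearity_sketch(frame.corr(), threshold=0.85)
```

Exactly one of x and y is dropped, whichever has the higher mean absolute correlation with the remaining features, and z is left alone.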