skutil.feature_selection module
skutil.feature_selection provides transformers that take an array of column names, decide in the fit method which of those columns should be dropped, and remove them in the transform method. The LinearCombinationFilterer class, for example, removes features that are linear combinations of other features. All public classes within select.py extend the _BaseFeatureSelector class.

class skutil.feature_selection.FeatureDropper(cols=None, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector

A very simple class, usable at the beginning or at any later stage of a Pipeline, that drops the given features from the remainder of the pipe.

Parameters:
- cols : array_like, shape=(n_features,), optional (default=None)
  The features to drop. Note that FeatureDropper behaves slightly differently from all other _BaseFeatureSelector classes in that it drops every feature named in this parameter. If cols is None, it drops nothing, whereas the other classes operate on all columns when cols is not given explicitly.
- as_df : bool, optional (default=True)
  Whether to return a Pandas DataFrame in the transform method. If False, a Numpy ndarray is returned instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:
- drop_ : array_like, shape=(n_features,)
  Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import FeatureDropper
>>>
>>> X = pd.DataFrame.from_records(data=np.random.rand(3,3), columns=['a','b','c'])
>>> dropper = FeatureDropper(cols=['a','b'])
>>> X_transform = dropper.fit_transform(X)
>>> assert X_transform.shape[1] == 1 # drop out first two columns

Methods
- fit(X[, y])
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a test matrix given the already-fit transformer.

class skutil.feature_selection.FeatureRetainer(cols=None, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector

A very simple class to be used at the beginning of a Pipeline that will only propagate the given features throughout the remainder of the pipe.

Parameters:
- cols : array_like, shape=(n_features,), optional (default=None)
  The names of the columns to retain. If no column names are provided, the transformer will be fit on the entire frame. Only the specified columns are propagated: as the example below shows, columns that are not named in cols are absent from the transformed output.
- as_df : bool, optional (default=True)
  Whether to return a Pandas DataFrame in the transform method. If False, a Numpy ndarray is returned instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:
- drop_ : array_like, shape=(n_features,)
  Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import FeatureRetainer
>>>
>>> X = pd.DataFrame.from_records(data=np.random.rand(3,3), columns=['a','b','c'])
>>> retainer = FeatureRetainer(cols=['a','b'])
>>> X_transform = retainer.fit_transform(X)
>>> assert X_transform.shape[1] == 2 # retain first two columns

Methods
- fit(X[, y]): Fit the transformer.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a test matrix given the already-fit transformer.

fit(X, y=None)
Fit the transformer.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
- y : None
  Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:
- self :

transform(X)
Transform a test matrix given the already-fit transformer.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to transform. The prescribed drop_ columns will be dropped and a copy of X will be returned.

Returns:
- dropped : Pandas DataFrame or np.ndarray, shape=(n_samples, n_features)
  The test data with the prescribed drop_ columns removed.

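The different meaning of cols in FeatureDropper and FeatureRetainer can be sketched without skutil. The helper below, columns_to_drop, is hypothetical and only mirrors the behavior described above; in particular, treating cols=None as "keep everything" for the retainer is an assumption drawn from the parameter description, not from the library source.

```python
def columns_to_drop(all_columns, cols, mode):
    """Return the column names a selector would drop (hypothetical helper).

    mode="drop" mirrors the FeatureDropper description: drop exactly `cols`,
    and nothing when `cols` is None.
    mode="retain" mirrors the FeatureRetainer description: keep only `cols`;
    keeping everything when `cols` is None is an assumption.
    """
    if mode not in ("drop", "retain"):
        raise ValueError("mode must be 'drop' or 'retain'")
    if cols is None:
        return []
    wanted = set(cols)
    missing = wanted - set(all_columns)
    if missing:
        raise ValueError("unknown columns: %s" % sorted(missing))
    if mode == "drop":
        # Drop the named columns, preserving the frame's column order.
        return [c for c in all_columns if c in wanted]
    # Retain the named columns, so everything else is dropped.
    return [c for c in all_columns if c not in wanted]


header = ["a", "b", "c"]
dropped_by_dropper = columns_to_drop(header, ["a", "b"], "drop")
dropped_by_retainer = columns_to_drop(header, ["a", "b"], "retain")
```

With the same cols=['a','b'], the dropper-style call leaves one column and the retainer-style call leaves two, matching the shape assertions in the two examples above.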
class skutil.feature_selection.LinearCombinationFilterer(cols=None, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector

The LinearCombinationFilterer will resolve linear combinations in a numeric matrix. The QR decomposition is used to determine whether the matrix is full rank, and then to identify the sets of columns that are involved in the dependencies. This class is adapted from the implementation in the R package caret.

Parameters:
- cols : array_like, shape=(n_features,), optional (default=None)
  The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation. Because this transformer can only operate on numeric columns, leaving cols unset may result in errors when the frame contains categorical data.
- as_df : bool, optional (default=True)
  Whether to return a Pandas DataFrame in the transform method. If False, a Numpy ndarray is returned instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:
- drop_ : array_like, shape=(n_features,)
  Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> from skutil.utils import load_iris_df
>>> from skutil.feature_selection import LinearCombinationFilterer
>>>
>>> X = load_iris_df(include_tgt=False)
>>> filterer = LinearCombinationFilterer()
>>> X_transform = filterer.fit_transform(X)
>>> assert X_transform.shape[1] == 4 # no combos in iris...

Methods
- fit(X[, y]): Fit the transformer.
- fit_transform(X[, y]): Fit the transformer and return the transformed training array.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a test matrix given the already-fit transformer.

fit(X, y=None)
Fit the transformer.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
- y : None
  Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:
- self :

fit_transform(X, y=None)
Fit the transformer and return the transformed training array.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
- y : None
  Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:
- self :

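To make the idea of "columns involved in linear dependencies" concrete, here is a minimal, dependency-free sketch. It uses modified Gram-Schmidt orthogonalization, which is one way of building a QR factorization, and flags every column whose residual against the previously accepted columns is numerically zero. The function name dependent_columns and the tolerance tol are assumptions for illustration; this is not the caret or skutil implementation, which may pick different columns out of a dependent set.

```python
def dependent_columns(columns, tol=1e-10):
    """Flag columns that lie in the span of the columns accepted before them.

    `columns` maps a column name to a list of numbers; all lists must have
    the same length. Returns the flagged names in input order.
    """
    basis = []      # orthonormal vectors built from the independent columns
    dependent = []
    for name, values in columns.items():
        residual = [float(v) for v in values]
        scale = sum(v * v for v in residual) ** 0.5
        # Remove the component along each accepted direction in turn.
        for q in basis:
            proj = sum(r * b for r, b in zip(residual, q))
            residual = [r - proj * b for r, b in zip(residual, q)]
        norm = sum(r * r for r in residual) ** 0.5
        if norm <= tol * max(scale, 1.0):
            # Nothing is left: the column adds no new direction.
            dependent.append(name)
        else:
            basis.append([r / norm for r in residual])
    return dependent


frame = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 0.0, 1.0],
    "c": [5.0, 4.0, 3.0, 6.0],   # c = a + 2 * b
    "d": [0.0, 1.0, 0.0, 0.0],
}
flagged = dependent_columns(frame)
```

Dependent columns are skipped rather than added to the basis, so a later independent column is never measured against an arbitrary direction.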
class skutil.feature_selection.MulticollinearityFilterer(cols=None, threshold=0.85, method='pearson', as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector

Filter out features with a correlation greater than the provided threshold. When a pair of correlated features is identified, the mean absolute correlation (MAC) of each feature is considered, and the feature with the highest MAC is discarded.

Parameters:
- cols : array_like, shape=(n_features,), optional (default=None)
  The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
- threshold : float, optional (default=0.85)
  The threshold above which to filter correlated features.
- method : str, optional (default='pearson')
  The method used to compute the correlation, one of ['pearson', 'kendall', 'spearman'].
- as_df : bool, optional (default=True)
  Whether to return a Pandas DataFrame in the transform method. If False, a Numpy ndarray is returned instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:
- drop_ : array_like, shape=(n_features,)
  Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
- mean_abs_correlations_ : list, float
  The corresponding mean absolute correlations of each drop_ name.
- correlations_ : list of _MCFTuple instances
  Contains detailed info on multicollinear columns.

Examples

The following demonstrates a simple multicollinearity filterer applied to the iris dataset.

>>> import pandas as pd
>>> from skutil.utils import load_iris_df
>>> from skutil.feature_selection import MulticollinearityFilterer
>>>
>>> X = load_iris_df(include_tgt=False)
>>> mcf = MulticollinearityFilterer(threshold=0.85)
>>> mcf.fit_transform(X).head()
   sepal length (cm)  sepal width (cm)  petal width (cm)
0                5.1               3.5               0.2
1                4.9               3.0               0.2
2                4.7               3.2               0.2
3                4.6               3.1               0.2
4                5.0               3.6               0.2

Methods
- fit(X[, y]): Fit the multicollinearity filterer.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a test matrix given the already-fit transformer.

fit(X, y=None)
Fit the multicollinearity filterer.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
- y : None
  Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:
- self :

class skutil.feature_selection.NearZeroVarianceFilterer(cols=None, threshold=1e-06, as_df=True, strategy='variance')
Bases: skutil.feature_selection.base._BaseFeatureSelector

Identify and remove any features that have a variance below a certain threshold. There are two possible strategies for near-zero variance feature selection:

1. Select features on the basis of the actual variance they exhibit. This is only relevant when the features are real numbers.
2. Remove features where the ratio of the frequency of the most prevalent value to that of the second-most frequent value is large, say 20 or above (Kuhn & Johnson [R3]).

Parameters:
- cols : array_like, shape=(n_features,), optional (default=None)
  The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
- threshold : float, optional (default=1e-6)
  The threshold below which to declare “zero variance”.
- as_df : bool, optional (default=True)
  Whether to return a Pandas DataFrame in the transform method. If False, a Numpy ndarray is returned instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
- strategy : str, optional (default='variance')
  The strategy by which feature selection should be performed, one of ('variance', 'ratio'). If strategy is 'variance', features will be selected based on the amount of variance they exhibit; those that are low-variance (below threshold) will be removed. If strategy is 'ratio', features are dropped if the ratio of the frequency of the most prevalent value to that of the second-most frequent value is greater than or equal to threshold. Note that if strategy is 'ratio', threshold must be greater than 1.

Attributes:
- drop_ : array_like, shape=(n_features,)
  Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.
- var_ : dict
  The dropped columns mapped to their corresponding variances or ratios, depending on the strategy.

References
[R3] Kuhn, M. & Johnson, K. “Applied Predictive Modeling” (2013). New York, NY: Springer.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from skutil.feature_selection import NearZeroVarianceFilterer
>>>
>>> X = pd.DataFrame.from_records(data=np.array([
...                                 [1,2,3],
...                                 [4,5,3],
...                                 [6,7,3],
...                                 [8,9,3]]),
...                               columns=['a','b','c'])
>>> filterer = NearZeroVarianceFilterer(threshold=0.05)
>>> filterer.fit_transform(X)
   a  b
0  1  2
1  4  5
2  6  7
3  8  9

Methods
- fit(X[, y]): Fit the transformer.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a test matrix given the already-fit transformer.

fit(X, y=None)
Fit the transformer.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
- y : None
  Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:
- self :

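The two near-zero variance strategies can be illustrated with a short standard-library sketch. The function near_zero_variance is hypothetical and is not the skutil implementation. Two details are assumptions: the population variance is used (the library may use the sample variance instead), and a column with a single distinct value is given an infinite ratio.

```python
from collections import Counter
from statistics import pvariance


def near_zero_variance(columns, threshold, strategy="variance"):
    """Map each column judged near-zero variance to its variance or ratio.

    `columns` maps a column name to a list of values.
    """
    flagged = {}
    for name, values in columns.items():
        if strategy == "variance":
            # Strategy 1: the variance itself, flagged when below threshold.
            score = pvariance(values)
            is_bad = score < threshold
        elif strategy == "ratio":
            # Strategy 2: frequency of the most common value divided by the
            # frequency of the second most common value.
            counts = [n for _, n in Counter(values).most_common(2)]
            score = counts[0] / counts[1] if len(counts) == 2 else float("inf")
            is_bad = score >= threshold
        else:
            raise ValueError("strategy must be 'variance' or 'ratio'")
        if is_bad:
            flagged[name] = score
    return flagged


numeric = {"a": [1, 4, 6, 8], "b": [2, 5, 7, 9], "c": [3, 3, 3, 3]}
by_variance = near_zero_variance(numeric, threshold=0.05)

counts = {"x": [0] * 20 + [1], "y": [0, 1] * 10}
by_ratio = near_zero_variance(counts, threshold=20, strategy="ratio")
```

The first call flags only the constant column c, mirroring the example above. The second flags x, whose most common value occurs 20 times for every occurrence of the runner-up.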
class skutil.feature_selection.SparseFeatureDropper(cols=None, threshold=0.5, as_df=True)
Bases: skutil.feature_selection.base._BaseFeatureSelector

Retains features that are less sparse (NaN) than the provided threshold. Useful in situations where matrices are too sparse to impute reliably.

Parameters:
- cols : array_like, shape=(n_features,), optional (default=None)
  The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
- threshold : float, optional (default=0.5)
  The threshold of sparsity above which features will be deemed “too sparse” and will be dropped.
- as_df : bool, optional (default=True)
  Whether to return a Pandas DataFrame in the transform method. If False, a Numpy ndarray is returned instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

Attributes:
- sparsity_ : array_like, shape=(n_features,)
  The array of sparsity values.
- drop_ : array_like, shape=(n_features,)
  Assigned after calling fit. These are the features that are designated as “bad” and will be dropped in the transform method.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from skutil.feature_selection import SparseFeatureDropper
>>>
>>> nan = np.nan
>>> X = np.array([
...     [1.0, 2.0, nan],
...     [2.0, 3.0, nan],
...     [3.0, nan, 1.0],
...     [4.0, 5.0, nan]
... ])
>>>
>>> X = pd.DataFrame.from_records(data=X, columns=['a','b','c'])
>>> dropper = SparseFeatureDropper(threshold=0.5)
>>> X_transform = dropper.fit_transform(X)
>>> assert X_transform.shape[1] == 2 # drop out last column

Methods
- fit(X[, y]): Fit the transformer.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a test matrix given the already-fit transformer.

fit(X, y=None)
Fit the transformer.

Parameters:
- X : Pandas DataFrame, shape=(n_samples, n_features)
  The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
- y : None
  Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:
- self :

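The sparsity check is simple enough to sketch with the standard library. The helper sparse_columns below is hypothetical, not skutil code: it assumes that sparsity means the fraction of NaN entries in a column and that a column is dropped only when that fraction is strictly above the threshold.

```python
import math


def sparse_columns(columns, threshold=0.5):
    """Return (sparsity per column, names of columns deemed too sparse).

    `columns` maps a column name to a list of floats, with NaN marking
    a missing value.
    """
    sparsity = {}
    for name, values in columns.items():
        missing = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
        sparsity[name] = missing / len(values)
    # "Above" the threshold is read as a strict comparison (an assumption).
    drop = [name for name, frac in sparsity.items() if frac > threshold]
    return sparsity, drop


nan = float("nan")
frame = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 3.0, nan, 5.0],
    "c": [nan, nan, 1.0, nan],
}
sparsity, drop = sparse_columns(frame, threshold=0.5)
```

On the same data as the SparseFeatureDropper example, column b is 25% missing and is kept, while column c is 75% missing and is dropped.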
skutil.feature_selection.filter_collinearity(c, threshold)
Performs the collinearity filtration for both the MulticollinearityFilterer and the H2OMulticollinearityFilterer.

Parameters:
- c : pandas DataFrame
  The pre-computed correlation matrix. This is expected to be a square matrix, and a ValueError is raised if it is not.
- threshold : float
  The threshold above which to filter features which are multicollinear in nature.

Returns:
- drops : list (string), shape=(n_features,)
  The features that should be dropped.
- macor : list (float), shape=(n_features,)
  The mean absolute correlations between the features.
- crrz : list (_MCFTuple), shape=(n_features,)
  The tuple containing all information on the collinearity metrics between each pairwise correlation.
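
The filtration can be sketched in plain Python under a few stated assumptions. The function sketch_filter_collinearity is hypothetical and takes a list of names plus a nested-list correlation matrix rather than a DataFrame. It handles the most correlated pair first and computes each mean absolute correlation over the features that still remain; the order and details of the real skutil routine may differ, and the _MCFTuple output is not reproduced. The correlation values are made up for illustration.

```python
def sketch_filter_collinearity(names, corr, threshold):
    """Return (dropped names, their mean absolute correlations)."""
    n = len(names)
    if len(corr) != n or any(len(row) != n for row in corr):
        raise ValueError("the correlation matrix must be square")

    remaining = list(range(n))
    drops, macs = [], []

    def mean_abs_corr(k):
        # Average |correlation| of feature k with the other remaining features.
        others = [abs(corr[k][m]) for m in remaining if m != k]
        return sum(others) / len(others)

    while len(remaining) > 1:
        # Find the most strongly correlated pair among the remaining features.
        pairs = [(abs(corr[i][j]), i, j)
                 for pos, i in enumerate(remaining)
                 for j in remaining[pos + 1:]]
        value, i, j = max(pairs)
        if value <= threshold:
            break
        # Discard whichever member of the pair has the higher MAC.
        mac_i, mac_j = mean_abs_corr(i), mean_abs_corr(j)
        loser, loser_mac = (i, mac_i) if mac_i >= mac_j else (j, mac_j)
        remaining.remove(loser)
        drops.append(names[loser])
        macs.append(loser_mac)
    return drops, macs


names = ["a", "b", "c"]
corr = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
drops, macs = sketch_filter_collinearity(names, corr, threshold=0.85)
```

Here a and b exceed the threshold. Feature a has the higher mean absolute correlation (0.7 against 0.5 for b), so a is the one discarded.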