skutil.decomposition module
skutil.decomposition provides sklearn decompositions (PCA, TruncatedSVD) within the skutil API, i.e., allowing such transformers to operate on a select subset of columns rather than the entire matrix.
-
class skutil.decomposition.SelectivePCA(cols=None, n_components=None, whiten=False, weight=False, as_df=True)
Bases: skutil.decomposition.decompose._BaseSelectiveDecomposer
A class that will apply PCA only to a select group of columns. Useful for data that may contain a mix of columns that we do and don’t want to decompose.
Parameters: cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
n_components : int, float, None or string, optional (default=None)
The number of components to keep, per sklearn:
- if n_components is not set, all components are kept: n_components == min(n_samples, n_features)
- if n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension.
- if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
- n_components cannot be equal to n_features for svd_solver == ‘arpack’.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
whiten : bool, optional (default False)
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
weight : bool, optional (default False)
When True (False by default) the explained_variance_ vector is used to weight the features post-transformation. This is especially useful in clustering contexts, where features are all implicitly assigned the same importance, even though PCA by nature orders the components by importance (i.e., not all components are created equal). When True, weighting subtracts the median variance from the weighting vector and adds one (so as not to downsample or upsample everything), then multiplies the weights across the transformed features.
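As a rough illustration of the weighting rule described for the weight parameter, the sketch below reproduces the arithmetic in plain Python. The helper names (variance_weights, apply_weights) and all numbers are hypothetical and are not part of skutil. The sketch assumes the weighting vector is the explained-variance vector, as the description states.

```python
from statistics import median


def variance_weights(explained_variance):
    """Shift the variance vector so that the median component gets weight 1."""
    mid = median(explained_variance)
    return [v - mid + 1.0 for v in explained_variance]


def apply_weights(rows, weights):
    """Multiply every transformed column by its component weight."""
    return [[value * w for value, w in zip(row, weights)] for row in rows]


# Hypothetical explained variances for three components (assumption).
explained_variance = [2.0, 0.5, 0.25]
weights = variance_weights(explained_variance)  # [2.5, 1.0, 0.75]

# Hypothetical transformed samples, one row per sample.
transformed = [[1.0, 2.0, 4.0], [-2.0, 0.0, 8.0]]
weighted = apply_weights(transformed, weights)
```

Components above the median variance are stretched and those below it are shrunk, so distance-based methods such as clustering give the leading components more influence.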
Attributes: pca_ : the PCA object
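The whitening step described for the whiten parameter can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the skutil or sklearn implementation: the function name pca_whiten and the random data are hypothetical, and the scaling follows the wording above (square root of n_samples divided by the singular values), which yields unit population variance (ddof=0).

```python
import numpy as np


def pca_whiten(X, n_components):
    """Project centered data onto whitened principal axes."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # principal axes, one per row
    # Multiply each axis by sqrt(n_samples) and divide by its singular value.
    scale = np.sqrt(n_samples) / S[:n_components]
    whitened_axes = components * scale[:, None]
    return Xc @ whitened_axes.T


# Hypothetical correlated data, used only for illustration.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

Z = pca_whiten(X, n_components=2)
cov = np.cov(Z, rowvar=False, bias=True)  # close to the 2x2 identity matrix
```

The covariance of the output is approximately the identity matrix: the components are uncorrelated and each has unit variance, which is also why the relative variance scales are lost.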
Examples
>>> from skutil.decomposition import SelectivePCA
>>> from skutil.utils import load_iris_df
>>>
>>> X = load_iris_df(include_tgt=False)
>>> pca = SelectivePCA(n_components=2)
>>> X_transform = pca.fit_transform(X)  # PCA suffers sign indeterminacy and results will vary
>>> assert X_transform.shape[1] == 2
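The example above fixes n_components at 2. When a fraction between 0 and 1 is given instead, the parameter description says enough components are kept for the explained variance to exceed that fraction. The sketch below illustrates that selection rule in plain Python. The function name components_for_variance and the ratios are hypothetical, and this is not sklearn's code.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Return the smallest k whose cumulative ratio is greater than threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    running_totals = accumulate(explained_variance_ratio)
    for k, total in enumerate(running_totals, start=1):
        if total > threshold:
            return k
    # Fall back to all components if rounding keeps the total below threshold.
    return len(explained_variance_ratio)


# Hypothetical explained-variance ratios, sorted in descending order.
ratios = [0.5, 0.25, 0.125, 0.125]

k_70 = components_for_variance(ratios, 0.70)  # cumulative 0.5, 0.75 -> 2
k_75 = components_for_variance(ratios, 0.75)  # 0.75 is not greater than 0.75 -> 3
k_90 = components_for_variance(ratios, 0.90)  # needs all four components -> 4
```

The comparison is strict: a cumulative ratio that merely equals the threshold does not satisfy the rule, so one more component is kept.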
Methods
fit(X[, y])              Fit the transformer.
fit_transform(X[, y])    Fit to data, then transform it.
get_decomposition()      Overridden from the skutil.decomposition.decompose._BaseSelectiveDecomposer class; returns the internal decomposition class.
get_params([deep])       Get parameters for this estimator.
inverse_transform(X)     Given a transformed dataframe, invert the transformation.
score(X[, y])            Return the average log-likelihood of all samples.
set_params(**params)     Set the parameters of this estimator.
transform(X)             Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)
Fit the transformer.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
get_decomposition()
Overridden from the skutil.decomposition.decompose._BaseSelectiveDecomposer class, this method returns the internal decomposition class: sklearn.decomposition.PCA
Returns: self.pca_ : sklearn.decomposition.PCA
The fit internal decomposition class
-
score(X, y=None)
Return the average log-likelihood of all samples. This calls sklearn.decomposition.PCA’s score method on the specified columns [R1].
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The data to score.
y : None
Passthrough for pipeline/gridsearch
Returns: ll : float
Average log-likelihood of the samples under the fit PCA model (self.pca_)
References
[R1] Bishop, C. “Pattern Recognition and Machine Learning” 12.2.1 p. 574 http://www.miketipping.com/papers/met-mppca.pdf
-
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : Pandas DataFrame
The operation is applied to a copy of X, and the result set is returned.
-
class skutil.decomposition.SelectiveTruncatedSVD(cols=None, n_components=2, algorithm='randomized', n_iter=5, as_df=True)
Bases: skutil.decomposition.decompose._BaseSelectiveDecomposer
A class that will apply truncated SVD (LSA) only to a select group of columns. Useful for data that contains categorical features that have not yet been dummied, or for dummied features we don’t want decomposed. TruncatedSVD is the equivalent of Latent Semantic Analysis, and returns the “concept space” of the decomposed features.
Parameters: cols : array_like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.
n_components : int, (default=2)
Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.
algorithm : string, (default="randomized")
SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).
n_iter : int, optional (default=5)
Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
Attributes: svd_ : the SVD object
Examples
>>> from skutil.decomposition import SelectiveTruncatedSVD
>>> from skutil.utils import load_iris_df
>>>
>>> X = load_iris_df(include_tgt=False)
>>> svd = SelectiveTruncatedSVD(n_components=2)
>>> X_transform = svd.fit_transform(X)  # SVD suffers sign indeterminacy and results will vary
>>> assert X_transform.shape[1] == 2
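To make the "select group of columns" behaviour concrete without depending on skutil, the sketch below decomposes only the named columns and passes the remaining ones through unchanged. It is an illustration under assumptions, not the library's implementation: the function selective_truncated_svd, the dict-of-columns input and the output names Concept1, Concept2 are hypothetical, and a full NumPy SVD followed by truncation stands in for the randomized and ARPACK solvers.

```python
import numpy as np


def selective_truncated_svd(frame, cols, n_components=2):
    """Decompose only `cols`; every other column is carried over as-is.

    `frame` is a dict mapping column names to equal-length sequences.
    """
    if n_components >= len(cols):
        raise ValueError(
            "n_components must be strictly less than the number of selected columns"
        )
    block = np.column_stack([np.asarray(frame[c], dtype=float) for c in cols])
    # Truncated SVD works on the raw block; unlike PCA, it does not center it.
    _, _, Vt = np.linalg.svd(block, full_matrices=False)
    reduced = block @ Vt[:n_components].T
    # Hypothetical naming scheme for the new columns (assumption).
    out = {"Concept%d" % (i + 1): reduced[:, i] for i in range(n_components)}
    for name, values in frame.items():
        if name not in cols:
            out[name] = np.asarray(values)  # untouched pass-through column
    return out


# Hypothetical frame with three numeric columns and one categorical column.
frame = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 0.0, 1.0],
    "c": [0.5, 0.5, 1.5, 2.5],
    "label": ["x", "y", "x", "y"],
}
result = selective_truncated_svd(frame, cols=["a", "b", "c"], n_components=2)
```

The categorical label column never reaches the solver, which is the use case the class description mentions for features that have not yet been dummied.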
Methods
fit(X[, y])              Fit the transformer.
fit_transform(X[, y])    Fit to data, then transform it.
get_decomposition()      Overridden from the skutil.decomposition.decompose._BaseSelectiveDecomposer class; returns the internal decomposition class.
get_params([deep])       Get parameters for this estimator.
inverse_transform(X)     Given a transformed dataframe, invert the transformation.
set_params(**params)     Set the parameters of this estimator.
transform(X)             Transform a test matrix given the already-fit transformer.
-
fit(X, y=None)
Fit the transformer.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.
y : None
Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.
Returns: self
-
get_decomposition()
Overridden from the skutil.decomposition.decompose._BaseSelectiveDecomposer class, this method returns the internal decomposition class: sklearn.decomposition.TruncatedSVD
Returns: self.svd_ : sklearn.decomposition.TruncatedSVD
The fit internal decomposition class
-
transform(X)
Transform a test matrix given the already-fit transformer.
Parameters: X : Pandas DataFrame, shape=(n_samples, n_features)
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : Pandas DataFrame, shape=(n_samples, n_features)
The operation is applied to a copy of X, and the result set is returned.
-