skutil.decomposition module

skutil.decomposition wraps sklearn decompositions (PCA, TruncatedSVD) in the skutil API so that these transformers can operate on a selected subset of columns rather than on the entire matrix.

class skutil.decomposition.SelectivePCA(cols=None, n_components=None, whiten=False, weight=False, as_df=True)

Bases: skutil.decomposition.decompose._BaseSelectiveDecomposer

A class that will apply PCA only to a select group of columns. Useful for data that may contain a mix of columns that we do and don’t want to decompose.
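The idea of decomposing only some columns while passing the rest through can be approximated with plain pandas and scikit-learn. This is a minimal sketch, not skutil's implementation: the helper name pca_on_columns and the PC1, PC2 output column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_on_columns(frame, cols, n_components):
    """Hypothetical helper: decompose only `cols`, keep the other columns."""
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(frame[cols])
    # Name the new columns PC1, PC2, ... (an illustrative convention only).
    names = ["PC%d" % (i + 1) for i in range(scores.shape[1])]
    decomposed = pd.DataFrame(scores, columns=names, index=frame.index)
    # Columns that were not selected pass through unchanged.
    untouched = frame.drop(columns=cols)
    return pd.concat([decomposed, untouched], axis=1)


rng = np.random.RandomState(0)
frame = pd.DataFrame(rng.rand(20, 4), columns=["a", "b", "c", "d"])
result = pca_on_columns(frame, cols=["a", "b", "c"], n_components=2)
```

Here three of the four columns are reduced to two components, and column d is carried over untouched.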

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.

n_components : int, float, None or string, optional (default=None)

The number of components to keep, per sklearn:

If n_components is not set, all components are kept: n_components == min(n_samples, n_features).

If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension.

If 0 < n_components < 1 and svd_solver == ‘full’, the number of components is selected such that the amount of variance explained is greater than the fraction specified by n_components.

n_components cannot be equal to n_features for svd_solver == ‘arpack’.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.

whiten : bool, optional (default=False)

When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

weight : bool, optional (default=False)

When True (False by default) the explained_variance_ vector is used to weight the features post-transformation. This is especially useful in clustering contexts, where features are all implicitly assigned the same importance, even though PCA by nature orders the features by importance (i.e., not all components are created equal). When True, weighting will subtract the median variance from the weighting vector and add one (so that the features are not all scaled down or all scaled up), then multiply the weights across the transformed features.
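The weighting rule described for weight can be sketched with NumPy alone. This follows the textual description only and has not been checked against skutil's source; the variance values and array names are made up for illustration.

```python
import numpy as np

# Stand-ins for a fitted PCA's outputs (hypothetical values).
explained_variance = np.array([2.0, 1.0, 0.5])
transformed = np.ones((2, 3))

# Subtract the median and add one, so the median component keeps weight 1.
weights = explained_variance - np.median(explained_variance) + 1.0

# Broadcast the per-component weights across every row.
weighted = transformed * weights
```

Components with above-median variance are scaled up, those below the median are scaled down, and the median component is left as it was.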

Attributes:

pca_ : the PCA object

Examples

>>> from skutil.decomposition import SelectivePCA
>>> from skutil.utils import load_iris_df
>>>
>>> X = load_iris_df(include_tgt=False)
>>> pca = SelectivePCA(n_components=2)
>>> X_transform = pca.fit_transform(X) # pca suffers sign indeterminacy and results will vary
>>> assert X_transform.shape[1] == 2
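The fractional n_components and whiten options documented under Parameters come from sklearn.decomposition.PCA, so their effect can be checked on that class directly. This is a minimal sketch on synthetic data; the data and variable names are illustrative assumptions, and SelectivePCA itself is not exercised here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# 200 samples, 5 features whose scales differ by orders of magnitude.
X = rng.randn(200, 5) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# A float in (0, 1) with svd_solver="full" keeps just enough components
# for the cumulative explained variance ratio to exceed that fraction.
partial = PCA(n_components=0.95, svd_solver="full").fit(X)

# With n_components unset, every component is kept.
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# whiten=True rescales every retained component to unit sample variance.
white = PCA(n_components=2, whiten=True).fit_transform(X)
white_var = white.var(axis=0, ddof=1)
```

Comparing partial.n_components_ against cumulative shows where the 0.95 threshold is crossed, and white_var should be close to one for each retained component.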

Methods

`fit`(X[, y])	Fit the transformer.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_decomposition`()	Overridden from skutil.decomposition.decompose._BaseSelectiveDecomposer; returns the internal decomposition class.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(X)	Given a transformed dataframe, inverse the transformation.
`score`(X[, y])	Return the average log-likelihood of all samples.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Transform a test matrix given the already-fit transformer.

fit(X, y=None)

Fit the transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

get_decomposition()

Overridden from the skutil.decomposition.decompose._BaseSelectiveDecomposer class, this method returns the internal decomposition class: sklearn.decomposition.PCA

Returns:

self.pca_ : sklearn.decomposition.PCA

The fit internal decomposition class

score(X, y=None)

Return the average log-likelihood of all samples. This calls sklearn.decomposition.PCA’s score method on the specified columns [R1].

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The data to score.

y : None

Passthrough for pipeline/gridsearch.

Returns:

ll : float

Average log-likelihood of the samples under the fit PCA model (self.pca_)

References

[R1]	Bishop, C. “Pattern Recognition and Machine Learning” 12.2.1 p. 574 http://www.miketipping.com/papers/met-mppca.pdf
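Since score delegates to sklearn.decomposition.PCA, the "average log-likelihood" can be illustrated on that class directly. This is a minimal sketch on random data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(50, 4)

pca = PCA(n_components=2).fit(X)

# score() returns a single float: the mean of the per-sample
# log-likelihoods that score_samples() returns.
average_ll = pca.score(X)
per_sample_ll = pca.score_samples(X)
```

The scalar from score should equal per_sample_ll.mean() up to floating-point error.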

transform(X)

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame

The operation is applied to a copy of X, and the result set is returned.

class skutil.decomposition.SelectiveTruncatedSVD(cols=None, n_components=2, algorithm='randomized', n_iter=5, as_df=True)

Bases: skutil.decomposition.decompose._BaseSelectiveDecomposer

A class that will apply truncated SVD (LSA) only to a select group of columns. Useful for data that contains categorical features that have not yet been dummied, or for dummied features we don’t want decomposed. TruncatedSVD is the equivalent of Latent Semantic Analysis, and returns the “concept space” of the decomposed features.

Parameters:

cols : array_like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If no column names are provided, the transformer will be fit on the entire frame. Note that the transformation will also only apply to the specified columns, and any other non-specified columns will still be present after transformation.

n_components : int, optional (default=2)

Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.

algorithm : string, optional (default=“randomized”)

SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).

n_iter : int, optional (default=5)

Number of iterations for the randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead. Since most skutil transformers depend on explicitly-named DataFrame features, the as_df parameter is True by default.
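The algorithm and n_iter options are passed-through concepts from sklearn.decomposition.TruncatedSVD, so the two solvers can be compared on that class. This is a minimal sketch on random data; the data, the random_state value and the variable names are illustrative assumptions, and SelectiveTruncatedSVD itself is not exercised.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(30, 6)

# Randomized solver: n_iter controls the number of power iterations.
randomized = TruncatedSVD(n_components=2, algorithm="randomized",
                          n_iter=5, random_state=0)

# ARPACK solver: n_iter is ignored; n_components must be < n_features.
arpack = TruncatedSVD(n_components=2, algorithm="arpack")

Z_randomized = randomized.fit_transform(X)
Z_arpack = arpack.fit_transform(X)
```

Both solvers return an array with n_components columns; individual columns may differ in sign between the two.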

Attributes:

svd_ : the SVD object

Examples

>>> from skutil.decomposition import SelectiveTruncatedSVD
>>> from skutil.utils import load_iris_df
>>>
>>> X = load_iris_df(include_tgt=False)
>>> svd = SelectiveTruncatedSVD(n_components=2)
>>> X_transform = svd.fit_transform(X) # svd suffers sign indeterminacy and results will vary
>>> assert X_transform.shape[1] == 2

Methods

`fit`(X[, y])	Fit the transformer.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_decomposition`()	Overridden from skutil.decomposition.decompose._BaseSelectiveDecomposer; returns the internal decomposition class.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(X)	Given a transformed dataframe, inverse the transformation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Transform a test matrix given the already-fit transformer.

fit(X, y=None)

Fit the transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None. Furthermore, X will not be altered in the process of the fit.

y : None

Passthrough for sklearn.pipeline.Pipeline. Even if explicitly set, will not change behavior of fit.

Returns:

self :

get_decomposition()

Overridden from the skutil.decomposition.decompose._BaseSelectiveDecomposer class, this method returns the internal decomposition class: sklearn.decomposition.TruncatedSVD

Returns:

self.svd_ : sklearn.decomposition.TruncatedSVD

The fit internal decomposition class

transform(X)

Transform a test matrix given the already-fit transformer.

Parameters:

X : Pandas DataFrame, shape=(n_samples, n_features)

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : Pandas DataFrame, shape=(n_samples, n_features)

The operation is applied to a copy of X, and the result set is returned.