skoot.feature_extraction.TimeDeltaFeatures
class skoot.feature_extraction.TimeDeltaFeatures(cols=None, as_df=True, units='days', sep='_', astype=<class 'float'>, absolute_difference=False, name_suffix='delta')
Compute the time lapse between timestamp events.
A transformer to compute time deltas between different date features. This can be useful, for instance, when the target is temporally sensitive to the lapse in time between certain events.
This class combinatorially calculates the deltas between features, adding \({N \choose 2}\) new columns, where \(N\) is the number of features included in cols. Note that the prescribed column order matters in this transformer, as deltas are computed from left to right:
['a', 'b', 'c'] -> ['a_b_delta', 'a_c_delta', 'b_c_delta']
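The left-to-right pairing described above can be sketched with the standard library. This is an illustration only, not the library's implementation; `delta_feature_names` is a hypothetical helper, and its `sep` and `name_suffix` arguments simply mirror the documented parameters.

```python
from itertools import combinations
from math import comb


def delta_feature_names(cols, sep="_", name_suffix="delta"):
    """Build the names of all pairwise delta features, left to right.

    Hypothetical helper for illustration; not part of skoot.
    """
    return [
        sep.join((left, right, name_suffix))
        for left, right in combinations(cols, 2)
    ]


names = delta_feature_names(["a", "b", "c"])
print(names)  # ['a_b_delta', 'a_c_delta', 'b_c_delta']

# The number of new features is N choose 2.
assert len(names) == comb(3, 2)
```

Because `combinations` preserves input order, reordering `cols` changes which column appears on the left of each pair.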
Parameters: cols : array-like, shape=(n_features,), optional (default=None)
The names of the columns on which to apply the transformation. Will apply to all columns if None specified. Note that in this class, the columns applied-to must be DateTime types or this will raise a ValueError.
as_df : bool, optional (default=True)
Whether to return a Pandas DataFrame in the transform method. If False, will return a NumPy ndarray instead.
units : str or unicode, optional (default='days')
The unit of time to compute between events. One of ('seconds', 'minutes', 'hours', 'days').
sep : str or unicode, optional (default='_')
The separator between the new feature names. The names will be in the form of:
<left><sep><right><sep><suffix>
For example, for columns 'a' and 'b', sep="_", and name_suffix="delta", the new column name would be: a_b_delta
astype : type, optional (default=float)
The type to which to coerce the time deltas.
absolute_difference : bool, optional (default=False)
Whether to compute the absolute difference between dates. If False, the order of cols matters, as it defines the subtractive order: for each pair, the right-hand column is subtracted from the left-hand column.
name_suffix : str, optional (default='delta')
The suffix to add to the new feature name in the form of:
<feature_x>_<feature_y>_<suffix>
See sep for more details about how new column names are formed.
Notes
- Unlike the DateFactorizer class, this transformer does not remove the original date features after extracting the new features.
- Column deltas are computed from left to right. This means that the order in which columns are defined in cols does matter.
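The sign convention and the units, astype, and absolute_difference options can be illustrated on plain datetime values. The sketch below is an assumption about the arithmetic, not the library's code: `time_delta` and `SECONDS_PER_UNIT` are hypothetical names, and a missing value yields None here, whereas the transformer's output shows NaN.

```python
from datetime import datetime

# Seconds per supported unit; mirrors the documented `units` options.
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def time_delta(left, right, units="days", absolute_difference=False, astype=float):
    """Return `left - right` expressed in `units`, or None if either is missing.

    Hypothetical helper for illustration; not part of skoot.
    """
    if left is None or right is None:
        return None
    value = (left - right).total_seconds() / SECONDS_PER_UNIT[units]
    if absolute_difference:
        value = abs(value)
    return astype(value)


b = datetime(2018, 6, 1)
c = datetime(2018, 6, 2)
print(time_delta(b, c, units="hours"))  # -24.0
print(time_delta(c, b, units="hours"))  # 24.0
print(time_delta(b, c, units="hours", absolute_difference=True))  # 24.0
```

With absolute_difference=True the result no longer depends on which column comes first, although the generated column name still does.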
Examples
>>> import pandas as pd
>>> from datetime import datetime as dt
>>> from skoot.feature_extraction import TimeDeltaFeatures
>>> stp = dt.strptime
>>> data = [
...     [1, stp("06-01-2018", "%m-%d-%Y"), stp("06-02-2018", "%m-%d-%Y")],
...     [2, stp("06-02-2018", "%m-%d-%Y"), stp("06-03-2018", "%m-%d-%Y")],
...     [3, stp("06-03-2018", "%m-%d-%Y"), stp("06-04-2018", "%m-%d-%Y")],
...     [4, stp("06-04-2018", "%m-%d-%Y"), stp("06-05-2018", "%m-%d-%Y")],
...     [5, None, stp("06-04-2018", "%m-%d-%Y")]
... ]
>>> df = pd.DataFrame.from_records(data, columns=['a', 'b', 'c'])
>>> tdf = TimeDeltaFeatures(cols=['b', 'c'], units='hours')
>>> tdf.fit_transform(df)
   a          b          c  b_c_delta
0  1 2018-06-01 2018-06-02      -24.0
1  2 2018-06-02 2018-06-03      -24.0
2  3 2018-06-03 2018-06-04      -24.0
3  4 2018-06-04 2018-06-05      -24.0
4  5        NaT 2018-06-04        NaN
Notice that column order makes a difference. If ‘c’ is defined before ‘b’, the delta is positive:
>>> TimeDeltaFeatures(cols=['c', 'b'], units='hours').fit_transform(df)
   a          b          c  c_b_delta
0  1 2018-06-01 2018-06-02       24.0
1  2 2018-06-02 2018-06-03       24.0
2  3 2018-06-03 2018-06-04       24.0
3  4 2018-06-04 2018-06-05       24.0
4  5        NaT 2018-06-04        NaN
Methods
fit(X[, y])              Fit the time-between transformer.
fit_transform(X[, y])    Fit the estimator and apply the date factorization to a dataframe.
get_params([deep])       Get parameters for this estimator.
set_params(**params)     Set the parameters of this estimator.
transform(X)             Apply the date transformation to a dataframe.
__init__(cols=None, as_df=True, units='days', sep='_', astype=<class 'float'>, absolute_difference=False, name_suffix='delta')
Initialize self. See help(type(self)) for accurate signature.
fit(X, y=None)
Fit the time-between transformer.
The "fit" step is not strictly necessary for this class, since nothing is learned from the data. It is used as a validation stage to ensure that the defined cols are genuinely datetime columns, and that is the only work performed in fit.
Parameters: X : pd.DataFrame, shape=(n_samples, n_features)
The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None.
y : array-like or None, shape=(n_samples,), optional (default=None)
Pass-through for sklearn.pipeline.Pipeline.
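A validation stage of the kind described for fit might look like the following sketch. It is not taken from the skoot source: `validate_datetime_cols` is a hypothetical helper, and the use of pandas' `is_datetime64_any_dtype` check is an assumption about how such a test could be written.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def validate_datetime_cols(X, cols=None):
    """Raise ValueError unless every selected column has a datetime dtype.

    Hypothetical helper for illustration; not part of skoot.
    """
    # Fall back to every column when none are specified.
    cols = list(X.columns) if cols is None else list(cols)
    bad = [c for c in cols if not is_datetime64_any_dtype(X[c])]
    if bad:
        raise ValueError("Non-datetime columns: %r" % bad)
    return cols


df = pd.DataFrame({
    "a": [1, 2],
    "b": pd.to_datetime(["2018-06-01", "2018-06-02"]),
    "c": pd.to_datetime(["2018-06-02", None]),
})

validate_datetime_cols(df, cols=["b", "c"])  # passes: both are datetime columns
```

Calling the helper with cols=None on this frame would raise a ValueError, because the integer column 'a' is then included in the check.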
fit_transform(X, y=None, **kwargs)
Fit the estimator and apply the date factorization to a dataframe.
The "fit" step is not strictly necessary for this class, since nothing is learned from the data. It is used as a validation stage to ensure that the defined cols are genuinely datetime types, and that is the only work performed in the fit portion.
Parameters: X : pd.DataFrame, shape=(n_samples, n_features)
The Pandas frame to fit. The operation will be applied to a copy of the input data, and the result will be returned.
y : array-like or None, shape=(n_samples,), optional (default=None)
Pass-through for sklearn.pipeline.Pipeline.
Returns: X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)
The operation is applied to a copy of X, and the result set is returned.
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params : mapping of string to any
Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns: self
transform(X)
Apply the date transformation to a dataframe.
This method will compute the deltas between provided datetime features.
Parameters: X : pd.DataFrame, shape=(n_samples, n_features)
The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.
Returns: X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)
The operation is applied to a copy of X, and the result set is returned.
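For readers without skoot installed, the behavior described for transform can be approximated in plain pandas. This is a rough sketch under stated assumptions, not the library's implementation: `add_hours_delta` is a hypothetical helper that handles a single pair of columns and a fixed unit of hours.

```python
import pandas as pd


def add_hours_delta(X, left, right, sep="_", name_suffix="delta"):
    """Return a copy of X with a `left - right` delta column, in hours.

    Hypothetical helper for illustration; not part of skoot.
    """
    out = X.copy()  # work on a copy so the caller's frame is untouched
    name = sep.join((left, right, name_suffix))
    out[name] = (out[left] - out[right]).dt.total_seconds() / 3600.0
    return out


df = pd.DataFrame({
    "a": [1, 2],
    "b": pd.to_datetime(["2018-06-01", None]),
    "c": pd.to_datetime(["2018-06-02", "2018-06-04"]),
})

result = add_hours_delta(df, "b", "c")
print(list(result.columns))  # ['a', 'b', 'c', 'b_c_delta']
```

The original date columns remain in the result, and a missing timestamp (NaT) propagates to a missing delta, in line with the example output shown earlier on this page.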