skoot.feature_extraction.DateFactorizer

class skoot.feature_extraction.DateFactorizer(cols=None, as_df=True, drop_original=True, sep='_', features=('year', 'month', 'day', 'hour'))

Extract new features from datetime features.

Automatically extract new features from datetime features. This class operates on datetime series objects and extracts features such as “year”, “month”, etc. These can then be expanded via one-hot encoding or further processed via other pre-processing techniques.

Parameters:

cols : array-like, shape=(n_features,), optional (default=None)

The names of the columns on which to apply the transformation. If None, the transformation is applied to all columns. Note that in this class, the selected columns must be DateTime types; otherwise a ValueError is raised.

as_df : bool, optional (default=True)

Whether to return a Pandas DataFrame in the transform method. If False, will return a Numpy ndarray instead.

drop_original : bool, optional (default=True)

Whether to drop the original features from the dataframe prior to returning from the transform method.

sep : str or unicode, optional (default="_")

The string separator between the existing feature name and the extracted feature. E.g., for a feature named “Transaction” and for features=("year", "month"), the original variable will be split into two new ones: “Transaction_year” and “Transaction_month”.
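
The naming rule described above can be sketched in a few lines. The helper name extracted_names is hypothetical and is not part of skoot; it only illustrates how the original column name, sep, and each entry of features might combine.

```python
# Hypothetical helper (not part of skoot): builds the new column names
# from the original column name, the separator, and the feature names.
def extracted_names(col, features=("year", "month", "day", "hour"), sep="_"):
    return [f"{col}{sep}{feat}" for feat in features]


names = extracted_names("Transaction", features=("year", "month"))
print(names)  # ['Transaction_year', 'Transaction_month']
```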

features : iterable, optional (default=("year", "month", "day", "hour"))

The features to extract. Each must be an attribute of the DateTime class; passing an invalid feature raises an AttributeError.
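
As a rough illustration of attribute-based extraction, the sketch below uses the standard-library datetime class rather than a Pandas series. The helper extract is hypothetical and does not reflect how skoot is implemented; it only shows why an unknown feature name surfaces as an AttributeError.

```python
from datetime import datetime


# Hypothetical helper (not skoot's implementation): look up each requested
# feature as an attribute of the datetime value.
def extract(value, features):
    if value is None:
        # Missing dates yield missing features.
        return {feat: None for feat in features}
    return {feat: getattr(value, feat) for feat in features}


stamp = datetime(2018, 6, 1)
parts = extract(stamp, ("year", "month", "day", "hour"))
print(parts)  # {'year': 2018, 'month': 6, 'day': 1, 'hour': 0}

# An attribute that datetime does not define raises AttributeError.
try:
    extract(stamp, ("fortnight",))
    raised = False
except AttributeError:
    raised = True
```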

Attributes

fit_cols_ : list

The columns the transformer was fit on.

Examples

>>> import pandas as pd
>>> from datetime import datetime as dt
>>> from skoot.feature_extraction import DateFactorizer
>>> strp = dt.strptime
>>> data = [
...     [1, strp("06-01-2018", "%m-%d-%Y")],
...     [2, strp("06-02-2018", "%m-%d-%Y")],
...     [3, strp("06-03-2018", "%m-%d-%Y")],
...     [4, strp("06-04-2018", "%m-%d-%Y")],
...     [5, None]
... ]
>>> df = pd.DataFrame.from_records(data, columns=["a", "b"])
>>> DateFactorizer(cols=['b']).fit_transform(df)
   a  b_year  b_month  b_day  b_hour
0  1  2018.0      6.0    1.0     0.0
1  2  2018.0      6.0    2.0     0.0
2  3  2018.0      6.0    3.0     0.0
3  4  2018.0      6.0    4.0     0.0
4  5     NaN      NaN    NaN     NaN

Methods

fit(X[, y]) Fit the date factorizer.
fit_transform(X[, y]) Fit the estimator and apply the date factorization to a dataframe.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Apply the date transformation to a dataframe.

__init__(cols=None, as_df=True, drop_original=True, sep='_', features=('year', 'month', 'day', 'hour'))

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None)

Fit the date factorizer.

Strictly speaking, this transformer does not need a fit step. Fit is used as a validation stage that ensures the defined cols genuinely are datetime columns; that is the only work performed here.

Parameters:

X : pd.DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The frame will only be fit on the prescribed cols (see __init__) or all of them if cols is None.

y : array-like or None, shape=(n_samples,), optional (default=None)

Pass-through for sklearn.pipeline.Pipeline.
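
The validation that fit is described as performing could look roughly like the following. This is a sketch under stated assumptions, not skoot's source: validate_datetime_cols is a hypothetical name, and the dtype check uses pandas.api.types.is_datetime64_any_dtype, which may differ from the check the library actually applies.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


# Hypothetical validation helper (assumption, not skoot's code).
def validate_datetime_cols(X, cols=None):
    # Default to every column when cols is None.
    cols = list(X.columns) if cols is None else list(cols)
    bad = [c for c in cols if not is_datetime64_any_dtype(X[c])]
    if bad:
        raise ValueError("Non-datetime columns: %r" % bad)
    return cols


df = pd.DataFrame({
    "a": [1, 2],
    "b": pd.to_datetime(["2018-06-01", "2018-06-02"]),
})

fit_cols = validate_datetime_cols(df, cols=["b"])  # passes

# Column "a" holds integers, so validating all columns fails.
try:
    validate_datetime_cols(df)
    failed = False
except ValueError:
    failed = True
```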

fit_transform(X, y=None, **kwargs)

Fit the estimator and apply the date factorization to a dataframe.

Strictly speaking, this transformer does not need a fit step. Fit is used as a validation stage that ensures the defined cols genuinely are datetime types; that is the only work performed in the fit portion.

Parameters:

X : pd.DataFrame, shape=(n_samples, n_features)

The Pandas frame to fit. The operation will be applied to a copy of the input data, and the result will be returned.

y : array-like or None, shape=(n_samples,), optional (default=None)

Pass-through for sklearn.pipeline.Pipeline.

Returns:

X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The operation is applied to a copy of X, and the result set is returned.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self

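
The <component>__<parameter> convention can be seen with any scikit-learn pipeline. The example below uses StandardScaler purely as a stand-in component (an assumption made so the snippet runs without skoot installed); the step name "scaler" is arbitrary.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# "scaler" is an arbitrary step name chosen for this illustration.
pipe = Pipeline([("scaler", StandardScaler())])

# Nested parameters are addressed as <component>__<parameter>.
returned = pipe.set_params(scaler__with_mean=False)

# get_params(deep=True) exposes the nested value under the same key.
with_mean = pipe.get_params(deep=True)["scaler__with_mean"]
```

Because set_params returns the estimator itself, calls can be chained.
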
transform(X)

Apply the date transformation to a dataframe.

This method will extract features from datetime features as specified by the features arg.

Parameters:

X : pd.DataFrame, shape=(n_samples, n_features)

The Pandas frame to transform. The operation will be applied to a copy of the input data, and the result will be returned.

Returns:

X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The operation is applied to a copy of X, and the result set is returned.
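
For intuition, the transformation can be approximated with plain Pandas through the Series.dt accessor. This is a minimal sketch, not the library's implementation: factorize_dates is a hypothetical function that mirrors the documented cols, features, sep and drop_original parameters.

```python
import pandas as pd


# Hypothetical re-creation of the documented behavior (not skoot's code).
def factorize_dates(X, cols, features=("year", "month", "day", "hour"),
                    sep="_", drop_original=True):
    X = X.copy()  # work on a copy so the caller's frame is untouched
    for col in cols:
        for feat in features:
            # e.g. X["b"].dt.year -> new column "b_year"
            X[f"{col}{sep}{feat}"] = getattr(X[col].dt, feat)
        if drop_original:
            X = X.drop(columns=[col])
    return X


df = pd.DataFrame({
    "a": [1, 2],
    "b": pd.to_datetime(["2018-06-01", None]),  # second row is missing
})

out = factorize_dates(df, cols=["b"])
print(list(out.columns))  # ['a', 'b_year', 'b_month', 'b_day', 'b_hour']
```

Missing dates propagate as missing values in every extracted column, which matches the NaN row in the Examples section.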
