Encoders

Selected implemented encoders:

  • OneHotCategoricalEncoder: Should be the first phase in your Pipeline object (see the Pipeline sketch at the end of this section). Takes a pandas DataFrame, imputes missing categorical data with a provided string, and dummies out the object (string) columns. Returns the transformed array (a pandas DataFrame by default, or a numpy.ndarray when as_df=False).
  • SafeLabelEncoder: Wraps sklearn’s LabelEncoder, but encodes unseen data in your test set as a default factor-level value (99999). A minimal sketch follows this list.
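
Below is a minimal sketch of SafeLabelEncoder. It assumes the encoder follows the fit/transform interface of the sklearn LabelEncoder it wraps; the labels and the expected output shown are illustrative, based on the 99999 default described above.

## Example use of SafeLabelEncoder (illustrative sketch)
from skutil.preprocessing import SafeLabelEncoder

## Fit on the training labels
le = SafeLabelEncoder().fit(['USA', 'MEX', 'FRA'])

## 'CAN' was never seen in training, so it should map to the
## default factor level (99999) rather than raising an error:
>>> le.transform(['USA', 'CAN'])
[    2, 99999]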
## Example use of OneHotCategoricalEncoder
import numpy as np
from skutil.preprocessing import SafeLabelEncoder, OneHotCategoricalEncoder
import pandas as pd

## An array of strings
X = np.array([['USA','RED','a'],
              ['MEX','GRN','b'],
              ['FRA','RED','b']])
x = pd.DataFrame.from_records(data = X, columns = ['A','B','C'])

## Tack on a numeric col:
x['n'] = np.array([5,6,7])

## Fit the encoder. The default return type is a pandas DataFrame;
## we pass as_df=False here to get a numpy array instead.
o = OneHotCategoricalEncoder(as_df=False).fit(x)

## Notice that the numeric data is now BEFORE the dummies
>>> o.transform(x)
[[ 5.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
 [ 6.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.],
 [ 7.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.]]

## We can extract the new names:
>>> o.trans_nms_
['n', 'A.FRA', 'A.MEX', 'A.USA', 'A.NA', 'B.GRN', 'B.RED', 'B.NA', 'C.a', 'C.b', 'C.NA']

## Notice we have one extra factor level for each column (i.e., 'A.NA').
## This is to hold factor levels in testing that we didn't see in training.
## Most sklearn algorithms will shrink that coefficient to zero in training,
## or ignore it entirely, so it's merely a placeholder for the elegant handling
## of new data. Let's test what happens on unseen data:
Y = np.array([['CAN','BLU','c']])
y = pd.DataFrame.from_records(data = Y, columns = ['A','B','C'])

## Add the numeric variable at the end
y['n'] = np.array([7])
>>> o.transform(y)
[[ 7.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.]]

## Notice only the '.NA' features (A.NA, B.NA, C.NA) are populated!
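
Since the encoder is meant to be the first phase of a Pipeline, here is a minimal sketch of that usage, reusing the x and y DataFrames built above. It assumes OneHotCategoricalEncoder exposes the standard fit/transform interface a Pipeline step needs; the RandomForestClassifier and the labels vector are illustrative choices, not part of skutil.

## Hypothetical Pipeline with the encoder as the first phase
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

## Illustrative binary target for the three training rows in x
labels = np.array([0, 1, 1])

## The encoder dummies out the string columns before the model sees them
pipe = Pipeline([
    ('encode', OneHotCategoricalEncoder()),
    ('rf', RandomForestClassifier())
])
pipe.fit(x, labels)

## Unseen factor levels in y fall into the '.NA' columns,
## so predicting on new categories does not raise:
pipe.predict(y)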