Encoders
Selected implemented encoders:

- OneHotCategoricalEncoder: Should be the first phase in your Pipeline object. Takes a Pandas dataframe, imputes missing categorical data with a provided string and dummies out the object (string) columns. Finally, returns a numpy.ndarray transformed array.
- SafeLabelEncoder: Wraps sklearn's LabelEncoder, but encodes unseen data in your test set as a default factor-level value (99999).
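The idea behind SafeLabelEncoder can be sketched in plain Python. This is only an illustration of the behaviour described above, not skutil's implementation: the class name `SketchSafeLabelEncoder` and its internals are hypothetical, and the only detail taken from the text is the fallback value 99999 for unseen levels.

```python
class SketchSafeLabelEncoder:
    """Illustrative stand-in, NOT skutil's SafeLabelEncoder."""

    UNSEEN = 99999  # default factor-level value quoted in the text above

    def fit(self, values):
        # Sorted unique labels receive consecutive integer codes
        self.classes_ = sorted(set(values))
        self._codes = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Labels never seen during fit fall back to the UNSEEN code
        return [self._codes.get(v, self.UNSEEN) for v in values]


enc = SketchSafeLabelEncoder().fit(['USA', 'MEX', 'FRA'])
train_codes = enc.transform(['USA', 'MEX', 'FRA'])
test_codes = enc.transform(['FRA', 'CAN'])  # 'CAN' was not in the training data
print(train_codes)
print(test_codes)
```

A plain label encoder would typically raise an error on 'CAN'; mapping it to a sentinel code instead lets a fitted pipeline keep scoring new data.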
## Example use of OneHotCategoricalEncoder
import numpy as np
from skutil.preprocessing import SafeLabelEncoder, OneHotCategoricalEncoder
import pandas as pd
## An array of strings
X = np.array([['USA','RED','a'],
['MEX','GRN','b'],
['FRA','RED','b']])
x = pd.DataFrame.from_records(data = X, columns = ['A','B','C'])
## Tack on a numeric col:
x['n'] = np.array([5,6,7])
## Fit the encoder -- the default return is a pandas DataFrame;
## as_df=False asks for a numpy array instead
o = OneHotCategoricalEncoder(as_df=False).fit(x)
## Notice that the numeric data is now BEFORE the dummies
>>> o.transform(x)
[[ 5., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0.],
[ 6., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[ 7., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.]]
## We can extract the new names:
>>> o.trans_nms_
['n', 'A.FRA', 'A.MEX', 'A.USA', 'A.NA', 'B.GRN', 'B.RED', 'B.NA', 'C.a', 'C.b', 'C.NA']
## Notice we have one extra factor level for each column (e.g., 'A.NA').
## This is to hold factor levels in testing that we didn't see in training.
## Because that column is all zeros in training, most sklearn algorithms will
## shrink its coefficient to zero or ignore it completely, so it's merely a
## placeholder for elegant handling of new data. Let's test what happens on
## unseen data:
Y = np.array([['CAN','BLU','c']])
y = pd.DataFrame.from_records(data = Y, columns = ['A','B','C'])
## Add the numeric var in at the end
y['n'] = np.array([7])
>>> o.transform(y)
[[ 7., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.]]
## Notice that, apart from the numeric column, only the '*.NA' features are populated!
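The placeholder-column behaviour shown above can be reproduced with a small dependency-free sketch. It is not how OneHotCategoricalEncoder is implemented; the helper names `fit_levels`, `one_hot_row` and `feature_names` are hypothetical, and the sketch assumes levels are ordered alphabetically and numeric columns come first, as in the output above.

```python
def fit_levels(rows, cat_cols):
    """Collect the sorted training levels for each categorical column."""
    return {c: sorted({r[c] for r in rows}) for c in cat_cols}


def one_hot_row(row, levels, num_cols):
    """Encode one record: numeric values first, then dummies plus an NA slot."""
    out = [float(row[c]) for c in num_cols]
    for col, seen in levels.items():
        dummies = [1.0 if row[col] == lvl else 0.0 for lvl in seen]
        # The extra trailing slot fires only for levels absent at fit time
        dummies.append(0.0 if row[col] in seen else 1.0)
        out.extend(dummies)
    return out


def feature_names(levels, num_cols):
    """Build names such as 'A.FRA', with one 'A.NA' placeholder per column."""
    names = list(num_cols)
    for col, seen in levels.items():
        names += ['%s.%s' % (col, lvl) for lvl in seen] + ['%s.NA' % col]
    return names


train = [
    {'A': 'USA', 'B': 'RED', 'C': 'a', 'n': 5},
    {'A': 'MEX', 'B': 'GRN', 'C': 'b', 'n': 6},
    {'A': 'FRA', 'B': 'RED', 'C': 'b', 'n': 7},
]
levels = fit_levels(train, ['A', 'B', 'C'])
names = feature_names(levels, ['n'])
encoded_train = [one_hot_row(r, levels, ['n']) for r in train]
unseen = one_hot_row({'A': 'CAN', 'B': 'BLU', 'C': 'c', 'n': 7}, levels, ['n'])
print(names)
print(unseen)
```

Keeping one reserved slot per categorical column fixes the output width at fit time, so unseen levels at test time never change the shape of the feature matrix.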