Encoders
=========
Selected implemented encoders:

- ``OneHotCategoricalEncoder``:
  - Should be the first phase in your ``Pipeline`` object. Takes a Pandas dataframe, imputes missing categorical data with a provided string and dummies out the object (string) columns. Finally, returns a ``numpy.ndarray`` transformed array.
- ``SafeLabelEncoder``:
  - Wraps sklearn's ``LabelEncoder``, but encodes unseen data in your test set as a default factor-level value (99999).

.. code-block:: python

   ## Example use of OneHotCategoricalEncoder
   import numpy as np
   from skutil.preprocessing import SafeLabelEncoder, OneHotCategoricalEncoder
   import pandas as pd

   ## An array of strings
   X = np.array([['USA','RED','a'],
                 ['MEX','GRN','b'],
                 ['FRA','RED','b']])
   x = pd.DataFrame.from_records(data = X, columns = ['A','B','C'])

   ## Tack on a numeric col:
   x['n'] = np.array([5,6,7])

   ## Fit the encoder -- default return is pandas DataFrame
   o = OneHotCategoricalEncoder(as_df=False).fit(x)

   ## Notice that the numeric data is now BEFORE the dummies
   >>> o.transform(x)
   [[ 5.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
    [ 6.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.],
    [ 7.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.]]

   ## We can extract the new names:
   >>> o.trans_nms_
   ['n', 'A.FRA', 'A.MEX', 'A.USA', 'A.NA', 'B.GRN', 'B.RED', 'B.NA', 'C.a', 'C.b', 'C.NA']

   ## Notice we have one extra factor level for each column (i.e., 'A.NA').
   ## This is to hold factor levels in testing that we didn't see in training.
   ## Most sklearn algorithms will shrink that coefficient to zero in training,
   ## or completely ignore it so it's merely a placeholder for elegant handling
   ## of new data. Let's test what happens on unseen data:
   Y = np.array([['CAN','BLU','c']])
   y = pd.DataFrame.from_records(data = Y, columns = ['A','B','C'])

   ## Add the numeric var in at the end
   y['n'] = np.array([7])
   >>> o.transform(y)
   [[ 7.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.]]

   ## Notice only the 'x.NA' features are populated!