This post will introduce you to dummy coding in skoot, one of my projects dedicated to helping machine learning practitioners automate as much of their workflow as possible. Those who have worked in the field for a while know that 80-90% of a data scientist's time is spent solely on cleaning up data or building bespoke transformers to fit into an eventual production pipeline. skoot aims to solve exactly this problem by abstracting common transformer classes and data-cleansing tasks into a reusable API.
Note that this is a very high-level intro to the package; the full package documentation is available for review here.
Mo’ data, mo’ problems
(Kinda. Not to say you'd ever ask for less data, but you know what I'm getting at…)
Imagine a client comes to you with a business question and hands you all the data you'll need to solve it. Is it ever sparkling clean and free of errors (typos, erroneous sensor values, omitted data, or otherwise)?
NO! Even when the data has been used for modeling before, you’ll generally spend a significant amount of time cleaning your data, and the more features you have, the more time you’ll spend on data cleansing tasks.
Let’s say you’re given the following dataset (the “adult data set” available on the UCI repo; ~3.8MB):
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
Our aim with this dataset is to predict whether a person makes more or less than $50K per year (binary classification). It's immediately apparent that the features span several datatypes, which will require transformations before we can do any modeling. Typically, a data scientist would spend an immense amount of time cleaning this data and preparing meaningful features; with skoot, we can begin to chip away at that bottleneck in a matter of minutes.
Converting categorical fields to numeric fields
If you want the cleanest pipeline possible, you'll end up building several custom TransformerMixin classes over the course of your modeling, one of which typically handles categorical encoding and dummy variables. There are a number of existing solutions to this problem, including pd.get_dummies, but not all of them account for two issues that skoot does (see the sketch after this list):
- What happens if there are unknown levels in the test data?
- How can we avoid the dummy variable trap?
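To make the first issue concrete, here is a minimal sketch (using hypothetical toy data, not the adult dataset) of how a naive pd.get_dummies approach produces mismatched columns when the test set contains a level the training set never saw:

import pandas as pd

# hypothetical toy frames: "green" appears only in the test set
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

print(pd.get_dummies(train).columns.tolist())  # ['color_blue', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_blue', 'color_green']
# The column sets disagree, so a model fit on the training matrix
# cannot score the test matrix without manual re-alignment.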
skoot addresses both of these for us seamlessly. If we look at the dtypes of the dataset, we can identify which columns will need dummy encoding:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("~/Downloads/adult.data.txt", header=None,
                 names=["age", "workclass", "fnlwgt", "education",
                        "education-num", "marital-status",
                        "occupation", "relationship", "race",
                        "sex", "capital-gain", "capital-loss",
                        "hours-per-week", "native-country", "target"])

y = df.pop("target")
object_cols = df.select_dtypes(["object", "category"]).columns.tolist()

# with some examination we can see that "education-num" is just
# an ordinal mirror of "education", so we can drop it
df.drop("education-num", axis=1, inplace=True)

# as always, we need to split our data
X_train, X_test, y_train, y_test = train_test_split(df, y,
                                                    test_size=0.2,
                                                    random_state=42)
This gives us the following fields as “object” (or string) type:
- workclass
- education
- marital-status
- occupation
- relationship
- race
- sex
- native-country
With skoot we can very quickly one-hot encode all the categorical variables and drop one level from each (to avoid the dummy trap). Note that skoot does not force types when defining the DummyEncoder; this is because oftentimes int fields are actually ordinal categorical features that should be encoded (like "education-num" above). Instead, skoot lets us specify exactly which columns the transformation should apply to:
from skoot.preprocessing import DummyEncoder
encoder = DummyEncoder(cols=object_cols, drop_one_level=True)
encoder.fit_transform(X_train).head()
And now our matrix looks like this:
age | fnlwgt | capital-gain | capital-loss | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | education_ 10th | education_ 11th | education_ 12th | education_ 1st-4th | education_ 5th-6th | education_ 7th-8th | education_ 9th | education_ Assoc-acdm | education_ Assoc-voc | education_ Bachelors | education_ Doctorate | education_ HS-grad | education_ Masters | education_ Preschool | education_ Prof-school | marital-status_ Divorced | marital-status_ Married-AF-spouse | marital-status_ Married-civ-spouse | marital-status_ Married-spouse-absent | marital-status_ Never-married | marital-status_ Separated | occupation_ ? | occupation_ Adm-clerical | occupation_ Armed-Forces | occupation_ Craft-repair | occupation_ Exec-managerial | occupation_ Farming-fishing | occupation_ Handlers-cleaners | occupation_ Machine-op-inspct | occupation_ Other-service | occupation_ Priv-house-serv | occupation_ Prof-specialty | occupation_ Protective-serv | occupation_ Sales | occupation_ Tech-support | relationship_ Husband | relationship_ Not-in-family | relationship_ Other-relative | relationship_ Own-child | relationship_ Unmarried | race_ Amer-Indian-Eskimo | race_ Asian-Pac-Islander | race_ Black | race_ Other | sex_ Female | native-country_ ? | native-country_ Cambodia | native-country_ Canada | native-country_ China | native-country_ Columbia | native-country_ Cuba | native-country_ Dominican-Republic | native-country_ Ecuador | native-country_ El-Salvador | native-country_ England | native-country_ France | native-country_ Germany | native-country_ Greece | native-country_ Guatemala | native-country_ Haiti | native-country_ Holand-Netherlands | native-country_ Honduras | native-country_ Hong | native-country_ Hungary | native-country_ India | native-country_ Iran | native-country_ Ireland | native-country_ Italy | native-country_ Jamaica | native-country_ Japan | native-country_ Laos | native-country_ Mexico | native-country_ Nicaragua | native-country_ Outlying-US(Guam-USVI-etc) | native-country_ Peru | native-country_ Philippines | native-country_ Poland | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39.0 | 77516.0 | 2174.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
50.0 | 83311.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
38.0 | 215646.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
53.0 | 234721.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
28.0 | 338409.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
To apply this to your test data, just as with any other scikit-learn transformer, you simply use the transform method:
encoder.transform(X_test)
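Because DummyEncoder follows the standard scikit-learn fit/transform API, it also composes directly into a Pipeline. Here's a minimal sketch assuming a downstream classifier; LogisticRegression is just an illustrative choice, not part of the original example:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# sketch: dummy encoding as the first stage of a modeling pipeline
pipe = Pipeline([
    ("dummies", DummyEncoder(cols=object_cols, drop_one_level=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))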
Things to note
- The resulting features drop one factor level from each categorical variable when drop_one_level=True is specified (the default); see the quick check below.
- The situation where an unknown factor level is present in new data is handled for us.
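We can sanity-check the first point with the fitted encoder from above (a quick sketch; note the dummy column names carry a leading space, e.g. "sex_ Female", because the raw CSV values do):

# "sex" has two raw levels, but with drop_one_level=True only one
# dummy column remains after encoding
trans = encoder.transform(X_train)
sex_cols = [c for c in trans.columns if c.startswith("sex_")]
print(X_train["sex"].nunique(), len(sex_cols))  # expect: 2 1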
Here’s a demo of what happens when there’s a new factor level present:
# select a test row (copy it so we don't mutate X_test):
test_row = X_test.iloc[0].copy()

# set the country to something that is obviously not real:
test_row["native-country"] = "Atlantis"

# transform the new row:
trans2 = encoder.transform(pd.DataFrame([test_row]))

# prove that we did not assign a country encoding:
nc_mask = trans2.columns.str.contains("native-country")
assert trans2[trans2.columns[nc_mask]].sum().sum() == 0
And there you have it: under two minutes to dummy-encode your categorical features. The full code for this example is located in the code folder.
Questions? Technical remarks? Feel free to email me at taylor.smith@alkaline-ml.com.