This post will introduce you to dummy coding in skoot, one of my projects dedicated to helping machine learning practitioners automate as much of their workflow as possible. Those who have worked in the field for a while know that 80-90% of a data scientist's time is spent solely on cleaning up data or building bespoke transformers to fit into an eventual production pipeline. skoot aims to solve exactly this problem by abstracting common transformer classes and data-cleansing tasks into a reusable API.
Note that this is a very high-level intro to the package; the full package documentation is available for review here.
Mo’ data, mo’ problems
(Kinda. Not to say you'd ever ask for less data, but you know what I'm getting at…)
Imagine a client comes to you with a business question and hands you all the data you'll need to solve it. Is it ever sparkling clean and free of errors (typos, erroneous sensor values, omitted data, or otherwise)?
NO! Even when the data has been used for modeling before, you’ll generally spend a significant amount of time cleaning your data, and the more features you have, the more time you’ll spend on data cleansing tasks.
Let’s say you’re given the following dataset (the “adult data set” available on the UCI repo; ~3.8MB):
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
Our aim with this dataset is to predict whether a person makes more or less than $50K per year (binary classification). It's immediately apparent that the features span several datatypes, which will require transformations before we can do any modeling. Typically, a data scientist would spend an immense amount of time cleaning this data and preparing meaningful features; with skoot, we can begin to chip away at that bottleneck in a matter of minutes.
Converting categorical fields to numeric fields
If you want the cleanest pipeline possible, you'll end up building several custom TransformerMixin classes over the course of your modeling, one of which typically handles categorical encoding and dummy variables. There are a number of existing solutions to this problem, including pd.get_dummies, but not all of them account for two issues that skoot does (see the sketch after this list):
- What happens if there are unknown levels in the test data?
- How can we avoid the dummy variable trap?
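To make the first issue concrete, here is a minimal sketch (using hypothetical toy data, not the adult dataset) of how a naive pd.get_dummies approach produces mismatched columns when the test set contains a level the training set never saw:

import pandas as pd

# hypothetical toy frames: "green" appears only in the test set
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

print(pd.get_dummies(train).columns.tolist())  # ['color_blue', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_blue', 'color_green']
# The column sets disagree, so a model fit on the training matrix
# cannot score the test matrix without manual re-alignment.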
skoot addresses both of these for us seamlessly. If we look at the dtypes of the dataset, we can identify which columns will need dummy encoding:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("~/Downloads/adult.data.txt", header=None,
                 names=["age", "workclass", "fnlwgt", "education",
                        "education-num", "marital-status",
                        "occupation", "relationship", "race",
                        "sex", "capital-gain", "capital-loss",
                        "hours-per-week", "native-country", "target"])

y = df.pop("target")
object_cols = df.select_dtypes(["object", "category"]).columns.tolist()

# with some examination we can see that "education-num" is just
# an ordinal mirror of "education", so we can drop it
df.drop("education-num", axis=1, inplace=True)

# as always, we need to split our data
X_train, X_test, y_train, y_test = train_test_split(df, y,
                                                    test_size=0.2,
                                                    random_state=42)
This gives us the following fields as “object” (or string) type:
- workclass
- education
- marital-status
- occupation
- relationship
- race
- sex
- native-country
With skoot we can very quickly one-hot encode all the categorical variables and drop one level from each (to avoid the dummy trap). Note that skoot does not force types when defining the DummyEncoder; this is because oftentimes int fields are actually ordinal categorical features that should be encoded (like "education-num" above). Instead, skoot lets us specify exactly which columns the transformation should apply to:
from skoot.preprocessing import DummyEncoder
encoder = DummyEncoder(cols=object_cols, drop_one_level=True)
encoder.fit_transform(X_train).head()
And now our matrix looks like this:
age | fnlwgt | capital-gain | capital-loss | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | education_ 10th | education_ 11th | education_ 12th | education_ 1st-4th | education_ 5th-6th | education_ 7th-8th | education_ 9th | education_ Assoc-acdm | education_ Assoc-voc | education_ Bachelors | education_ Doctorate | education_ HS-grad | education_ Masters | education_ Preschool | education_ Prof-school | marital-status_ Divorced | marital-status_ Married-AF-spouse | marital-status_ Married-civ-spouse | marital-status_ Married-spouse-absent | marital-status_ Never-married | marital-status_ Separated | occupation_ ? | occupation_ Adm-clerical | occupation_ Armed-Forces | occupation_ Craft-repair | occupation_ Exec-managerial | occupation_ Farming-fishing | occupation_ Handlers-cleaners | occupation_ Machine-op-inspct | occupation_ Other-service | occupation_ Priv-house-serv | occupation_ Prof-specialty | occupation_ Protective-serv | occupation_ Sales | occupation_ Tech-support | relationship_ Husband | relationship_ Not-in-family | relationship_ Other-relative | relationship_ Own-child | relationship_ Unmarried | race_ Amer-Indian-Eskimo | race_ Asian-Pac-Islander | race_ Black | race_ Other | sex_ Female | native-country_ ? | native-country_ Cambodia | native-country_ Canada | native-country_ China | native-country_ Columbia | native-country_ Cuba | native-country_ Dominican-Republic | native-country_ Ecuador | native-country_ El-Salvador | native-country_ England | native-country_ France | native-country_ Germany | native-country_ Greece | native-country_ Guatemala | native-country_ Haiti | native-country_ Holand-Netherlands | native-country_ Honduras | native-country_ Hong | native-country_ Hungary | native-country_ India | native-country_ Iran | native-country_ Ireland | native-country_ Italy | native-country_ Jamaica | native-country_ Japan | native-country_ Laos | native-country_ Mexico | native-country_ Nicaragua | native-country_ Outlying-US(Guam-USVI-etc) | native-country_ Peru | native-country_ Philippines | native-country_ Poland | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39.0 | 77516.0 | 2174.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
50.0 | 83311.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
38.0 | 215646.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
53.0 | 234721.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
28.0 | 338409.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
To apply this to your test data, just as with any other scikit-learn transformer, you simply use the transform method:
encoder.transform(X_test)
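Because DummyEncoder follows the standard scikit-learn fit/transform API, it also composes directly into a Pipeline. Here's a minimal sketch assuming a downstream classifier; LogisticRegression is just an illustrative choice, not part of the original example:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# sketch: dummy encoding as the first stage of a modeling pipeline
pipe = Pipeline([
    ("dummies", DummyEncoder(cols=object_cols, drop_one_level=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))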
Things to note
- The resulting features drop one factor level from each categorical variable when drop_one_level=True is specified (the default); see the quick check below.
- The situation where an unknown factor level is present in new data is handled for us.
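We can sanity-check the first point with the fitted encoder from above (a quick sketch; note the dummy column names carry a leading space, e.g. "sex_ Female", because the raw CSV values do):

# "sex" has two raw levels, but with drop_one_level=True only one
# dummy column remains after encoding
trans = encoder.transform(X_train)
sex_cols = [c for c in trans.columns if c.startswith("sex_")]
print(X_train["sex"].nunique(), len(sex_cols))  # expect: 2 1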
Here’s a demo of what happens when there’s a new factor level present:
# select a test row (copy it so we don't mutate X_test):
test_row = X_test.iloc[0].copy()

# set the country to something that is obviously not real:
test_row["native-country"] = "Atlantis"

# transform the new row:
trans2 = encoder.transform(pd.DataFrame([test_row]))

# prove that we did not assign a country encoding:
nc_mask = trans2.columns.str.contains("native-country")
assert trans2[trans2.columns[nc_mask]].sum().sum() == 0
And there you have it: under two minutes to dummy-encode your categorical features. The full code for this example is located in the code folder.
Questions? Technical remarks? Feel free to email me at taylor.smith@alkaline-ml.com.