The issue
[Read More]
How to ensure model obsolescence (part 2)
Data dredging
In the spirit of my last post, I want to continue talking about some common mistakes I see among machine learning practitioners. Last time, we saw how covariate shift can be accidentally introduced by (seemingly harmlessly) applying a fit_transform to your test data. This time, I want to cover an...
[Read More]
How to ensure model obsolescence (part 1)
Fitting your test set and other terrible practices in ML
Throughout my tenure in the field of data science and machine learning, I’ve had the privilege to interview a lot of candidates. Some interviews have gone splendidly; others have gone horrendously. And, to the credit of those who have bombed, sometimes you’re just nervous and trip up; I get it....
[Read More]
Conda envs in PySpark
3 reasons you should be deploying your Conda environments for your PySpark jobs
If you’ve only ever tinkered with Hadoop within the context of a sandbox, you may never have encountered one of the inevitabilities of enterprise-scale distributed computing: different machines have different configurations. Even when synchronized with tools such as Puppet, datanodes in a Hadoop cluster may not be a mirror image...
[Read More]
An intro to dummy encoding with Skoot
Using Skoot to accelerate your ML pre-processing workflow
This post will introduce you to dummy coding in skoot, one of my projects dedicated to helping machine learning practitioners automate as much of their workflow as possible. Those who have worked in the field for a while know that 80-90% of a data scientist’s time is spent...
[Read More]