Published on August 24, 2016
Forrest Gump famously said that life is like a box of chocolates. He may well be right. But predictive modeling is less a box of chocolates, and more a tube of toothpaste.
In each tube, there is a certain amount of toothpaste available. Think of this toothpaste as information or predictive capability. The same dataset may have varying amounts of “toothpaste” for different predictive purposes … but let’s ignore that for now. For a given investigation, there is only so much information to be squeezed from a dataset.
Why is this? Imagine for a moment trying to predict the monarch butterfly population from Nasdaq stock index data. There is no plausible relationship between the two, so no model, however clever, can produce a useful prediction. To take it back to our analogy … there isn’t any toothpaste in this particular tube.
Similarly, imagine predicting BMW automobile ownership from personal income. The correlation is better than the prior example … but still far from perfect. Most BMWs are indeed owned by individuals with high income, but most individuals with high income don’t own BMWs. This would be a toothpaste tube with some toothpaste, but nowhere near full.
So how do we get the toothpaste out in order to use it? The brute force approach would be to grind up the tube and all its contents, and then use that for brushing. While we would never do this with our toothpaste, the mathematical equivalent (feeding every available variable into a model, relevant or not) is sadly pretty common. But much of what we get in that case is tube … not toothpaste, which decreases the cavity protection (not to mention the taste). As we all know, extracting the toothpaste from the tube is the best approach.
Simple models are like squeezing the toothpaste with your hands: quite effective for full tubes, but poor for tubes with less in them. More advanced techniques are like using an army of nano-robots to scrape out all possible toothpaste. Sometimes excessive … but they can be super effective when needed.
However, we must be careful. Even nano-robots can scrape off tube material, contaminating the toothpaste. In data science, this would be called overfitting. It makes model results look great … until you try to use the information to predict on new data.
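To make the overfitting point concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the data is synthetic (a straight line plus random noise), and the "nano-robot" model is a simple memorizer (a 1-nearest-neighbour lookup) standing in for any overly flexible technique. It is compared with a plain least-squares line, the "squeeze with your hands" model.

```python
import random

def make_data(xs, rng, noise=1.0):
    """Synthetic data (assumed for illustration): y = 2x + 1 plus Gaussian noise."""
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, noise)) for x in xs]

def fit_line(data):
    """Simple model: ordinary least squares for y = a*x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_memorizer(data):
    """Overly flexible model: return the y of the closest training x (1-nearest-neighbour)."""
    def predict(x):
        return min(data, key=lambda point: abs(point[0] - x))[1]
    return predict

def mse(model, data):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data([i * 0.1 for i in range(100)], rng)                 # data used for fitting
test = make_data([rng.uniform(0.0, 9.9) for _ in range(100)], rng)    # new, unseen data

line = fit_line(train)
memorizer = fit_memorizer(train)

results = {
    "line": (mse(line, train), mse(line, test)),
    "memorizer": (mse(memorizer, train), mse(memorizer, test)),
}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE = {train_err:.3f}   test MSE = {test_err:.3f}")
```

The memorizer reproduces its training data perfectly, so its training error is exactly zero, which is the "results look great" part. The number that matters is the error on the unseen data, where the memorized noise (the scraped-off tube material) no longer helps.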
So how do you get the most toothpaste out of your tube?
1. Limit the data used to that which is likely to be predictive. There are many ways to do this, but the simplest is to ask a subject matter expert for their opinion. This may not pick up obscure dependencies, but is a great first pass. If you are looking for a mathematical solution, the simplest would be a correlation test, and better (although far more computationally intensive) would be a variable importance ranking from a random forest.
2. Use models appropriate to the need. Nano-robots are great … if you need them. But why bother if you get all the toothpaste you need by using your hands? There are literally thousands of options, but these two links (Modeling Basics and Types of Mathematical Models) are good places to start.
3. Look for help when you need it. Data science is an actively developing field. Don’t be scared to admit that you don’t know everything. Nobody does. The only way to learn is to dive in, try things, and ask questions when stuck. Check out the Cross Validated website as a great starting point for answers.
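The correlation test mentioned in the first tip can be sketched in a few lines. This is only an illustration under stated assumptions: the column names `useful` and `unrelated` and the data itself are made up, and the Pearson correlation is computed by hand so the example needs nothing beyond the Python standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n = 500

# Hypothetical candidate predictors: one drives the target, one is pure noise.
features = {
    "useful": [rng.gauss(0.0, 1.0) for _ in range(n)],
    "unrelated": [rng.gauss(0.0, 1.0) for _ in range(n)],
}

# Synthetic target (assumed): depends on "useful" only, plus random noise.
target = [3.0 * u + rng.gauss(0.0, 1.0) for u in features["useful"]]

# Rank candidate predictors by the strength of their correlation with the target.
ranking = sorted(
    ((name, pearson(column, target)) for name, column in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name:10s} r = {r:+.3f}")
```

Columns near the bottom of the ranking are candidates to drop before modeling. Keep in mind that Pearson correlation only measures straight-line relationships, which is one reason a random forest importance ranking (for example, the `feature_importances_` attribute of scikit-learn's `RandomForestRegressor`) can be the better, if slower, screen.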
So now you know why data science is like a tube of toothpaste. Just like momma always said.