Humans lie to themselves. After the 2016 presidential election, the New York Times ran a headline representing national consensus: “How Data Failed Us in Calling an Election.” The polling data, gathered at a state and national level, had shown Hillary Clinton in the lead and likely to win the election. Words like “likely” and “probably” were a fair representation of her position, but sites and publications such as the New York Times, Huffington Post and Princeton Election Consortium (PEC) went above and beyond, putting Clinton’s chances at 85, 98 and 99.7 percent respectively.
Those predictions weren’t necessarily wrong, however. Rather, “wrong” and “right” are not the correct framework through which to view predictions. Since prediction can never be perfect, whether a model was right or wrong on a specific case doesn’t matter so much as whether the method was sound. Whether a particular case was right or wrong can be attributed to “noise” or “chaos,” which can’t be fixed. By “chaos” and “noise” I don’t mean adjectives that describe me trying to make it to a 9:30 class, but instead I mean (and this is much worse) math. Boiled down, chaos theory states that unless we have every piece of data, we will never be perfectly accurate in our predictions. Given this truth, what we can control is whether the model is right in a macro sense (i.e., it does the best job it can with the data we do have).
The proper way to understand polling and predictions becomes an exercise in calibration. Calibration? It is just the idea that if you predict something to happen 15 out of 100 times, that thing should happen 15 out of 100 times. So the question becomes: Should we believe a model that says a candidate with a lead like Clinton’s would—as the PEC suggested—lose only three out of 1000 races?
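For anyone who would rather see calibration than read about it, here is a minimal Python sketch. It groups forecasts by the probability the forecaster stated and reports how often those events actually happened. The function name and every number in the sample data are invented for illustration; this is not real forecasting data.

```python
from collections import defaultdict


def observed_frequencies(forecasts):
    """Map each stated probability to how often those events really happened.

    `forecasts` is an iterable of (stated_probability, happened) pairs.
    A well-calibrated forecaster has observed frequencies close to the
    probabilities they stated.
    """
    outcomes_by_claim = defaultdict(list)
    for stated_probability, happened in forecasts:
        outcomes_by_claim[stated_probability].append(happened)
    return {
        claim: sum(outcomes) / len(outcomes)
        for claim, outcomes in sorted(outcomes_by_claim.items())
    }


# Hypothetical track record (made-up numbers, for illustration only):
# events called at 15 percent happened 15 times out of 100 (calibrated),
# events called at 99.7 percent happened only 67 times out of 100 (overconfident).
forecasts = (
    [(0.15, True)] * 15 + [(0.15, False)] * 85
    + [(0.997, True)] * 67 + [(0.997, False)] * 33
)
observed = observed_frequencies(forecasts)

for claim, frequency in observed.items():
    print(f"said {claim:.1%}, happened {frequency:.1%} of the time")
```

The gap between what was said and what happened is the whole test. One miss proves nothing; a pattern of misses at the same stated confidence does.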
Clinton led by only a few percentage points in key swing states, a lead that could be wiped out by a systematic, correlated error in polling. Such errors happen about a third of the time, and one occurred in 2016. Clinton lost on a razor’s edge that was clearly possible based on the data. We know that polling errors, even of this magnitude, are inevitable. If we imagine 100 races, and a candidate with a lead like Clinton’s only loses a race if there is a systematic polling error, and systematic polling errors happen 33 out of 100 times, then that candidate would win 67 out of 100 races. With that in mind, the prediction for Clinton should have been closer to 67 percent, not the 99.7 conjured up by the PEC. Clinton would not have lost three out of 1000, but rather roughly 330 of 1000.
To put that into perspective, that’s the difference between correctly guessing someone is a Democrat (one in three), and blowing a 28-3 lead in the Super Bowl (which has happened once; I’m sorry, Atlanta). There was nothing wrong with the data, merely with how it was interpreted. Polling is fine. Humans, even those whose jobs depend on it, just suck at understanding error, probability and correlation.
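The back-of-the-envelope math above fits in a few lines of Python. This is only a sketch of the simplified argument in the text, which assumes the leader loses whenever a systematic polling error occurs and never otherwise; the function and parameter names are my own.

```python
def leader_win_probability(p_systematic_error, p_loss_given_error=1.0):
    """Chance the poll leader wins under the simplified model in the text.

    Assumption: the leader loses only when a systematic polling error
    occurs, and (by default) always loses when one does.
    """
    return 1.0 - p_systematic_error * p_loss_given_error


# Systematic errors happen about 33 times out of 100 (figure from the text).
p_win = leader_win_probability(0.33)
losses_per_1000 = round((1.0 - p_win) * 1000)

# The PEC's 99.7 percent, expressed the same way for comparison.
pec_losses_per_1000 = round((1.0 - 0.997) * 1000)

print(f"win chance: {p_win:.0%}")
print(f"losses per 1000 races: {losses_per_1000} versus the PEC's {pec_losses_per_1000}")
```

Lowering `p_loss_given_error` below 1.0 would model a leader who sometimes survives a polling error, which pushes the win chance back up, though nowhere near 99.7 percent.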
Now that I’ve cleared the name of my good friend data, I’d like to spend the rest of my time giving you the breakdown you need of different types of polls: individual, aggregated, state, national, general and primary. A few key facts to remember: Polls as a whole are not biased towards any political party or affiliation, polls are better closer to the event than further out, and polls are relatively accurate.
Individual polls are probably the most common that you see making headlines, and they can be conducted at a state level or a national level, for general elections or for party primaries. Individual polls are of varying quality. Some are bad, and should definitely be taken with a lot of salt, and some are better and should still be taken with the proverbial grain of salt. When judging and taking positions based on polls, you should probably look to see how the pollster is rated; some have good track records of following standard procedure for conducting polls and generating consistent results, while others do not. You should also be aware that the polls most likely to make the headlines are outliers, because they have a larger shock value. They probably don’t represent the consensus view around the poll subject.
Aggregated polls are just individual polls bundled together. Assuming that the bad polls have been weighted correctly and have less of an impact on the final numbers than the better polls, aggregated polls are almost always better than individual polls. Aggregated polls are less noisy. That is, the final value is less influenced by small samples or potential biases in the pollsters’ questions. The main concerns with aggregated polls are herding behavior and systematic error. Herding is the human tendency to avoid rocking the boat, which consciously or subconsciously suppresses polls and opinions that are out of line with the norm. Systematic error is the fact that if one poll is wrong, most likely all related ones are wrong by a similar amount. This was one of the major factors in the surprise of Clinton’s loss, and it is also related to the impetus for the Great Recession: collateralized debt obligations (CDOs) were assumed to be secure because bankers assumed that the risk of foreclosure was uncorrelated; foreclosures were in fact correlated and made the CDOs thousands of times riskier than originally believed. Both issues share the same root: believing correlated events (be it polling errors or foreclosure risks) weren’t actually correlated. Even so, if given the choice between using aggregated and individual polls, always go for the aggregated (unless you see the word “unskewed,” in which case run as far away as possible).
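A toy simulation shows why averaging helps with noise and does nothing for systematic error. In this sketch, each poll's miss is a shared error (identical for every poll in a given election) plus that poll's own independent noise. The error sizes, the number of polls and the function name are all assumptions picked for illustration, not estimates from real polling.

```python
import random
import statistics


def typical_average_error(n_polls, noise_sd, shared_sd, n_trials=5000, seed=0):
    """Root-mean-square error, in points, of an average of n_polls polls.

    Each simulated poll misses by: one shared error (the same for every
    poll in that trial) plus independent noise unique to that poll.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_trials):
        shared = rng.gauss(0, shared_sd)
        poll_errors = [shared + rng.gauss(0, noise_sd) for _ in range(n_polls)]
        squared_errors.append(statistics.fmean(poll_errors) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


# Illustrative numbers: 3 points of per-poll noise, 2 points of shared error.
single_poll = typical_average_error(n_polls=1, noise_sd=3.0, shared_sd=0.0)
average_no_shared = typical_average_error(n_polls=20, noise_sd=3.0, shared_sd=0.0)
average_with_shared = typical_average_error(n_polls=20, noise_sd=3.0, shared_sd=2.0)

print(f"one poll, independent noise only:      {single_poll:.2f} points")
print(f"20-poll average, independent noise:    {average_no_shared:.2f} points")
print(f"20-poll average, plus shared error:    {average_with_shared:.2f} points")
```

With independent noise alone, the 20-poll average is far tighter than any single poll. Add the shared error and the average can never get better than the shared error itself, no matter how many polls go in.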
I can see (yes, even from here) your eyes glazing over a bit, so let’s switch gears a little (but not too much). State level polls are reliable in their own right, but you should remember that both the number of polls conducted and the sample size of the polls will be smaller than at the national level, factors that increase the margin of error you should expect from a poll. Of course, state polls shouldn’t be used to gauge a national level of support, and vice versa, unless no other means are possible.
National level polls are much like state polls and often cover the same subjects, but they have a different set of problems. While they have a larger number of polls and a larger sample size, they also create a bit of a problem when calculating how likely someone is to win an election. The reason? Primary elections and general elections are given specific rules for how points are to be allotted, meaning that national-level polls have to be converted to specific state-by-state or area-by-area chance-to-win point values. Again, the data is most likely sound, but the method used back in 2016 that converted Clinton’s three-point lead in the polls into an 85 percent chance of winning Florida or New Hampshire should have its flaws examined and checked to see how well-calibrated it is.
Before I get to general and primary polls, I would like to remind you that time is a factor in poll accuracy, and we are very far away from the actual general election (this is evidenced by the fact that there are still enough candidates to have a 3 v. 3 basketball game and still be able to substitute from the bench). Bearing that in mind, here we go:
This far out, general election polls are an absolute crock, although once the primary field winnows down a bit more and some time passes, they’ll start to become more useful.
Primary results are a bit more predictive, and it’s probably fair to say that the frontrunner right now is your best bet for the winner, but that “best bet” might not be a good bet. That’s why I say that Biden is most likely to be nominated, but is an underdog to the field. Right now, he has approximately 28 percent support, which puts him ahead of all other candidates. That being said, he doesn’t have the other 72 percent of the Democratic base, and if any other candidate can cannibalize the supporters of those who have dropped out, it’s possible he will lose the nomination. We don’t know which candidate in particular will gain at the expense of others (only one person can win the nomination, so somebody has to), but we know that someone will and that’s enough to say that Biden probably won’t win the nomination.
All we can do is embrace the uncertainty. Will Biden win the nomination? Probably not. Will Warren? Probably not. Will Sanders? Probably not. Will I? Definitely not. The answer is the same for every candidate (“probably not”), but somebody has to win the nomination eventually. So if you had to put money on somebody winning, Biden would be the best bet. But if you had the choice of betting “anybody but Biden,” that would be an even better one.
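The “best bet that still isn’t a good bet” idea can be sketched in Python. One loud assumption: this treats a candidate’s polling share as a rough stand-in for their chance of winning the nomination, which is a simplification. Only Biden’s 28 percent comes from the figures above; the split among the other candidates is made up so the numbers add to 100 percent.

```python
def favorite_vs_field(chances):
    """Return (favorite, favorite's chance, chance that anyone else wins).

    `chances` maps each candidate to a win probability; the values must
    sum to 1 because exactly one person wins the nomination.
    """
    if abs(sum(chances.values()) - 1.0) > 1e-9:
        raise ValueError("chances must sum to 1")
    favorite = max(chances, key=chances.get)
    return favorite, chances[favorite], 1.0 - chances[favorite]


# Biden's 0.28 is from the text; every other entry is hypothetical.
chances = {
    "Biden": 0.28,
    "Candidate B": 0.20,
    "Candidate C": 0.17,
    "Candidate D": 0.15,
    "Everyone else": 0.20,
}
favorite, favorite_chance, field_chance = favorite_vs_field(chances)

print(f"best single bet: {favorite} at {favorite_chance:.0%}")
print(f"anybody but {favorite}: {field_chance:.0%}")
```

However the remaining share is divided, the field’s total stays the same, which is why “anybody but Biden” beats Biden without knowing who the somebody is.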
TL;DR: aggregated polls close to the time of the event are best, but be aware of uncertainty and polling error, specifically error that is correlated. If somebody tells you something in the primary or the general is guaranteed at this point, poke them in the eye.
Photo courtesy of KOMUnews via Flickr