Election Polling and Aggregation Have Problems* - Looking at Election Predictions to Learn Statistics (In a Non-Partisan Manner)
*More problems than many practitioners would admit
During the 2016 elections, I noticed something for the first time: Trump, in a crowded Republican primary, made frequent reference to his polling numbers against his opponents. His argument seemed to be, "I'm ahead in the polls, so you should vote for me." But it wasn't just Trump who was becoming obsessed with polling; it was us as a society. After doing well predicting the 2008 elections, Nate Silver parlayed that into fame and fortune, and according to some, eventual infamy too. In this 2024 election, we're all mini-Nate Silvers, each with our favored polls, preferred aggregators, and rules of thumb on how to weight or discount those polls. I exaggerate, but perhaps not by much; polling matters much more to us as a society nowadays.
In my mind, one of the defining features of the modern age is data. Nowadays we are more than awash in it. We collect it anywhere and everywhere we can. We're constantly generating new data and trying to tease out insights from it. This blog/project is absolutely a part of that. In the past 100 years or so, with the aid of computers, we're now able to apply statistical techniques cheaply and easily in a way that would have been inconceivable before the mid-1900s. Alongside this increase in data and the ability to process and analyze it, we have also invented many different techniques to actually do the analysis with. I'm frequently surprised, when reading up on statistical analysis, to find that some technique that seems omnipresent nowadays was invented in the 1980s, if not later.
However, with the proliferation of all this data, analysis, and new techniques comes a corresponding increase in the odds of tricking ourselves. As Twain said, there are lies, damned lies, and statistics. Today I'm going to look at polling at large and why I think we pay too much attention to minor differences between polling aggregators. While this post will touch on some more advanced statistical concepts, the underlying motivation is to explain, in relatively simple language, some of the issues with polls and polling aggregation. Ideally, by the end of the piece you'll understand the polling landscape and how to think about it a little better.
Furthermore, it’ll be about statistical thinking in general, and how that relates to predicting elections.
Step One: Defining the Goal
First off, we want to define our goal. It sounds relatively simple, but it's always important to have a concrete goal in mind. Maybe you're trying to test a hypothesis, maybe you're trying to learn something about data you have; it could be almost anything, but the important thing is to have that goal in mind and ensure that you're always working towards it. If you don't, then it's easier than you think to lose track of it and end up answering a slightly (if not entirely) different question.
Today, our goal is to be able to accurately predict elections.
How Can We Predict Elections - Just Polls?
So now that we've defined our goal, let's think about how we can get closer to it. What we want is a data series that correlates highly (be that positively or negatively) with the election. That last sentence is worth expanding on. Again, our goal remains to accurately predict the election. While potential voters' answers about who they intend to vote for are (likely) correlated with the election outcome, they are not the only data series correlated with it. Additionally, there are reasons that polling data might not be as strongly correlated with election results as we would like.
I also included the bit about correlation being positive or negative because either could serve our purposes. Positive correlation means that when A happens, B is more likely to happen, and when A doesn't happen, B is more likely not to happen. Negative correlation means that when A happens, B is less likely to happen, and when A doesn't happen, B is more likely to happen. The common format for expressing correlation is a single number between -1 and 1. If the correlation is 1, then A and B are perfectly correlated and A happening guarantees B happening. A correlation of -1 is perfectly inverse and means A happening guarantees B does NOT happen. 0 means there is no relation between A and B.
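To make that -1 to 1 scale concrete, here's a minimal sketch in Python. The approval and vote-share numbers are invented purely for illustration; the point is just what perfectly mirrored series look like on the correlation scale.

```python
import numpy as np

# Hypothetical, made-up series purely to illustrate the -1 to 1 scale.
# Imagine each entry is one election cycle.
incumbent_approval = np.array([55, 48, 60, 42, 51, 47])   # approval rating (%)
incumbent_vote     = np.array([53, 47, 58, 44, 50, 46])   # incumbent vote share (%)
challenger_vote    = 100 - incumbent_vote                  # the mirror image

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
print(np.corrcoef(incumbent_approval, incumbent_vote)[0, 1])   # close to +1
print(np.corrcoef(incumbent_approval, challenger_vote)[0, 1])  # the exact opposite, close to -1
```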
So with the correlation explanation aside, our goal is the data series most highly correlated with election results [1]. I wanted to cover this, in addition to stressing the importance of keeping our goal in mind, to make a point: if there is data that is more correlated with election results than polling, then we should be using that instead of polling data.
To give an example, economic indicators will be correlated with election results. If lots of people are unemployed, incumbents will likely do worse. If wages are up and filling up the gas tank is cheap, incumbents will likely do better. So do keep in mind that polls aren't the only way to predict an election, or necessarily the best.
Why Not Use Polling Data
We just covered that there might be data that's better for our purposes than polling data. For the rest of the piece, we'll set that aside and assume polling data is what we should use if we want to best predict election results. Still, keep that possibility in mind, and ask yourself whether the person you're reading is trying to predict the election or, for instance, trying to generate ad revenue by getting people to click on their website. In fact, just as there might be people writing about the election to generate clicks instead of accurate predictions, we can imagine a pollster doing the same.
It’s hopefully not controversial to say that polling data influences political reality. I cannot say how much it does, but bad polls ended the candidacy of Joe Biden (or had a very strong influence). Not all polls will be that significant, but there are partisan pollsters who seek to influence political outcomes by publishing biased poll results.
However, we're still statisticians, and we can say that a biased poll can be useful if certain conditions hold. Imagine a pro-Trump polling company whose results consistently run 6 points more favorable to Trump than the actual outcomes. So if they showed Trump +8, you would expect the actual result to be Trump +2. If that bias held consistently over time, their polls would be very useful indeed!
So now we’re getting to the idea of polling aggregation. We won’t have a perfect idea of how biased every poll is, but if we do a good job of adjusting a bunch of polls and averaging them out, then in theory we should have a good idea of how the election will play out.
I'm speaking in highly general terms here, but that is basically what Nate Silver, the New York Times, or any other polling aggregator is doing. They are taking a bunch of polls, adjusting for any bias, and then applying some kind of weighting scheme to arrive at a final prediction.
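As a rough sketch of that idea (this is not how Silver or the NYT actually implement it; the margins, house effects, and weights below are all invented for illustration), a bias-adjusted, weighted poll average might look something like this:

```python
import numpy as np

# Each poll: margin in points (positive = candidate A ahead),
# an estimated "house effect" toward A in points, and a weight.
# All numbers are hypothetical, purely to show the mechanics.
polls = [
    {"margin": +8.0, "house_effect": +6.0, "weight": 0.5},  # pollster known to lean toward A
    {"margin": +1.0, "house_effect":  0.0, "weight": 1.0},  # no known lean
    {"margin": -2.0, "house_effect": -3.0, "weight": 0.8},  # pollster known to lean toward B
]

# 1) Remove each pollster's estimated bias, 2) take a weighted average.
adjusted = np.array([p["margin"] - p["house_effect"] for p in polls])
weights  = np.array([p["weight"] for p in polls])
aggregate = np.average(adjusted, weights=weights)

print(f"Bias-adjusted polling average: A {aggregate:+.1f}")
```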
I think as a whole, the approach is sound and makes sense. However, I believe that for several reasons, there will always be a minimum amount of uncertainty or guesswork involved. In other words, certain polls or even aggregators can be correct in any given election, but a certain amount of that accuracy is due to luck. If that does not make sense, hopefully the next few sections will make my point clearer. Additionally, it’ll explain what it means for data to be non-stationary and why you would use an Autoregressive Model.
Defining Our Goal: Addendum on Modeling
Above I said that our goal is to be able to accurately predict elections. That is still the case, but I thought it would help to go a little more into how we will do that. In statistical literature, we would define our goal as the dependent variable, or y [2].
Our goal is to find the independent variables or x’s that would lead us to an accurate y. At the end of the day our equation will look something like this:
Predicted Election Results = Our Input Polling Data * Beta Coefficient That Maximizes the Odds of Predicting the Election Result Correctly
That’s a mouthful and a half, but that’s what we’re trying to do. This is a fancy version of what you’ll see as a basic linear regression which is something like:
y = x * Beta
y is the election prediction. x is the poll data we're weighting. Beta is just the coefficient that gives us the most accurate estimate of y. So to make this "real world" we would have:
Predicted Election Result = New York Times Poll * Beta
I don't want to use tons of mathematical notation, but you can imagine that in the real version of this equation we'd have many more polls, each with its own Beta - after all, we'll weight the NYT differently from the Washington Post to get the most accurate results. This is where you would use a Sigma symbol (that E-looking symbol below) to indicate that you're summing over a bunch of different polls.
y = Σ xᵢ * βᵢ (summing over each poll i)
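To make that concrete, here's a minimal sketch of one way the betas could be estimated: ordinary least squares on past elections. The poll readings and results below are made up, and real aggregators use considerably more sophisticated methods, but the shape of the problem is the same.

```python
import numpy as np

# Rows = past elections, columns = the final pre-election margin reported by each
# pollster (say, NYT and WaPo). Every number here is invented for illustration.
X = np.array([
    [ 3.0,  4.0],   # election 1
    [-1.0,  0.5],   # election 2
    [ 5.0,  6.0],   # election 3
    [ 0.0, -1.0],   # election 4
])
y = np.array([2.5, -0.5, 4.5, -0.5])   # the actual election margins

# Solve for the betas that minimize squared prediction error: y ≈ X @ beta.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted weight per pollster:", beta)

# Prediction for a new election, given fresh polls from the same pollsters.
new_polls = np.array([2.0, 3.0])
print("predicted margin:", new_polls @ beta)
```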
In fact, we could get even more specific. In this case we’re trying to get a y or dependent variable that is most accurate for 2024 (or whichever election we’re trying to predict).
Again, it sounds like a minor distinction, but it’ll be important to keep in mind for our first major problem with the election prediction.
Problem 1 - Non-Stationarity
Imagine we lived in a world where we didn't know that you have a 50% chance of getting Heads when you flip a coin, and I set out to determine the odds of landing Heads on any given flip. So I pull some datasets of how many times a coin was flipped and how many times it turned up Heads. Generally speaking, given a decent-sized sample, it shouldn't matter whether I'm pulling data from yesterday, 10 years ago, or 100 years ago. I will likely arrive at a model that puts the odds of Heads at around 50%, because a coin intrinsically has 50% odds of landing on either side.
Now we're in a slightly parallel world. We still don't know the odds of getting Heads on a coin flip, but there's one key difference: in this world, 100 years ago the weighting of the coins was different. For whatever reason (the metals they used, the way the coin was designed), there was actually a 55% chance of getting Heads on any given flip. Again, that was 100 years ago; our current coins are a true 50/50.
Now, instead of being able to use any data to predict a coin flip, we would have to be sure which kind of coin we're trying to predict. Is it the century-old coin biased towards Heads? Or is it the modern coin that's a true 50/50?
It sounds like a silly example, but it illustrates an important concept: stationarity in our predictors.
A data series (or a model's errors) is stationary if its statistical properties stay generally the same over time; it's non-stationary if they change over time. In our example, the weighting of the coin changed over time, so if we didn't have something in our model to account for that, the model would perform poorly as a result. To be clear, this is assuming we're using one model that covers both the coin from 100+ years ago, when you were 55% likely to get Heads, and the changed coin that's now a true 50/50.
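Here's a small simulation of exactly that situation, assuming the 55% old-era coin and the 50% modern coin from the story above. A single pooled model lands around 52.5%, which is wrong for both eras.

```python
import numpy as np

rng = np.random.default_rng(0)

# Old era: coins land Heads 55% of the time. New era: a true 50/50.
old_flips = rng.binomial(1, 0.55, size=10_000)
new_flips = rng.binomial(1, 0.50, size=10_000)

# One model fit over all the data, ignoring that the process changed.
pooled_estimate = np.concatenate([old_flips, new_flips]).mean()

print(f"pooled estimate of P(heads): {pooled_estimate:.3f}")   # ~0.525: wrong for both eras
print(f"old-era estimate:            {old_flips.mean():.3f}")  # ~0.55
print(f"new-era estimate:            {new_flips.mean():.3f}")  # ~0.50
```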
Why Is US Election Prediction Non-Stationary?
Just like the coin example, when you’re thinking of non-stationarity you want to be asking: would using the same independent variables, or predictors, work equally well over time?
We assumed that polling data works for our purposes here. Part of polling is looking at various breakdowns of the polls themselves. Can we draw conclusions about who someone is likely to vote for based on their gender? Their race? Things of that nature.
In fact, this is a lot of what polling is. You're looking at historical correlations between various demographic markers and vote choice, and then applying those to the people you ask who they're going to vote for. If you call only women, or only elderly people, then you'll adjust based on that. Finding representative samples of the electorate is a problem that continues to bedevil pollsters. Books can be (and are) written on how to optimize polling processes.
However, you might have realized what I'm leading to already: different demographic groups' preferences change over time. It's an extreme example, but if you were modeling US elections across a long stretch of history, it would be very important to account for the fact that African American voters were a reliable Republican voting bloc until they weren't (say pre-1960s, if not earlier) and became a reliable Democratic voting bloc. Even much smaller shifts than that matter a great deal given how close presidential elections can be. And that's without even getting into the fact that all of these polls have margins of error or confidence bands built in.
The main point here is that various demographic voter preferences change over time, and that makes polling accurately (and thus prediction) difficult.
So how do we deal with non-stationarity?
Well, keeping with the coin example, what we'll likely do is insert an additional variable, called a Dummy Variable. It's just a basic binary variable, typically coded as 0 or 1. In this case, we would check whether our coin is from the 55% era or the 50% era and use the dummy variable to express that. You can imagine it being 1 for the 55% coins and 0 for the 50% coins, which would tell our model which probability distribution to use.
Using dummy variables, or some other way of allowing the model to account for the fact that preferences change over time, is one way to deal with non-stationarity in models, if the data breaks down cleanly like above.
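As a concrete sketch of that fix, still using simulated coin flips (a simple linear probability model with an "old era" dummy; the data is simulated, not real), the regression recovers both the 50% baseline and the roughly 5-point bump for old-era coins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated flips: 1 = Heads. Old-era coins land Heads ~55% of the time.
old_flips = rng.binomial(1, 0.55, size=10_000)
new_flips = rng.binomial(1, 0.50, size=10_000)

heads  = np.concatenate([old_flips, new_flips])
is_old = np.concatenate([np.ones_like(old_flips), np.zeros_like(new_flips)])  # the dummy

# Linear probability model: P(heads) = intercept + beta_old * is_old.
X = np.column_stack([np.ones_like(is_old), is_old]).astype(float)
coef, *_ = np.linalg.lstsq(X, heads.astype(float), rcond=None)

print(f"baseline P(heads) for new coins: {coef[0]:.3f}")   # ~0.50
print(f"extra probability for old coins: {coef[1]:.3f}")   # ~0.05
```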
Unfortunately, I don’t think it would be so clean using US elections. So we are forced to move to a different method to account for the non-stationarity of demographics’ polling preferences.
Another way to deal with non-stationarity is to simply zoom in on the data so that you are covering a time period where the underlying process is stationary. Using the coin example, we would model the likelihood of Heads using only flips from after they moved to the 50% coin design.
Perfect, we're going to zoom in on more recent data. But deciding where the window should start is itself a difficult problem: should it be 1960 to the present, 2016 to the present, or some other period? In fact, part of the problem of figuring out where our data cut-off should be is related to our next problem.
Problem 2 - Low Sample Size with Low Frequency
I began this piece talking about how one of the defining features of modern society is how much data we collect. Generally speaking if you ask a statistician or data scientist they’ll tell you: more data is better. Even if data seems “useless” right now, it’s possible that it could be useful in the future so collect it anyways. There are actually quite a few techniques in statistics and modelling related to removing data that is useless, or even worse, damaging to predictions. In fact, in our AI age, the selling point of many models is how many parameters they use and how large their datasets are.
When it comes to presidential elections, we have had 59 since our country's founding. But saying our sample size (henceforth referred to as N) is 59 oversells the data quantity. I imagine that the insights we can mine from the 2020 election are likely much more valuable than those from the 1796 election, in which women, non-white men, and often even non-propertied men could not vote.
The problem with small sample sizes is simple: it makes it much harder to deduce general characteristics of the electorate. Here’s a good example: George W Bush (43) actually did well with Muslims in 2000 (there’s conflicting data, but some estimates have him with 40%+ of the Muslim vote). He also did quite well with Hispanics. Donald Trump in comparison did rather less well with both of those groups.
We only have a presidential election every 4 years, and are forced to draw sweeping conclusions from those results. One way to get more data for predicting presidential elections is to use results from other elections and try to establish the correlation between those elections and the presidential one. For instance, pollsters are generally going to recalibrate their models based on how midterm elections go. This gives us more data, but that data can also be misleading, as it will not correlate perfectly with the presidential election. You can imagine a particularly popular governor winning against a particularly bad opponent by a larger-than-expected margin. How do you account for that governor's popularity and their opponent's unpopularity? So this data is not useless, but it must be treated with some skepticism, in the sense of not assuming it will translate directly [3].
Ideally, from a pollster's perspective, if we had presidential elections once a week or so, we could likely predict elections with a high degree of confidence. We would be able to zoom in so that our data is stationary, and we would have enough instances to suss out any changes in a demographic's voting preferences, and how those translate between polls and actual ballot boxes, in real time.
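To put a rough number on that intuition: the uncertainty around anything we estimate from past elections, such as the average polling error, shrinks only with the square root of the sample size. A quick sketch, assuming an invented 2-point standard deviation for the per-election polling error:

```python
import numpy as np

# Suppose the poll-vs-result error in any one election has a standard
# deviation of about 2 points (an invented figure for illustration).
error_sd = 2.0

# 15 ≈ elections since the 1960s, 60 ≈ every election ever,
# 1,000 ≈ the hypothetical "election every week" world.
for n_elections in (15, 60, 1_000):
    # Standard error of the *average* error estimated from n elections.
    std_err = error_sd / np.sqrt(n_elections)
    print(f"N = {n_elections:>5}: ±{1.96 * std_err:.2f} points (95% interval on the average bias)")
```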
Unfortunately, again from the pollster’s perspective, we only have elections once every 4 years. So we get new election data and we update our models for the various race, age, gender, etc. splits. In particular, we will look at where our previous predictions failed and we’ll look to correct them. Alas, that leads us to the final error we’re covering today.
Problem 3 - Autocorrelated Errors
We’re almost at the end, and luckily I think this one might be the easiest to explain now that we’ve gotten into the nitty gritty above.
As I said before, we'll look at what our models predicted and then compare that to the actual election results. If we missed by underestimating Republican strength, then we'll likely adjust our polls so that future iterations capture the Republican part of the electorate better. It's exactly the same with Democrats; if we overestimated their vote share with the elderly, then in subsequent polls we'll assume the elderly are less likely to vote Democratic (than we previously assumed).
This means that our errors for our current prediction are likely to be the opposite of our errors the previous time due to over-correction. We saw this in 2016 when the polls underestimated Trump, and then 2017-2020 the polls (generally) overestimated Republican strength. It makes sense as pollsters were trying to correct their previous error, but ended up overdoing it.
How do we deal with this?
We use an Autoregressive model. If you're not familiar with this, no worries; the name is fairly self-explanatory. Auto in this case means self, and regressive means we're regressing on the series' own past values. So basically what we're doing is looking at where we had errors last time, and then assuming there will be a correction based on that in the current period. Over time we'll ideally have enough data (including errors) that our model can minimize how those errors affect our predictions.
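Here's a minimal sketch of the idea: an AR(1) model fit to a short, invented series of past polling errors by regressing each error on the one before it. Real forecasters do something considerably more elaborate, but the mechanics are the same.

```python
import numpy as np

# Invented polling errors (actual result minus final polling average) for a
# handful of past cycles. Positive = polls underestimated the Republican.
errors = np.array([1.0, -2.5, 3.0, -1.5, 2.0, -1.0, 1.5])

# AR(1): error_t ≈ alpha + phi * error_{t-1}. Fit alpha and phi by least squares.
X = np.column_stack([np.ones(len(errors) - 1), errors[:-1]])
y = errors[1:]
(alpha, phi), *_ = np.linalg.lstsq(X, y, rcond=None)

# A negative phi is the "over-correction" pattern: an error in one direction
# tends to be followed by an error in the other direction next cycle.
print(f"phi = {phi:.2f}")

# Crude forecast of the next cycle's error given the most recent one.
print(f"expected next error: {alpha + phi * errors[-1]:+.2f}")
```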
However, we run into a similar problem as to our demographic non-stationarity: we simply don’t have enough elections to have confidence that we’re properly correcting for the errors, even as we use our autoregressive model. As a result, it will be more difficult to accurately anticipate how large the errors will be, and how to correct for them.
This post in many ways could have been a single sentence: we don't have enough elections to reduce the margin of error in election polls as much as we would like to.
A Few More Problems to Think Of
We covered a lot here, but this is a topic that could be written about endlessly. Here are a few other things I didn't cover in depth that are also potential issues with polling.
Turnout. Even if we modeled perfectly how each demographic will vote, if we don't do a good job of predicting who will actually go to the polls and vote, then we'll be liable to have errors.
October Surprises. Does something happen right before the election that has the potential to shift voter preferences or turnout that wasn’t previously expected? A good example would be the James Comey letter on Hillary Clinton’s emails coming out right before the election in 2016. In fact, there were arguably two October Surprises in 2016 given the Access Hollywood tape that came out on Trump. What a crazy election!
Third Parties. For good reason they tend to receive less attention when calibrating models, but the breakdown of who votes Third Party and who ends up "returning home" to a major party, so to speak, can prove decisive in states won by a small margin.
Conclusions
What should we make of all this?
I recognize that a lot of what I said above makes it seem like predicting elections is a fool's errand. It may come as a surprise, then, that I think combining a bunch of polls together can actually give us a decent snapshot of how the election might play out. In fact, despite all the issues, even in elections like 2016 that greatly "defied the conventional wisdom," polling reflected that a tight election was expected. I recall seeing the Wisconsin polls narrowing consistently in the leadup to November. If you add in the fact that all of these polls have a built-in margin of error or confidence interval, they start to look less bad. We are accustomed to hearing that Trump is 3 points up, but to be more accurate, it ought to be expressed as "in the latest poll Trump is up 3 points, and we are confident that adding or subtracting 5 points to that +3 would cover almost every conceivable scenario, under some basic but data-justified assumptions."
That's a mouthful, but all these polls have what we call confidence intervals or margins of error. Those convey that what a poll actually predicts is a range of outcomes. Furthermore, that range won't cover literally every scenario, but rather, say, 95% of scenarios. And those "scenarios" are based on built-in assumptions, which try to apply all of the lessons we covered above, such as adjusting for the latest elections, trying to correct for prior errors, etc.
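For the curious, the textbook margin of error on a single poll comes from the standard error of a sample proportion. A quick sketch, assuming simple random sampling (which real polls only approximate; weighting and non-response widen the true uncertainty):

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% confidence interval for a single sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# A candidate polling at 50% in surveys of various (typical) sizes.
# Note: this is the uncertainty on one candidate's share; the uncertainty
# on the *gap* between two candidates is roughly twice as wide.
for n in (400, 800, 1500):
    print(f"n = {n:>4}: ±{margin_of_error(0.5, n) * 100:.1f} points")
```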
Once we see that these predictions are based on lots of assumptions and made with models we know have built-in shortcomings, it starts to seem kind of impressive that we're able to predict elections as well as we do.
That leads me to my position: there is a degree of uncertainty inherent to election forecasting. Someone "getting it right" might be a credit to their rigorous modeling and sound assumptions. Maybe they picked the right window of data, where voter preferences were relatively stable, and handled the corrections for past errors well. Or it might be luck. In either scenario, getting it right once is hardly a guarantee of getting it right in the future. In this sense, a modeler being open about their shortcomings and assumptions becomes a credit to them, in my mind.
In fact, I'll lay my cards out on the table here: the genesis of this post was in decent part the proliferation of amateurs building poll aggregation models. I have no problem with folks doing open-source work with data; this blog is based on that. However, I have seen far too many folks have one decent model or result and suddenly think it means they know how to win elections. To be very clear, modeling how the electorate will vote and knowing what could change those voting patterns are completely different questions, and success in one by no means guarantees success in the other. That goes back to one of our first rules: know what question you are trying to answer.
So that's it: making an accurate model of who will win the presidency, and of the breakdown of votes that leads to that result, is extremely hard. There are lots of inherent difficulties in this modeling, and at the end of the day the best we can do is acknowledge that these are models that inherently provide a range of outcomes based on assumptions. However, if we stay honest and humble, we also have the opportunity to provide a generally correct view of the election.
Footnotes

[1] In fact, it's not just correlation we want but rather predictivity. Two data series could be highly correlated with each other, but not predictive of each other. That said, this is meant to be a light introduction, and getting excessive with caveats can undermine that purpose. But if you're reading this, keep in mind that correlated data sets don't always lead to predictive insights.

[2] Technically it would be ŷ, or what is commonly called "y hat." Putting the little hat on top of any letter means that it's a predicted value. Again, for simplicity's sake I am going to refer to it simply as y, but you'll see this in statistics often.

[3] I used midterms as an example, but special elections and things of that nature go into it too. It's more data, which is good, but it's also data where you might have more trouble calibrating what weight it should be given. After all, you probably don't have much precedent for any given special election, even if the country is large enough that they're fairly regular.