If you zoom in on the front page of the New York Times from Friday, October 30, 1896, you can find the latest wagers on the upcoming election: “BETTING MORE LIVELY.; Col. Swords Placed $3,000 to $1,000 on McKinley.” In other words, good money had McKinley at 75% to win the upcoming election (as you no doubt know, he did win). We cannot speak to how salient gambling odds were to readers of the New York Times in 1896, or to how numerate those readers were, but we are confident that such odds would confuse most readers in 2020.
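As a quick check on that arithmetic, here is a minimal Python sketch that converts a wager into an implied probability. It assumes a fair two-sided bet with no bookmaker margin, and the function name implied_probability is ours, chosen only for illustration.

```python
from fractions import Fraction


def implied_probability(stake_for: int, stake_against: int) -> Fraction:
    """Probability implied by risking stake_for to win stake_against.

    Assumes a fair bet: the bettor breaks even on average.
    """
    return Fraction(stake_for, stake_for + stake_against)


# Col. Swords placed $3,000 to $1,000 on McKinley.
mckinley = implied_probability(3000, 1000)
print(f"{float(mckinley):.0%}")  # 75%
```

Put differently, odds of 3 to 1 in favor mean the bettor thinks the event happens three times out of four.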

A lot happened over the next 124 years of election forecasting. Gambling evolved from wagers to set prices on the main financial exchanges, back to wagers, and then finally onto legal (and less legal) prediction market exchanges. We can embarrass ourselves by highlighting the work of one of the authors of this piece, David Rothschild, who spent a lot of time understanding how to convert prediction market prices into clearly digestible probabilities. Here is a screenshot of his site, PredictWise, in 2016. You will notice that it peaked at 89% for Clinton on the night before the 2016 election: she lost. More importantly for this piece, PredictWise focused on tables and charts to highlight the probability of an event occurring. This is straightforward for someone who understands probabilities, but does not do much to convey the true value of 89% or 11%.

In 2020 there are two main prediction teams fighting it out with their mixed models of fundamentals (i.e., models built on economics, incumbency, etc.) and polling data. Nate Silver, ABC pundit and FiveThirtyEight publisher, released his forecast to great fanfare on August 14: it had Biden at 71% to win. Notice how Silver moved past gambling odds (3 to 1), past tables and charts (both on the site, but not the leading visualization), and into probabilities designed to look like frequencies (where 100 dots equal 100 random draws of possible outcomes). While there are not actually 100 versions of the world, Silver displays 100 versions to provide a salient illustration for viewers.

Silver came from the world of baseball forecasting, where a lot of data, analyzed with methods built on the ground-breaking work of Bill James, makes it relatively easy to predict outcomes. You can see how these probabilities have been incorporated into baseball games in this screenshot from ABC’s sister network ESPN, which uses probabilities that update all game long.

On the other hand, The Economist’s model, backed by Andrew Gelman and G. Elliott Morris, allows the user to toggle between a histogram and a frequency plot, which provides a graphical experience similar to Silver’s. Though not fully germane to this piece, The Economist’s model is a purer statistical model based on historical outcomes, while Silver added a lot of artificial noise in order to reach a lower starting probability for Biden. Gelman et al. 2020 provides a solid breakdown of the differences. Note that on the day the Silver model launched at 71% for Biden, The Economist’s model was at 87%.

You will notice that PredictWise and David Rothschild are not really in this game anymore. That 89% at the end for Clinton does not bother any of us that much: we don’t know the true underlying probability of Clinton winning the election under various circumstances. Besides, David can still brag about this prediction with Gelman that Trump would carry Florida by one, and about his MSN and PredictWise polls that showed Trump carrying the upper Midwest.

Instead, it is horse-race coverage in general that bothers all of us.

Is horse-race coverage a substitute for, or a complement to, substantive news about politics and the election? It is possible that horse-race coverage excites people about politics, driving them to learn more. But, given the very low amount of news that most Americans consume, it more likely substitutes for more substantive news. Even worse, horse-race coverage is not a harmless diversion: it may actually demotivate some voters if the race is not expected to be close. So, what is next?

First, David and his colleagues (and co-authors of this piece) at MSR and PredictWise now focus more on polling and passive data (like application or location data), as these help stakeholders reallocate resources. The occasional tweet will still show where their horse-race polling stands, especially when it is quite different from the general consensus (not too many other people saw Biden competing in Georgia, Ohio, Iowa, and Texas in July).

Second, we would rather use polling data for a much more important task: showing where there is a big disconnect between elites and the general public. From 2010 through 2016, that meant trying to get anyone to understand how popular universal healthcare was. Now it means trying to get elites to understand how popular investment in (and regulation of) green technology is.

This piece is not about 2020: it is about 2022 and beyond. We have a dream to radically transform how people consider the probability of outcomes, by reminding everyone that these probabilities are not deterministic. Elections are not a spectator sport. The estimated outcome of the election depends on the work that you do, or do not do, in the days, weeks, and months leading up to the end of the election.

We have created a fully interactive experience where the probability is not noted in a table or chart, or visualized as a frequency, but experienced: as you shift the outcomes for various demographics, the overall outcome changes.

Please note that the data below are not a prediction and do not represent our best estimate of the current or future vote; they come from a lower-powered survey used for demonstration purposes.

[Interactive Electoral Map: “2020 US Presidential Election Forecasted by You.” Control: “Decide Voter Turnout”; drag the bars for Biden and Trump vote share to change the outcome of the 2020 election. Selection shown: National, 100.0% of all voters. Electoral votes by state are ordered by Democratic vote share.]

We start with a map. This map is like other visualizations published before the election with an estimate of the vote share in each state and colors shaded to show who is the likely winner.

Here we hover over Texas to see Biden at 51.1% of the two-party vote share. We have a version of this map that starts from the 2016 election, but the data here are taken from a recent national poll. We do not have probabilities built into this map; instead, we allow the user to explore the possible outcomes.

We allow users to select any combination of the 6 non-geographical demographics (Race, Sex, Education, Age, Urbanicity, and Party) and change that group’s voter turnout and two-party vote share. This could look like selecting "White Women" and increasing the percentage that votes for Trump until the map flips to Trump. Note that as the percentage of white women voting for Trump increases, Trump’s percentages increase across the country.

For 2022 we will have a map like this that captures all 435 House districts and all Senate races. We will allow users to explore various versions of maps as the once-per-decade redistricting takes hold. The goal is threefold: to push the boundaries of academic research in mapping election outcomes, to explore unique interactive data visualizations that remind voters that politics is not a spectator sport but something in which they are participating, and to create open source tools that both journalists and practitioners can use to educate and allocate resources.

Go on and explore!

Methodology

The algorithm for reallocation works as follows:

For a given demographic combination, we compute the 'national average estimate' by taking the weighted average of each vote category (Not Voting, Democrat, Republican) across each sub-demographic. For example, the average of “white” will be the weighted average of “white men” and “white women”. We display these percentages to the user rounded to one decimal place. This is the value we start with before the user does anything. We determine weights from a poststratification file that is derived from the voter file, to reflect the probable size of any given demographic combination in any given geography.
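The weighted average described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the Cell structure, its field names, and every number below are invented for the example, and the weights stand in for cell sizes taken from a poststratification file.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One sub-demographic cell; the field names are hypothetical."""
    name: str
    weight: float      # size of the cell, e.g. from a poststratification file
    not_voting: float  # share of the cell in each vote category (sums to 1)
    democrat: float
    republican: float


def national_average(cells):
    """Weighted average of each vote category across sub-demographics."""
    total = sum(c.weight for c in cells)
    return {
        category: sum(c.weight * getattr(c, category) for c in cells) / total
        for category in ("not_voting", "democrat", "republican")
    }


# Invented numbers for illustration only; these are not survey results.
white = [
    Cell("white men", 48.0, 0.40, 0.22, 0.38),
    Cell("white women", 52.0, 0.35, 0.36, 0.29),
]
average = national_average(white)
for category, share in average.items():
    # Shown to the user rounded to one decimal place.
    print(f"{category}: {100 * share:.1f}%")
```

Because the three categories sum to 1 within every cell, they also sum to 1 in the weighted average, so the displayed starting values always account for the whole demographic.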

A user can then change the two-party vote share until 100% of the voting demographic is accounted for by either party. When a user reduces the vote share of a candidate, we assign those percentage points to "Not Voting" rather than to the other candidate.
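A minimal sketch of the reduction rule, assuming shares are stored as fractions in a dictionary with the hypothetical keys not_voting, democrat, and republican. It only covers lowering a candidate's share, since that is the case spelled out above; the starting numbers are invented.

```python
def reduce_vote_share(shares, party, new_share):
    """Lower one party's share and move the freed points to 'not_voting'.

    The other party's share is left untouched.
    """
    if party not in ("democrat", "republican"):
        raise ValueError("party must be 'democrat' or 'republican'")
    if not 0.0 <= new_share <= shares[party]:
        raise ValueError("this sketch only handles reductions")
    updated = dict(shares)  # do not mutate the caller's data
    updated["not_voting"] += shares[party] - new_share
    updated[party] = new_share
    return updated


# Invented starting shares for one demographic; not survey results.
before = {"not_voting": 0.35, "democrat": 0.36, "republican": 0.29}
after = reduce_vote_share(before, "republican", 0.20)
print(after)
```

The design choice here is that a voter who leaves one candidate is treated as staying home, not as switching sides, so the other candidate's share of the demographic does not change.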

We then take the difference between what the user creates for the vote for a given demographic category and the national average estimate. For example, if “white women” starts at 29% for Trump, but a user moves it up to 50%, we need to find an extra 21 percentage points for Trump in this demographic cluster. We uniformly add that difference to the vote shares of each of the sub-demographics of “white women”. We do not allow a vote share to fall below 0% or exceed 100%. Clipping at those bounds can leave part of the difference unallocated, so we repeat this process until the difference between the allocated vote share and the computed national average is below a threshold of 1e-5. We then compute which party received the majority of votes in each state. If Trump wins 50% or more of the votes in a state, we assign the state's Electoral College votes to Trump.
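The loop described above can be sketched as follows. This is a simplified illustration, not the code behind the map: it tracks a single candidate's share per sub-demographic, the function names, weights, and shares are invented, and the winner rule is assumed to apply to the two-party vote, with ties going to Trump as stated.

```python
def reallocate_uniform(shares, weights, target, tol=1e-5, max_iter=1000):
    """Shift every sub-demographic share by the same amount until the
    weighted average matches target, clipping each share to [0, 1]."""
    shares = list(shares)
    total = sum(weights)
    for _ in range(max_iter):
        average = sum(w * s for w, s in zip(weights, shares)) / total
        diff = target - average
        if abs(diff) < tol:
            break
        # Clipping can leave part of diff unallocated, hence the loop.
        shares = [min(1.0, max(0.0, s + diff)) for s in shares]
    return shares


def state_winner(trump_votes, biden_votes):
    """Trump takes the state with 50% or more of the two-party vote."""
    return "Trump" if trump_votes >= biden_votes else "Biden"


# Invented Trump shares for three sub-groups of "white women",
# chosen so that the weighted average starts at 29%.
weights = [30.0, 50.0, 20.0]
start = [0.20, 0.30, 0.40]
shifted = reallocate_uniform(start, weights, target=0.50)
print([round(s, 2) for s in shifted])  # [0.41, 0.51, 0.61]
print(state_winner(trump_votes=1000, biden_votes=1000))  # Trump
```

When no share hits a bound, a single pass is enough, because adding the same amount to every sub-group raises the weighted average by exactly that amount. The extra passes only matter once some sub-group is pinned at 0% or 100%.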

We currently allow users to aggregate shifts across multiple demographic groups and see how the map changes. For example, a user can shift "white women" and then "suburban Hispanic men" and both changes would be reflected in the map.

Future Work

So why is this about 2022 and beyond? We are going to work hard over the next few months to launch a version of this map for the 2022 congressional elections. We want you to be able to understand not just what needs to happen for various parties to win, but also how gerrymandering and redistricting affect how the US is represented. We want to create more stories like this that illustrate how these choices evolve and affect you.

We are looking for computational social scientists interested in joining us on this journey.

  1. Candidate Choices: We currently treat a vote for a third-party candidate as not voting. This could be extended to include third-party candidates, but there is no competitive third-party vote in the 2020 election.
  2. Geographic Choices: We do not include Maine's and Nebraska's separate EC votes or, obviously, any other sub-state districts.
  3. Model: We model vote intention based on demographic groups. We do not intend to claim that individuals vote solely according to their identity categories. We do not include state correlations directly in the model; instead, these correlations are learned through the demographic similarity of states by the multilevel regression with poststratification (MRP) algorithm. We could certainly add complexity, such as fundamental factors, at some point in the future, but the current model is reasonable.
  4. Probability: We allow users to select "improbable" allocations (like 0% turnout or 100% turnout). In the future, we will provide visual cues to the probability of the user's choices.
  5. Path Dependence 1: If we move the vote share of a demographic group all the way to 1 or 0, then all sub-groups will have a vote share of 1 or 0. Unless other changes are made, all sub-groups will continue to have equal vote share (we will have lost the information that each sub-group has a different relative rate). We want to build in path independence as much as possible.
  6. Path Dependence 2: Changes to demographics are not always symmetrical. For example, changing white and then women should have the same effect on white women as changing women and then white. The two are equal only if no sub-group hits the 0% or 100% constraints.
  7. Uniform Allocation: We uniformly allocate the difference in allocated vote share to each sub-demographic. To be clear: when white women increases for Biden, we increase each sub-group, like white urban women and white suburban women, uniformly. Alternative options include allocating according to "greatest mass" (i.e., sub-demographics that are more likely to vote Democrat would receive more of the new allocation of votes) or according to "swing groups" (i.e., sub-demographics that have a closer race, nearer a 50/50 split, would receive more of the new allocation of votes).
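To make the allocation options in the last item concrete, here is one possible way to sketch them in Python. Everything in it is an assumption made for illustration: we read "greatest mass" as weighting each sub-group by its current Democratic share and "swing groups" as weighting by closeness to a 50/50 split, and the function names, weights, and shares are invented. Other readings of those two options are equally plausible.

```python
def reallocate(shares, weights, target, score, tol=1e-5, max_iter=1000):
    """Move the weighted average of shares to target, giving each
    sub-group a slice of the shift proportional to score(share)."""
    shares = list(shares)
    total = sum(weights)
    for _ in range(max_iter):
        average = sum(w * s for w, s in zip(weights, shares)) / total
        diff = target - average
        if abs(diff) < tol:
            break
        scores = [score(s) for s in shares]
        mean_score = sum(w * k for w, k in zip(weights, scores)) / total
        if mean_score <= 0:
            break  # no sub-group is eligible to absorb the shift
        # Scaling by score / mean_score keeps the average shift equal to diff.
        shares = [min(1.0, max(0.0, s + diff * k / mean_score))
                  for s, k in zip(shares, scores)]
    return shares


def uniform(share):
    return 1.0  # every sub-group moves by the same amount


def greatest_mass(share):
    return share  # more Democratic sub-groups absorb more


def swing(share):
    return 1.0 - 2.0 * abs(share - 0.5)  # closest to 50/50 absorbs more


# Invented Democratic shares for three sub-groups; not survey results.
weights = [30.0, 50.0, 20.0]
start = [0.10, 0.45, 0.80]
for rule in (uniform, greatest_mass, swing):
    moved = reallocate(start, weights, target=0.50, score=rule)
    print(rule.__name__, [round(s, 3) for s in moved])
```

All three rules reach the same overall average but distribute it differently: the uniform rule moves every sub-group equally, the greatest-mass rule moves the most Democratic sub-group furthest, and the swing rule moves the sub-group nearest 50/50 furthest.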