Big Data, Big Risks: Google Search Data and Election Predictions

Rochelle Terman
October 31, 2012
[Image: diagram with arrows explaining how to use Google Trends.]

It’s that time of year again: the debates are over, months of campaigning have passed, and candidates have spent nearly every last contribution dollar on ads. With less than two weeks before the election, attention shifts from candidates, speeches, and debates to polls, electoral maps, and the strange forecasters who try to predict our next President.

Thus we’ve reached the point in campaign season when mainstream pundits step aside and the nerds of political forecasting relish their 15 minutes of fame. The king of this cadre is Nate Silver. His amazingly successful blog FiveThirtyEight has become the poster child for this kind of political and statistical wizardry.

But Silver focuses on polls conducted via interviews, which is a problem because people are reluctant to answer pollsters honestly about sensitive topics, such as whether or not they will vote. As many as two-thirds of the people who end up not voting tell pollsters that they will.

This is not just a problem with election data. Topics such as racism, drug dealing, and child abuse are extremely difficult to study in a systematic way because people are reluctant to talk about them. This leads to biased surveys and inaccurate measurements.

Economist Seth Stephens-Davidowitz is taking another approach to get around this problem: big data, or Google search data to be precise. We’ve discussed Google’s data-making abilities before. But despite its potential, says Stephens-Davidowitz, scholars have been slow to exploit it:

"Despite the ubiquity of Google searching, and searchers’ demonstrated willingness to share their true feelings and unbridled thoughts on Google, what Americans are typing when they search remains surprisingly underutilized in political analysis. But Google can often offer insights unavailable elsewhere."

Google search data is powerful first and foremost because of its sheer volume: almost everyone uses Google, and there is nothing to suggest a selection bias (for example, that Democrats are more likely to use Google and Republicans Bing or Yahoo).

Second, Google search data can organize searches spatially and temporally, giving us great insight into what folks were interested in last Saturday in Ohio. (The answer is football, always football.)

The problem that presents itself is the same one that plagues all statistical work: correlation is not causation, and big data cannot interpret itself.

For instance, Stephens-Davidowitz notes that the states in which Google searches for voting information (e.g. “how to vote”) were higher in 2008 than four years earlier were overwhelmingly states with some of the largest African-American populations, among them North Carolina, Georgia, and Mississippi. In other words, the higher the percentage of African Americans, the bigger the increase in voting-related searches. Stephens-Davidowitz concludes: Google search data would have made the unsurprising, and ultimately correct, prediction that black turnout was going to be substantially higher in 2008 than it was in 2004.

Indeed, this particular prediction happened to pan out, but anyone who has ever heard of “aggregation problems” would immediately balk at this conclusion. To see why, consider another study, this one of the 1968 election: relying strictly on aggregate data, social scientists found that the percentage of African Americans was the strongest correlate (or “predictor”) of the vote for George Wallace’s presidential candidacy at the congressional district level. In other words, the higher the percentage of African Americans, the more likely that district was to support George Wallace.

Clearly, African Americans were not rallying around George Wallace, proponent of segregation and enemy of civil rights. What this effect actually demonstrated was a “contextual effect, viz., the greater the concentration of blacks in a congressional district, the greater the propensity of whites in that district to vote for Wallace.” (Schoenberger and Segal 1971). So it was not that African Americans voted for Wallace, but rather that whites who lived near a high proportion of African Americans were more likely to support him. Similar critiques could be made of Google data, or at the very least they suggest caution in using this data to make predictions.
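The aggregation problem is easy to reproduce with a toy simulation. The sketch below is purely illustrative: every number in it is invented, and the function names (`simulate_districts`, `pearson`) are hypothetical helpers, not anything from Schoenberger and Segal or Stephens-Davidowitz. It assumes that no black voter supports the candidate and that white support rises with the black share of the district, then shows that the district-level correlation between black share and the candidate's vote still comes out strongly positive.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_districts(n_districts=200, seed=0):
    """Return (black_share, district_vote_share, black_support) per district.

    All parameters are made up for illustration only.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_districts):
        black_share = rng.uniform(0.0, 0.4)
        # Assumption: no black voter supports the candidate.
        black_support = 0.0
        # Assumption (contextual effect): white support grows with the
        # black share of the district, plus a little noise.
        white_support = 0.10 + 1.2 * black_share + rng.gauss(0, 0.02)
        white_support = min(1.0, max(0.0, white_support))
        # The aggregate vote share is all an analyst of district data sees.
        district_vote = (
            black_share * black_support + (1 - black_share) * white_support
        )
        rows.append((black_share, district_vote, black_support))
    return rows


if __name__ == "__main__":
    rows = simulate_districts()
    shares = [r[0] for r in rows]
    votes = [r[1] for r in rows]
    print("District-level correlation:", round(pearson(shares, votes), 2))
    print("Black support in every district:", {r[2] for r in rows})
```

The aggregate numbers alone cannot tell you who inside each district cast the votes, which is exactly the trap with state-level search data.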

Perhaps the biggest potential for Google search data is not its predictive quality but its suggestive quality, giving us insights into trends that we otherwise would not have noticed. To see this in action, simply go to Google Trends and play around with it.

To back up my Ohio-football conjecture, take a look at my Google Trends investigation:

Tell us what you find out using Google Trends!

Rochelle Terman is a Graduate Student Researcher at the Townsend Center for the Humanities. She is pursuing a Ph.D. in Political Science.