Big Data's Big Horizons

[Image: three-dimensional towers of green ones and zeroes]

Two weeks ago the UC Berkeley School of Information hosted the DataEDGE Conference, organized around a question very much on minds right now: "How will your business or organization face the challenges and opportunities of Big Data?" Put another way, the conference took up the numerous attendant issues raised by the specter of Big Data: "Understanding how to manage the transition to a data-intensive economy will make or break companies, careers, and fortunes. As the web grows increasingly dense, the information about our connections accumulates logarithmically. Analysis of this data can reveal who we are, what we like, and how we act."

To put this in perspective, "Big Data" is a loosely defined term. Generally, it describes data sets so large and so complex that available database management tools cannot handle them. The difficulties Big Data presents include capturing, storing, and sharing such large collections; even once securely stored, Big Data presents further complications in searching, visualization, and analysis.

Despite these problems, numerous areas of inquiry have trended toward larger data sets. This is primarily because analyzing a single, large set of related data yields more information than analyzing separate, smaller sets containing the same total amount of data. Larger data sets allow for a wider perspective and, with it, greater insight into trends and patterns.

At the DataEdge Conference, I had the pleasure of attending a panel titled "Size Matters: Big Data, New Vistas in the Humanities and Social Sciences." The discussants were Mark Liberman, Professor of Linguistics and Computer & Information Science at the University of Pennsylvania, Geoff Nunberg, Adjunct Professor at UC Berkeley's School of Information, and Matthew Salganik, Assistant Professor of Sociology at Princeton University.

The panel covered numerous interesting concepts, and, though it leaned heavily toward the social sciences, several points were made with clear implications for the Humanities.

Discussing Big Data's value, both Professor Liberman and Professor Salganik advanced the claim that it is not inherently useful, but that what it makes possible can be. In other words, Big Data will not replace other fields of inquiry and thought, but, if integrated with them, it can strengthen them. Larger and richer data sets will, for example, allow researchers to better anticipate the patterns or needs of a population during a flu outbreak, but such analyses will always be, at best, predictive.

Professor Salganik said, "I think the 'bigness' of Big Data is not the most exciting thing to me; I think it's also that the data is better than the data we had before--at least for the problems Sociologists are interested in." Why? For one, "since it's being recorded automatically, it's often longitudinal [as opposed to cross-sectional]." And, "It is also exciting because it gives us [Sociologists] new questions to ask."

To illustrate these points, Professor Salganik discussed research being done by one of his graduate students. "One of the things Sociologists have studied for a long time," he explained, "is residential segregation: Where is it that people live? How can you measure how segregated cities are? How does that change over time? What are the consequences of this for individuals?" This has been an interesting body of work, Salganik claimed, but, with Big Data, we can ask new questions that further expand and shift these inquiries. Studies of residential segregation arose, he explained, because we had information about where people live. But now, with the new possibilities of Big Data, "we can have data on where people are, so we can study segregation in space and time." In short, "The ability to do new kinds of measurements raises new kinds of questions."
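
For readers curious what such a measurement looks like in practice, one standard summary in this literature is the index of dissimilarity, which captures how unevenly two groups are spread across a city's neighborhoods. The short sketch below is not code from the panel, and the tract counts are invented purely for illustration; it simply shows how the measure might be computed.

```python
# Index of dissimilarity: a classic measure of residential segregation.
# D = 0.5 * sum_i |a_i/A - b_i/B|, where a_i and b_i are the counts of two
# groups in tract i, and A and B are the citywide totals for each group.

def dissimilarity_index(group_a, group_b):
    total_a, total_b = sum(group_a), sum(group_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b)
        for a, b in zip(group_a, group_b)
    )

# Hypothetical counts of two groups across five census tracts.
tracts_a = [900, 750, 120, 60, 40]
tracts_b = [100, 250, 880, 940, 960]

# 0 means the groups are evenly mixed; 1 means complete segregation.
print(f"D = {dissimilarity_index(tracts_a, tracts_b):.2f}")
```

Salganik's point is that location traces recorded automatically over time would let the same kind of index be recomputed hour by hour rather than once per census, turning a static picture of where people live into a moving picture of where people are.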

Indeed, any number of developments in the Digital Humanities deal with just this capability of larger data sets. Professor Nunberg, discussing Google Books in order to bring up "the Humanities side," explained, "I'm interested in the way vocabularies emerge. A whole set of words will emerge at the same time and increase in frequency together." To show this, he explained how, "In the 1970s, you got all these words taken from marketing and applied to social class. 'Demographic,' which was this obscure, technical adjective that had been in the language since 1880, suddenly shoots up in frequency--and 'lifestyle' and 'yuppie' and...the plural, 'demographics,' appears...If you plot it in Google Books, you can see all of these words increasing in frequency at the same rate and look at the unit of analysis as vocabulary instead of words."
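
To make this concrete, here is a minimal sketch of the kind of plot Nunberg describes. It assumes a hypothetical local file, ngram_frequencies.csv, holding year-by-year relative frequencies for the words in question (for instance, exported by hand from the Google Books Ngram Viewer); the file name and column layout are assumptions, not part of anything shown at the panel.

```python
# Plot several words' relative frequencies on one set of axes, so the whole
# vocabulary cluster, rather than any single word, becomes the unit of analysis.
# Assumed CSV layout: year, demographic, demographics, lifestyle, yuppie
import pandas as pd
import matplotlib.pyplot as plt

freqs = pd.read_csv("ngram_frequencies.csv", index_col="year")

for word in freqs.columns:
    plt.plot(freqs.index, freqs[word], label=word)

plt.xlabel("Year")
plt.ylabel("Relative frequency in Google Books")
plt.legend()
plt.title("A marketing vocabulary emerging together")
plt.show()
```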

In this way, Big Data gives Humanists the ability to glean new information and new insights from larger corpora of work. Nunberg went on to mention the work being done by Franco Moretti and other scholars at Stanford's Literary Lab. These scholars are asking new questions, only possible because of Big Data, that have been grouped under the umbrella term "distant reading."

This is not to say that the era of close reading is over. Big Data, as one can imagine, requires big investments. When working with terabytes of information (or more), the costs of storing and analyzing it can be staggering. And, given how new inquiry into such large data sets is, there are risks attached to any such endeavor.

But, despite all of this, the consensus was that Big Data holds the potential for big rewards, and we are just beginning to see its effects.


Sanjay Hukku is a Graduate Student Researcher at the Townsend Center for the Humanities.