What Are Patterns For? Big Data and Its Discontents

Suzanne Scala
February 12, 2013

Around the end of each year, pundits play the parlor game of choosing a “word of the year.” Often, it’s some flash-in-the-pan phrase that seems to capture the zeitgeist. This year’s contenders included “YOLO” and “Eastwooding.” (Visit the Boston Globe for a roundup.) For Geoff Nunberg, Professor at UC Berkeley’s School of Information and pundit in his own right at NPR, that word was “big data.”

Indeed, in certain circles it’s hard to visit a web page without seeing a reference to this new buzzword. We’ve blogged about big data several times: on the occasion of the DataEDGE conference, on big data’s relationship to civil rights, and with regard to the election. Professor Marti Hearst, also at the School of Information, recently even taught a course called “Analyzing Big Data With Twitter.” While the term is difficult to define, we have previously written that it “describe[s] info sets so large and so complex that available database management tools cannot handle them.”

So why does all this number crunching interest people in the humanities? Big data is just the domain of statisticians and the people interested in their analyses, like sociologists and marketers, right?

Not so fast. Enter bona fide humanists like Franco Moretti and his team at the Stanford Literature Lab. Moretti argues that “close reading,” the traditional bread and butter of literary study, makes it impossible to see the big picture. How useful is reading, say, 100 Victorian texts closely when there were thousands and thousands published during that era? Better to let a computer read them, he says, and then analyze the results. Moretti is talking about mining the big data of Victorian novels for trends in the same way that marketers look at sales data to figure out how often people who buy peanut butter also buy jelly.

And Moretti isn’t the only scholar trading the bound tome for the keyboard. A recent New York Times article outlines this new literary trend. At Harvard, Jean-Baptiste Michel and Erez Lieberman Aiden are poking through Google Books to trace word use over time. Based on how frequently references to Freud appear, Michel and Aiden determine that, as the Harvard Gazette says, the famous psychoanalyst “is more deeply ingrained in our collective subconscious than ‘Galileo,’ ‘Darwin,’ or ‘Einstein.’” I suppose the subconscious is Freud’s turf, after all.
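For readers curious what this kind of counting looks like in practice, here is a minimal sketch in Python. It is not Michel and Aiden’s method or the Google Books tooling. The function name `mentions_per_million` and the two-year toy corpus are invented for illustration, and the tokenization is deliberately crude. The one idea it shows is that counts are divided by the total number of words for each year, so that a year with more printed text does not automatically look more Freud-obsessed.

```python
from collections import Counter

def mentions_per_million(corpus, names):
    """Return {name: {year: mentions per million words}}.

    corpus maps a year to a list of texts. Both the function and the
    data format are hypothetical, chosen only for this sketch.
    """
    rates = {name: {} for name in names}
    for year, texts in corpus.items():
        # Crude tokenization: split on whitespace, trim punctuation, lowercase.
        words = [w.strip('.,;:!?"\'()').lower()
                 for text in texts for w in text.split()]
        counts = Counter(words)
        total = len(words)
        for name in names:
            # Normalize by the year's total word count.
            rate = 1_000_000 * counts[name.lower()] / total if total else 0.0
            rates[name][year] = rate
    return rates

# Invented toy corpus, not real Google Books data.
toy_corpus = {
    1900: ["Darwin wrote on species.", "Galileo and Darwin are read widely."],
    1950: ["Freud on dreams; Freud on jokes.", "Einstein and Freud exchanged letters."],
}

rates = mentions_per_million(toy_corpus, ["Freud", "Darwin"])
for name, by_year in rates.items():
    print(name, by_year)
```

Even in a toy like this, the output is only a rate. Whether a higher rate means a larger place in the culture is exactly the question the code cannot answer.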

Unsurprisingly, literary critics’ use of big data has drawn the ire of close readers, traditionally proponents of, shall we say, “small data.” A close reader might notice, for instance, that the term “big data” is silly, since data is a mass noun. It’s like saying “big grass” if you’re talking about a lot of grass. Stanley Fish, a particularly celebrated close reader, takes issue with scholars like Moretti, arguing that the “data mining” approach to criticism takes away the ability of the critic to have a hypothesis.

Even in the realm of the “hard sciences,” researchers acknowledge the limitations of the computer when faced with complex literary texts. Padraig Mac Carron and Ralph Kenna, scientists at Coventry University, entered information on books like Beowulf and Les Misérables into computers by hand in order to study the social networks created in the texts. According to Inside Science, they found that entering the data in this way was “more effective” because a computer would have a hard time distinguishing between “friendly” and “unfriendly” relationships. While the machine can find a relationship, it takes a human reader to understand the meaning of it.
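To make the division of labor concrete, here is a small sketch in Python of a hand-labeled character network. It is an illustration only, not Mac Carron and Kenna’s data or code. The handful of Beowulf relationships, the labels, and the function name `degree_by_kind` are all assumptions made for this example. The point is that the “friendly” or “unfriendly” tag on each link is typed in by a human reader, and the computer’s contribution is the tallying.

```python
from collections import defaultdict

# Hypothetical hand-coded links: (character, character, kind).
# A human reader supplies the "kind" label. The program never infers it.
edges = [
    ("Beowulf", "Hrothgar", "friendly"),
    ("Beowulf", "Grendel", "unfriendly"),
    ("Beowulf", "Wiglaf", "friendly"),
    ("Hrothgar", "Grendel", "unfriendly"),
]

def degree_by_kind(edges):
    """Count each character's friendly and unfriendly links."""
    degrees = defaultdict(lambda: {"friendly": 0, "unfriendly": 0})
    for a, b, kind in edges:
        # Each link counts once for both characters it connects.
        degrees[a][kind] += 1
        degrees[b][kind] += 1
    return dict(degrees)

degrees = degree_by_kind(edges)
for character, counts in degrees.items():
    print(character, counts)
```

Strip the labels out and Grendel looks like any other well-connected character, which is the researchers’ worry in miniature.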

Fish, Mac Carron and Kenna are pointing to a major criticism of “big data” methods in general: how can we determine the value of the trends the computer can reveal to us? Is it important that the Gothic writers use the word “the” more often than people from other periods, or is it just a meaningless quirk? Do more frequent references to Freud in the Google Books archives necessarily mean that he has a larger place in our culture than Darwin? As Nunberg concludes on NPR, even with these new analytic tools, people in all domains “will still feel the need to sort out the causes from the correlations—still asking the old question, what are patterns for?”


Suzanne Scala is a Graduate Student Researcher at the Townsend Center for the Humanities. She is pursuing a Ph.D. in Comparative Literature.