What Can Culturomics Do for the Humanities?

Image of a computer with the text, This is not a book, written in French.

Last Thursday, Google launched a powerful new tool: the n-gram viewer, which allows you to search a dataset of the more than 500 billion words contained in roughly 5.2 million of Google's digitized books published in Chinese, English, French, German, Hebrew, Russian, and Spanish between 1500 and 2008. The data can also be downloaded here, which means that you can write your own programs and manipulate it in more complex ways than the n-gram viewer allows. 

Google has digitized more than fifteen million books, a number that, according to Science, represents about 12% of all published books. Those selected for inclusion in the n-grams dataset were chosen for the quality of their optical character recognition (OCR) and of their metadata. The data is strongest for books in English published between 1800 and 2000.

The breathtaking new tool for scholars, which will allow everyone to dabble in "distant reading," is the subject of the third article in New York Times writer Patricia Cohen's "Humanities 2.0" series, which the Townsend Humanities Lab has been following with great interest.

Accompanying last Thursday's n-gram viewer launch was the Science article "Quantitative Analysis of Culture Using Millions of Digitized Books," co-authored by Jean-Baptiste Michel and other researchers based mostly at Harvard and Google, which heralded the birth of a new field of research enabled by the dataset: "Culturomics." In the article, Michel et al. outline some preliminary findings in Culturomics: they demonstrate the increasingly short life of fame, and chart censorship and repression by tracing precipitous drops in the use of some proper names. "Marc Chagall," for example, virtually disappears from the German corpus during the Nazi period.

Another noteworthy discovery set out in the Science article is the vast amount of "lexical dark matter": even after excluding proper nouns, the researchers found that more than half of the words contained in the English corpus do not appear in any published dictionary.

This discovery may be suggestive not of the undiscovered riches of the English language, but of a weakness in the dataset. Natalie Binder, a graduate student in Information Studies at Florida State University who offers some initial criticism of the project on The Binder Blog, points out that computers often can't recognize the difference between, for example, an "rn" and an "m." This means that the dataset is riddled with errors, as the search for “beft” and “best” conducted by Alexis Madrigal of the Atlantic demonstrates: Until about 1800, "beft" is extremely common, while "best" is surprisingly rare; after 1800, the reverse is true. Books printed before about 1800 commonly used the long s (ſ), which OCR software tends to read as an "f." Madrigal's findings suggest not some change in the superlative form, but rather OCR's limited understanding of the history of typography...
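For readers who download the data, one rough way to compensate for the "beft"/"best" problem is to fold the likely misreading back into the modern spelling before comparing counts. The Python sketch below is only an illustration: the function names are hypothetical, the rule (every "s" except a word-final one becomes "f") is a simplification of how the long s was actually printed, and the counts are invented toy numbers, not values from the Google dataset.

```python
def long_s_misreading(word):
    """Return the spelling an OCR engine might produce if it read every
    long s as 'f'. Simplifying assumption: every lowercase 's' except a
    word-final one was printed as a long s."""
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last


def merged_count(counts, word):
    """Add the count of the likely OCR misreading to the count of word."""
    variant = long_s_misreading(word)
    total = counts.get(word, 0)
    if variant != word:
        total += counts.get(variant, 0)
    return total


# Invented toy counts for a single pre-1800 year (NOT real data).
toy_counts = {"best": 3, "beft": 40}

print(long_s_misreading("best"))
print(merged_count(toy_counts, "best"))
```

A merge like this will also sweep in genuine words that happen to match a variant (a real "f" word that looks like a misread "s" word), so any result would need checking against the page images, which the dataset does not provide.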

A second major weakness of the tool--that there is no way to get from distant to close reading--is also the condition of its possibility: Google is currently facing a class-action lawsuit by writers and publishers who claim that the dataset represents a copyright infringement. Google's defense rests on the fact that it is not releasing the full text of any book, but only derivative n-grams, or strings of words. (A 1-gram is one word; a 5-gram--the longest "gram" in the dataset--is a five-word string, like "The United States of America.")
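To make the idea of an n-gram concrete, here is a minimal Python sketch that lists every 1-gram through 5-gram in the example phrase. It assumes words are simply separated by spaces; how Google actually splits text into words may differ, and the function name is only illustrative.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order.
    Assumes plain whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "The United States of America"

# Print the 1-grams, 2-grams, ... up to the single 5-gram.
for n in range(1, 6):
    print(n, ngrams(phrase, n))
```

Because each string is cut loose from the page it came from, a list like this cannot be reassembled into the original book, which is why the tool stops at distant reading.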

On his blog, Dan Cohen, Associate Professor of History and Art History at George Mason University, also reviews the new tool, and the nascent field of Culturomics--offering careful enthusiasm for the former, and the observation that the latter sounds an awful lot like an 80s new wave band... Regardless of the weaknesses in the dataset, the n-grams search engine is an exciting new tool for scholars, and it will be interesting to follow the findings it enables--and whether they indeed develop into a new field of "Culturomics."

Image Credit: "This is not a book" by Natalie Binder. Source: The Binder Blog.