Down the Data Mine: Digging Much Deeper into Google Books

By Jeff Rogers

Image showing data crunch technology being used to research terminology describing beautiful women since the 1810s.

Looking at the massive data bank that is the Google Books digital collection, Professor Mark Davies (Linguistics, Brigham Young University) saw an infinitely rich but analytically under-served resource, and the recent release of his new Google Books: American English interface for performing an impressive array of linguistic searches on that 155 billion word data bank will do much to enable the sort of rich social and cultural linguistic analysis that Davies envisioned. The interface is hosted on the BYU web and is not itself a Google product. On the contrary, the powerful Google Books: American English interface represents the liberation of a Google product.

By way of introduction, Davies writes on the site that "just because we have 'issues' with the simplistic Google Books interface doesn't mean that we don't really appreciate everything else that Google Books has done in creating the underlying data." Indeed, Davies might be presumed to have a very keen appreciation for the Google Books project in that he has first recognized and, with the creation of the new interface at BYU, now tapped into the real potential of the data set. What he has created, essentially, is a tool that allows for much more varied and advanced searches than does the original and more simple Google Books interface.

The site itself offers an illustrative "five minute" guided tour to introduce users to the capabilities of the interface, but some of the key relative strengths of the tool are easily enumerated (Davies offers a longer explanation here). The standard Google Books interface is an excellent tool for turning up books containing a word or exact phrase of interest and for settling the sort of arbitrary argument (over the historical frequency of the word "avast" in print, for example) that a pair of etymologists or historians might be imagined to have in a pub. But if that argument were to be more complicated or subtle or if the sparring scholars wanted to do something outside of Google Books with the data returned by a query, they would find themselves frustrated with the simple interface.

This is where the Davies-designed version steps in. The new interface allows for complex custom queries identifying terms as specified parts of speech, for example, or as synonyms or as used in proximity to other specified terms. This flexibility opens a window onto whole new types of analysis, letting users track patterns of usage and seeing how these have evolved over time. The new interface also enables historical comparison queries--defined anywhere between 1810 and 2009, allowing, for example, a comparison of adjectives used to describe music in two distinct periods (pre- and post-Elvis, say).

What's more, Davies has plenty of expansions coming down the pipeline for a project that already offers analytical functionality going well beyond what was previously possible with Google Books and Culturomics. Kudos to Davies for his excellent work, and here's to the future of large-scale linguistic data mining and what promises to be a bumper crop of interesting findings.

Multimedia

Down the Data Mine: Digging Much Deeper into Google Books

Down the Data Mine: Digging Much Deeper into Google Books

TOWNSEND CENTER FOR THE HUMANITIES