Dec 17, 2010

Thanks, Google Labs! - Books nGram Viewer

(Updated January 24, 2011)
Google Labs has quietly introduced a new tool for linguists, historians, trend-followers and literary dilettantes with the ungainly name, "Books nGram Viewer."

The tool lets a user plot the frequency of words or phrases in Google's vast archive of scanned books from 1800 through 2008.  The vertical axis shows how often each word or "n-gram" occurs (a two-word phrase is a "2-gram," a three-word phrase is a "3-gram," and so on) as a percentage of all the n-grams in the corpus.
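To make the "percent of all n-grams" idea concrete, here is a minimal sketch of the calculation on a toy sentence.  The function names and the whitespace tokenizer are my own assumptions for illustration; Google's actual tokenization and counting are certainly more involved.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency(text, phrase):
    """Percent of all n-grams in `text` that equal `phrase`.

    n is taken from the number of words in `phrase`, so a two-word phrase
    is compared only against the other 2-grams in the text.
    """
    tokens = text.split()  # naive whitespace tokenizer (an assumption)
    n = len(phrase.split())
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return 100.0 * Counter(grams)[phrase] / len(grams)

sample = "the radio played while the radio host spoke"
print(relative_frequency(sample, "radio"))      # 2 of the 8 one-word n-grams
print(relative_frequency(sample, "the radio"))  # 2 of the 7 two-word n-grams
```

Note that a phrase is measured against n-grams of its own length, which is why the two percentages above use different denominators.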

Here is a plot of some of the most important media technologies of the last 150 years, smoothed with a 3-year moving average and restricted to books published in English.  The graph nicely illustrates the typical S-curve diffusion of technologies or, more accurately, of general awareness of those technologies.  Interestingly, the fastest ascending curves are those for "radio," "computer," and "Internet."
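For anyone curious what the smoothing does, here is a rough sketch of a centered moving average.  The function name, the edge handling (the window simply shrinks at the ends of the series) and the sample numbers are my assumptions; the viewer's own smoothing setting may define its window differently.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # first index inside the window
        hi = min(len(values), i + half + 1)   # one past the last index
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Made-up yearly frequencies that zigzag from year to year.
yearly = [1.0, 4.0, 1.0, 4.0, 1.0]
print(moving_average(yearly))
```

The zigzag in the input flattens out noticeably, which is the whole point: year-to-year noise is damped so the longer trend is easier to see.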

It's also interesting that "computer" seems to have peaked, while "Internet" is approaching the popularity of "television."

As useful and fun as this tool is, it currently has some important limitations that one must consider before drawing conclusions from its output.  For starters, Google's book-scanning project is a work in progress, so some "trends" may simply reflect the types of books that have been scanned for one period or another.  If the proportion of science books that have been scanned varies widely from one decade to the next, the frequency of technical terms and scientific subjects might appear to change even when actual usage has not.

Another limitation is that the n-gram search tool is case-sensitive, so "Internet" and "internet" return vastly different results.  It's a good idea to check both forms of capitalization.  In addition, the tool does not yet support wildcards or stem searching.  If I want to track the diffusion of the concept of the telephone, ideally I would want a search term like "telephon*," which would match "telephone," "telephones," "telephony," etc.
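Given a table of raw counts, both workarounds are easy to sketch.  The snippet below folds capitalization together and treats the search term as a prefix, so "telephon" behaves like the wished-for "telephon*."  The function name, the (word, year) dictionary layout and the counts are all invented for illustration.

```python
from collections import defaultdict

def combine_matching(counts, prefix):
    """Sum yearly counts over every word starting with `prefix`, ignoring case."""
    prefix = prefix.lower()
    totals = defaultdict(int)
    for (word, year), count in counts.items():
        if word.lower().startswith(prefix):
            totals[year] += count
    return dict(totals)

# Made-up counts keyed by (word, year), for illustration only.
counts = {
    ("telephone", 1900): 50,
    ("Telephone", 1900): 5,
    ("telephony", 1900): 3,
    ("telegraph", 1900): 40,
    ("telephones", 1910): 12,
}
print(combine_matching(counts, "telephon"))
```

A plain prefix match is cruder than real stemming (it would happily lump unrelated words that share a prefix), so the list of matched words is worth eyeballing before trusting the totals.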

The tool also lacks the ability to group synonyms.  The above graph would be more interesting if I could combine synonyms such as "wireless" and "radio" or "television" and "TV."  (Indeed, the decline in "telephone" is mirrored by a similar increase in "phone" beginning in the 1940s and accelerating after 1960.)  One hopes that Google will add this feature, or allow users to download the data underlying the graphs so they can manipulate it themselves (see Note 1).  Until this feature is added, the tool is most useful for tracking the evolution of synonyms over time, or for comparing the use of precise, technically defined terms.
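With the underlying numbers in hand, grouping synonyms would amount to adding their yearly series together.  Here is a minimal sketch; the function name, the nested-dictionary layout and every figure in it are hypothetical.

```python
def group_synonyms(series, groups):
    """Add the yearly series of each group's members into one series per label."""
    combined = {}
    for label, members in groups.items():
        totals = {}
        for term in members:
            # Terms with no data are skipped rather than treated as an error.
            for year, count in series.get(term, {}).items():
                totals[year] = totals.get(year, 0) + count
        combined[label] = totals
    return combined

# Made-up occurrences per million words, keyed by term and then year.
series = {
    "radio":      {1920: 10, 1930: 40},
    "wireless":   {1920: 30, 1930: 20},
    "television": {1930: 2, 1950: 60},
    "TV":         {1950: 25},
}
groups = {
    "radio+wireless": ["radio", "wireless"],
    "television+TV": ["television", "TV"],
}
print(group_synonyms(series, groups))
```

Adding the series is only meaningful when every term in a group is the same length of n-gram, since the frequencies then share a denominator.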

So in true Beavis & Butt-head tradition, I feel compelled to offer this plot below of "penis", "vagina" and "clitoris."

This one took me partially by surprise.  I expected that the clitoris might have received relatively scant attention compared to the primary sex organs, but I'm struck by the 80-year surge in relative popularity enjoyed by "vagina" in the mid-nineteenth century.

Any scholars of Victorian erotica out there who can shed some light on this one?


Update (January 14, 2011)

The Victorians are absolved.  Further analysis shows that the relative popularity of the vagina in the English-language corpus is almost entirely a colonial affair.

Here are the separate plots for British English and American English.

British

American

The scaling on the two graphs is very nearly the same (another feature request: user-defined scaling, please!), so clearly the higher frequency of "vagina" occurs mostly in the collection of U.S. English titles.  Interestingly, there is a noticeable, sharp spike in the relative frequency of "vagina" in the British corpus that begins almost immediately after Queen Victoria's death in 1901.

Note 1.
Google does make the entire dataset available for download, but it does not currently allow one to download the raw data associated with a specific search request.
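In the meantime, one can pull a single term's history out of the downloaded files by hand.  The sketch below assumes each line is tab-separated and begins with the n-gram, the year and the match count; that column layout, the function name and the sample rows are assumptions on my part, so check them against the dataset's own documentation before relying on this.

```python
import io

def yearly_counts(lines, term):
    """Collect {year: match_count} for rows whose n-gram equals `term` exactly.

    Assumes tab-separated rows whose first three fields are the n-gram,
    the year and the match count; any further columns are ignored.
    """
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == term:
            result[int(fields[1])] = int(fields[2])
    return result

# Made-up rows standing in for a real downloaded file.
sample = io.StringIO(
    "radio\t1925\t120\t80\t30\n"
    "radio\t1926\t150\t95\t41\n"
    "Radio\t1925\t15\t12\t9\n"
)
print(yearly_counts(sample, "radio"))
```

The match is exact and case-sensitive, just like the viewer, so "Radio" is skipped here.  With a real file, the same function should work on an open file handle in place of the `StringIO` object.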