Google maintains a multilingual database of published language. By scanning books en masse, Google can process their text and report how frequently each word appears. With the Google Ngram Viewer search tool, you can search through that voluminous statistical data rapidly and effectively. By comparing the relative popularity of words, you can map how language and culture have changed over time. The Ngram Viewer can do much more than simply report word frequency within Google’s vast textual corpus, however.
Basic Searches (1-grams)
1. Type your keyword in the Ngram search box.
2. If you want to search for all capitalizations of a word, tick the “case-insensitive” box. A search for “pizza,” for example, would then return both “pizza” and “Pizza” in the results.
3. Set the search parameters beneath the search box. This includes the date range and the language corpus.
The date range simply sets the limits of your graph’s X-axis. The maximum and minimum dates available vary widely depending on the corpus you select.
The smoothing value removes atypical spikes and dips from your data. Lower smoothing values are more precise, while higher values reveal only the broader trends.
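To make the effect of smoothing concrete, here is a minimal Python sketch. It assumes smoothing works as a plain centered moving average, where a smoothing value of n averages each year with up to n years on either side. The function name and the sample numbers are invented for illustration and are not taken from the Ngram Viewer.

```python
def smooth(values, smoothing):
    """Centered moving average over a list of yearly values.

    Each point is averaged with up to `smoothing` neighbours on either
    side; points near the edges simply use the neighbours that exist.
    """
    if smoothing < 0:
        raise ValueError("smoothing must be non-negative")
    result = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        result.append(sum(window) / len(window))
    return result


# Invented yearly frequencies containing one atypical spike.
raw = [1.0, 1.0, 7.0, 1.0, 1.0]

print(smooth(raw, 0))  # smoothing of 0 leaves the raw data untouched
print(smooth(raw, 1))  # the spike is spread across its neighbours
```

With a smoothing of 0 the spike stays exactly where it is. With a smoothing of 1 it is flattened into the surrounding years, which is why higher values hide short-lived events.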
Selecting a Corpus
The corpus is the text collection that the Ngram Viewer will examine. The default “English” corpus is acceptable for casual browsing, but its content can be highly academic.
“English Fiction” will more closely reflect common language. The standard “English” corpus can be non-fiction heavy, with plenty of technical words. Google offers brief explanations of what each corpus contains.
Advanced Search (2- through 5-grams)
By adding more search words (“grams,” in the language of the search engine), you can create complex comparisons across time. You can also refine a search with keyword commands, much like the advanced operators in Google Search.
Separate sequential search terms with a comma.
The Ngram Viewer will display the relative frequency of your search terms in a single graph. Hover over the graph’s lines to see precise data points.
Use the asterisk (“*”) in your search terms as a wildcard. For example, “Bachelor of *” would return results for many Bachelor’s degrees.
To find all the inflections of a term, append the “_INF” text command. This searches for every inflection of the attached word, like the various forms of “to be” in English.
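The tips above can be sketched as a small Python helper that assembles the text you would type into the search box. The function names are hypothetical. The five-word limit is an assumption taken from this section’s heading, and “cook” is just a sample word.

```python
MAX_WORDS = 5  # assumption: the viewer handles 1- through 5-grams


def inflections(word):
    """Append the _INF command so every inflection of `word` is searched."""
    return f"{word}_INF"


def build_query(*terms):
    """Join several search terms into one comma-separated query string."""
    cleaned = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # ignore empty entries
        if len(term.split()) > MAX_WORDS:
            raise ValueError(f"term has more than {MAX_WORDS} words: {term!r}")
        cleaned.append(term)
    return ",".join(cleaned)


query = build_query("pizza", "Bachelor of *", inflections("cook"))
print(query)  # pizza,Bachelor of *,cook_INF
```

Each comma-separated term is drawn as its own line on the graph, so this query would compare a plain word, a wildcard phrase, and an inflection search side by side.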
Parts of Speech
If a word can act as more than one part of speech, you can append a tag to be specific. The valid parts of speech in Google’s database include all of the following:
- _NOUN_: noun (pizza, book, car)
- _VERB_: verb (run, eat, think)
- _ADJ_: adjective (fast, large, smart)
- _ADV_: adverb (quickly, later, always)
- _PRON_: pronoun (their, it, we)
- _DET_: determiner or article (a, an, the)
- _ADP_: adposition (prepositions and postpositions)
- _NUM_: numeral (first, second, fifth)
- _CONJ_: conjunction (and, nor, but)
- _PRT_: particle, a rarely used catchall category for other word functions
Each of these tags can be combined with ordinary words to form phrases. For example, “_ADJ_ boy” would return adjective + “boy” word pairs.
To restrict a search term to one part of speech, append the tag to the end of the word without a trailing underscore, e.g., “water_VERB”.
To include every part of speech for a given word, use the wildcard operator after the underscore, e.g., “cook_*”.
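The three tag forms described in this section differ only in where the underscores go, which a short Python sketch can make explicit. The helper names are hypothetical, and the tag set simply mirrors the part-of-speech tags Google documents.

```python
# Part-of-speech tags accepted by the Ngram Viewer (NOUN and VERB included).
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON",
            "DET", "ADP", "NUM", "CONJ", "PRT"}


def _check(pos):
    """Normalise a tag name and reject anything outside POS_TAGS."""
    pos = pos.upper()
    if pos not in POS_TAGS:
        raise ValueError(f"unknown part-of-speech tag: {pos}")
    return pos


def tag_word(word, pos):
    """Restrict one word to a part of speech: water -> water_VERB."""
    return f"{word}_{_check(pos)}"


def pos_placeholder(pos):
    """Stand-in for any word of that part of speech: ADJ -> _ADJ_."""
    return f"_{_check(pos)}_"


def all_pos(word):
    """Ask for every part of speech of a word: cook -> cook_*."""
    return f"{word}_*"


print(tag_word("water", "verb"))       # water_VERB
print(pos_placeholder("ADJ") + " boy")  # _ADJ_ boy
print(all_pos("cook"))                  # cook_*
```

Note that a tag attached to a word has no trailing underscore, while a standalone placeholder has an underscore on both sides.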
Using Functional Variables
Functional variables let you search by the function or placement of words.
- _ROOT_ is a placeholder for the root of the sentence’s parse tree. It does not stand for a particular word; think of it as a placeholder for whatever the main verb of the sentence modifies.
- _START_ indicates the beginning of a sentence (“_START_ President Obama” returns only sentences that start with the phrase “President Obama”).
- _END_ indicates the end of a sentence (“_ADP_ _END_” returns sentences that end in prepositions).
By combining search terms with arithmetic operators, you can perform simple calculations on term frequencies:
- + adds multiple expressions into one search term
- - subtracts the expression on the right from the expression on the left, providing a quick way to measure one search term relative to another
- / divides the expression on the left by the expression on the right
- * multiplies the expression on the left by the number on the right, which makes it easier to compare ngrams of widely varying frequency. Make sure to enclose the whole ngram in parentheses to avoid having the asterisk parsed as a wildcard character.
- : searches for the ngram on the left within the corpus on the right
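The arithmetic operators act on the frequency series behind each line of the graph. As a rough Python sketch, assume each series is a list with one value per year and that the operators are applied year by year. The function names and all numbers are invented for illustration and are not real Ngram data.

```python
import operator

# Map each operator symbol to the arithmetic it performs.
OPS = {"+": operator.add, "-": operator.sub, "/": operator.truediv}


def combine(left, right, op):
    """Apply +, - or / year by year to two equally long frequency series."""
    if len(left) != len(right):
        raise ValueError("series must cover the same years")
    # Division raises ZeroDivisionError if the right-hand term is absent in a year.
    return [OPS[op](a, b) for a, b in zip(left, right)]


def scale(series, factor):
    """Multiply a series by a constant, like (ngram * factor) in the viewer."""
    return [value * factor for value in series]


# Invented frequencies for two terms over three years.
common = [0.5, 0.25, 0.75]
rare = [0.25, 0.25, 0.25]

print(combine(common, rare, "+"))  # [0.75, 0.5, 1.0]
print(combine(common, rare, "-"))  # [0.25, 0.0, 0.5]
print(combine(common, rare, "/"))  # [2.0, 1.0, 3.0]
print(scale(rare, 4))              # [1.0, 1.0, 1.0]
```

Scaling is mostly useful for display: multiplying a rare term by a constant lifts its line so its shape can be compared with that of a much more common term on the same graph.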
Finally, you can set dependencies with “=>” to search linguistic relationships. For example, “car=>fast” would return results where “fast” is grammatically dependent on, or modifies, the word “car.” Dependencies can be combined with other advanced operators, such as wildcards.
When working with multi-grams, your search can quickly get complicated. Some of these search techniques play well together, while others are incompatible. The best way to find out whether something works is simply to try it. For example, the _INF tag is highly flexible, while _VERB is picky. You’ll quickly learn the quirks as you delve into the Ngram Viewer’s toolkit.
Icon credit: Good Ware