For decades, the prevailing theory has held that a word’s length reflects how frequently it is used: common words stay short to make language more efficient. With “the”, “of” and “and” the three most commonly used words in American English according to the Brown Corpus, the theory seems to make sense. And just consider how long it would take to get a sentence out if “the” were as long as the name of an Icelandic volcano. Now a team of MIT cognitive scientists has used Google data to develop an alternative theory: a word’s length actually reflects the amount of information it contains.

More words often better than one

Although the notion that higher frequency of use engenders shorter words has intuitive appeal, Steven Piantadosi, a PhD candidate in MIT’s Department of Brain and Cognitive Sciences (BCS), says such a theory doesn’t take into account the dependencies between words.

That is, many words, such as the three commonly used words listed above, typically appear in predictable sequences alongside other words. The researchers found that short words are not necessarily highly frequent; rather, because they carry little information by themselves, they appear in strings of other familiar words that, together, convey the information.

Although it achieves it in a different way, the researchers say this arrangement creates an efficiency of its own: the clustering of short words helps to “smooth out” the flow of information in language by forming strings of similar-sized language packets. Whether information is delivered through clusters of shorter words or through individual longer words that each carry more of it, language tends to convey information at a consistent rate.

Googling the answer

For their study, which focused on 11 European languages, the MIT researchers analyzed an enormous data set of online documents posted by Google. Because the documents included many Internet-specific character sequences that are not words, such as “www”, the team first catalogued texts from Open Subtitles, a database of movie subtitle translations, and then searched for the words used in those documents when mining the larger Google database.

“Movie subtitles are words used naturalistically, so we took words used frequently in that data set and pulled their statistics from Google,” explains Piantadosi.

To evaluate how much information a word contains, the researchers defined a word’s information content as inversely related to its predictability. That is, words that often occur after familiar sequences of two, three or four other words contain the least information individually, while words that have a minimal relationship to the words preceding them contain, individually, more information.

For example, the “eat” in “you are what you eat” contains less information than the word “contagious” in “you are contagious.”
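This predictability-based notion of information can be made concrete as surprisal: the negative log-probability of a word given the words that precede it. Below is a minimal sketch using a toy corpus and trigram counts; the tiny corpus and the specific sentences are invented for illustration (the actual study estimated such probabilities from Google’s much larger data).

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; the study used Google-scale counts.
corpus = ("you are what you eat . you are what you read . "
          "you are contagious .").split()

# Count two-word contexts and the trigrams they begin,
# to estimate P(word | two preceding words).
context_counts = Counter()
trigram_counts = Counter()
for i in range(2, len(corpus)):
    context = (corpus[i - 2], corpus[i - 1])
    context_counts[context] += 1
    trigram_counts[context + (corpus[i],)] += 1

def surprisal(w1, w2, w):
    """Information (in bits) of word `w` given the two preceding words:
    -log2 P(w | w1, w2). Higher = less predictable = more informative."""
    p = trigram_counts[(w1, w2, w)] / context_counts[(w1, w2)]
    return -math.log2(p)

# In this toy corpus, "eat" is one of two equally likely continuations
# of "what you" (1 bit), while "contagious" follows "you are" only one
# time in three, so it carries more information (log2 3 ≈ 1.58 bits).
print(surprisal("what", "you", "eat"))        # → 1.0
print(surprisal("you", "are", "contagious"))  # → 1.584962500721156
```

Averaging such surprisal values over every context in which a word appears gives the per-word information measure the researchers compared against word length.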

The researchers acknowledge there are examples that support their thesis as well as ones that don’t. For example, words as different in length as “mind” and “organization” appear with virtually the same frequency, while words of different lengths, such as “menu” and “selection,” contain virtually the same informational content. However, they say their study found that 10 percent of the variation in word length is attributable to the amount of information contained in the words.

Although the researchers admit this isn’t a high figure by itself, they say it is about three times as large as the variation in word length attributable to frequency. For English words, nine percent of the variation in length is due to the amount of information the word contains, and one percent stems from frequency.
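“Percent of variation attributable” here is the familiar R² of a linear fit: how much of the spread in word lengths a single predictor accounts for. The sketch below computes it from scratch on hypothetical toy numbers (the word measurements are invented, not the study’s data) to show what comparing the two figures amounts to.

```python
# Hypothetical toy measurements for six words: length in characters,
# mean surprisal (bits), and log frequency. Illustrative numbers only.
lengths    = [3, 4, 5, 7, 9, 12]
surprisals = [2.1, 2.8, 3.0, 4.2, 5.1, 6.0]
log_freqs  = [-2.0, -3.5, -2.8, -4.0, -4.4, -5.2]

def r_squared(xs, ys):
    """Fraction of variance in `ys` explained by a linear fit on `xs`
    (the square of the Pearson correlation coefficient)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

# The study's comparison contrasts these two quantities across a whole
# lexicon; for English it reported roughly 9% for information content
# versus 1% for frequency.
print(r_squared(surprisals, lengths))
print(r_squared(log_freqs, lengths))
```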

The MIT team’s study was published online last month in the Proceedings of the National Academy of Sciences (PNAS). Piantadosi is now applying data-mining techniques similar to the ones used for the paper to study the role of ambiguity in language, examining how the meaning of words with multiple potential definitions is clarified by the frequently appearing words around them.