
What’s in a word? Researchers say it depends how long it is


February 16, 2011

MIT researchers say that a word's length is a reflection not of its frequency, but of the information it contains (Image: sjcockell)


For decades, the prevailing theory has held that a word’s length reflects how frequently it is used, with common words kept short to make language more efficient. With “the”, “of” and “and” ranking as the three most commonly used words in American English according to the Brown Corpus, the theory seems to make sense. Just consider how long it would take to get out a sentence if “the” were as long as the name of an Icelandic volcano. Now a team of MIT cognitive scientists has used Google data to develop an alternative theory: that a word’s length actually reflects the amount of information it contains.

More words often better than one

Although the notion that higher frequency of use engenders shorter words has an intuitive appeal to it, Steven Piantadosi, a PhD candidate in MIT’s Department of Brain and Cognitive Sciences (BCS), says such a theory doesn’t take into account the dependencies between words.

That is, many words, such as the three commonly used words listed above, typically appear in predictable sequences along with other words. The researchers found that short words are not necessarily highly frequent. Rather, because they don’t contain much information by themselves, they appear in strings of other familiar words that, together, convey information.

The researchers say this creates an efficiency of its own, albeit of a different kind: the clustering of short words helps to “smooth out” the flow of information in language by forming strings of similar-sized language packets. Whether delivered through clusters of shorter words or through individual longer words carrying greater information, language tends to convey information at consistent rates.

Googling the answer

For their study, which focused on 11 European languages, the MIT researchers studied an enormous data set of online documents posted by Google. Because the documents included a lot of Internet-specific character sequences not comprising words, such as “www”, the team first catalogued texts from Open Subtitles, a database of movie translations, and searched for words used in those documents when mining the larger Google database.

“Movie subtitles are words used naturalistically, so we took words used frequently in that data set and pulled their statistics from Google,” explains Piantadosi.

To evaluate how much information a word contains, the researchers defined information as inversely related to a word’s predictability. That is, words that often occur after familiar sequences of two, three or four other words contain the least information individually, while words that have a minimal relationship to the words preceding them contain, individually, more information.

For example, the “eat” in “you are what you eat” contains less information than the word “contagious” in “you are contagious.”
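The article does not give the researchers' exact formula, but the idea of information as the inverse of predictability can be sketched with a standard measure: the average surprisal of a word, in bits, given the words that precede it. The short Python example below is an illustration only. The function name `average_information`, the made-up toy corpus and the use of raw counts without any smoothing are all assumptions for the sake of the sketch, not the MIT team's method or data.

```python
import math
from collections import Counter, defaultdict


def average_information(tokens, n=3):
    """Return each word's average surprisal in bits, -log2 P(word | previous n-1 words).

    Probabilities are plain relative frequencies from `tokens` (no smoothing),
    which is an assumption made to keep the sketch short.
    """
    context_counts = Counter()  # how often each (n-1)-word context occurs
    ngram_counts = Counter()    # how often each (context, word) pair occurs
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        context_counts[context] += 1
        ngram_counts[(context, tokens[i])] += 1

    total_bits = defaultdict(float)  # summed surprisal per word
    occurrences = Counter()          # occurrences of each word with a full context
    for (context, word), count in ngram_counts.items():
        probability = count / context_counts[context]
        total_bits[word] += count * -math.log2(probability)
        occurrences[word] += count

    return {word: total_bits[word] / occurrences[word] for word in total_bits}


# Hypothetical toy corpus, invented purely for illustration.
toy_tokens = (
    "what you eat is what you eat "
    "you are contagious you are late you are here"
).split()

info = average_information(toy_tokens, n=3)
print(f"eat: {info['eat']:.2f} bits")
print(f"contagious: {info['contagious']:.2f} bits")
```

In this toy corpus “eat” always follows “what you”, so it is perfectly predictable and scores 0 bits, while “contagious” is one of three equally likely words after “you are” and scores about 1.58 bits. A real analysis would need a far larger corpus and some way of handling word sequences that were never observed.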

The researchers acknowledge there are examples that support their thesis as well as ones that don’t. For example, words as different in length as “mind” and “organization” appear with virtually the same frequency, while words of different lengths, such as “menu” and “selection,” contain virtually the same informational content. However, they say their study found that 10 percent of the variation in word length is attributable to the amount of information contained in the words.

Although the researchers admit this isn’t a high figure by itself, they say it is about three times as large as the variation in word length attributable to frequency. For English words, nine percent of the variation in length is due to the amount of information the word contains, and one percent stems from frequency.

The MIT team’s study was published online last month in the Proceedings of the National Academy of Sciences (PNAS). Piantadosi is now using data-mining techniques similar to the ones used for the paper to study the role of ambiguity in language, examining how the meaning of words with multiple potential definitions is clarified by the frequently appearing words around them.

About the Author
Darren Quick
Darren's love of technology started in primary school with a Nintendo Game & Watch Donkey Kong (still functioning) and a Commodore VIC 20 computer (not still functioning). In high school he upgraded to a 286 PC, and he's been following Moore's law ever since. This love of technology continued through a number of university courses and crappy jobs until 2008, when his interests found a home at Gizmag.
1 Comment

I am curious to know how this study is affected by words that the English language has borrowed, as proper nouns, from Native languages. For example, Milwaukee, Chicago, and Mississippi are somewhat meaningless place names to us, yet each was once a phrase describing the location's climate, vegetation, or population.

Paul Ensign