Zipf’s law – Of dwarves and giants

Imagine this: around 6 percent of the things you say and write are “the…”, and that’s it. “The” is the most frequent word of the English language, and you probably use it more often than any other word. But this fact is just the tip of the iceberg of a rather puzzling and remarkable property of human language. When you look at the frequency ranking of the top 20 words in English, namely the, of, and, to, a, in, that, it, is, was, I, for, on, you, he, be, with, as, by, at, you find that the words occur according to a highly regular and systematic frequency distribution, the so-called Zipf’s law, named after the linguist George Kingsley Zipf (1902-1950) (cf. Pustet 2004). According to his work, the second most frequent word of a language appears half as often as the most frequent word, the third most frequent one a third as often, the fourth a fourth as often, and so on.

And this works for all the words in a language, from highly frequent ones like “the” to less frequent ones like “jellyfish”. So the frequency of a word is roughly proportional to 1 over its rank; in other words, it follows a power law, the Zipfian distribution, a fixed pattern so to speak. The frequencies of the words of a natural language thus vary enormously, which is not trivial at all: there are a few ‘giant words’ such as “the” or “with” and countless ‘dwarves’ such as “ravioli” or “catamaran”, and those giants cover a huge share of the language produced.
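The 1-over-rank rule can be turned into a few lines of Python. This is only a rough sketch of the idealized law, assuming the 6 percent share of “the” quoted above; the names TOP_SHARE and zipf_share are made up for this illustration, and real corpora deviate from the ideal curve.

```python
# Idealized Zipf's law: the share of the word at a given rank is the
# share of the top word divided by that rank.
TOP_SHARE = 0.06  # share of "the", taken from the figure quoted in the text

# The 20 most frequent English words as listed in the text, in rank order.
TOP_20 = ("the of and to a in that it is was "
          "I for on you he be with as by at").split()


def zipf_share(rank: int, top_share: float = TOP_SHARE) -> float:
    """Predicted share of all word tokens for the word at `rank` (1-based)."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_share / rank


# Predicted share of each of the top 20 words.
for rank, word in enumerate(TOP_20, start=1):
    print(f"{rank:>2}  {word:<5} {zipf_share(rank):.2%}")

# Share of all running text covered by these 20 'giants' taken together.
coverage = sum(zipf_share(rank) for rank in range(1, len(TOP_20) + 1))
print(f"top 20 together: {coverage:.1%}")
```

Under these idealized assumptions the twenty giants alone account for roughly a fifth of all the words used.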

And this is not only true for English but for all languages for which data is available so far, even for languages that have not been deciphered yet, such as Meroitic (cf. Smith 2008), which could indicate that this pattern applies to all languages in the world (cf. Bentz et al. 2015, or Piantadosi 2014: 1117 for even more languages: Spanish, Russian, Greek, Portuguese, Chinese, Swahili, Chilean, Finnish, Estonian, French, Czech, Turkish, Polish, Basque, Maori, Tok Pisin).

To some extent it is even true for the roughly 470 words of this tiny blog post.
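If you want to check a text of your own, a count like this is easy to reproduce. The following is a minimal sketch, not the method used for this post: rank_frequency is a hypothetical helper, the tokenizer is deliberately simple, and the sample string is a placeholder to be replaced with the text you want to test.

```python
import re
from collections import Counter


def rank_frequency(text: str, top_n: int = 10):
    """Return (rank, word, observed count, Zipf-predicted count) rows."""
    # Lowercase and keep runs of letters, allowing an inner apostrophe
    # as in "Zipf's"; this is a deliberately simple tokenizer.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    ranked = Counter(words).most_common(top_n)
    if not ranked:
        return []
    # Under the idealized law, the word at rank r occurs top_count / r times.
    top_count = ranked[0][1]
    return [(rank, word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]


sample = "the cat and the dog and the bird saw the fish"  # placeholder text
for rank, word, count, predicted in rank_frequency(sample):
    print(f"{rank:>2}  {word:<6} observed={count:<3} predicted={predicted:.1f}")
```

With a text of only a few hundred words the observed counts will scatter around the predicted ones, so expect a rough match at best.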

But why is that? Many linguists have tried to figure this out and to give a good reason for it. The longer-than-usual bibliography below gives an impression of that. For example, Altmann et al. (2011) claim that a word’s particular use, its niche, meaning the characteristic properties and contexts in which it is used, has a strong impact on its frequency in a language and on how that frequency changes over time. To put it simply, people start to use the word “chat” once the concept of chat is ‘invented’, and the total number of occurrences increases with it. However, this is only one of many explanations, notions or implications of the Zipfian distribution, the language riddle of giants and dwarves, and there is still a lot to explore about it.

For further reading, explore the literature below.

Jonas Schreiber (FAU Erlangen-Nürnberg)
Intern at Brill’s Linguistic Bibliography


