Categories: Graphs, Codepen
Tags: Wikipedia
A while ago on YouTube I watched a video explaining Zipf’s Law and how it appears everywhere around us. Zipf’s Law is basically a power-law distribution that applies to several types of data. In language, it can be shown that the frequency of any word is approximately inversely proportional to its rank in a frequency table. Thus the most frequent word in a given text will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
I thought it would be a fun experiment to test Zipf’s Law by observing a random article on Wikipedia and displaying the resulting distribution of words in a graph. I knew that I would need to access Wikipedia’s public API to get the text of a random article, and I knew that I would need to parse and analyze that text to get my data (I built my own functions). Finally, I needed a way to display the data in some kind of chart (I chose Google Charts).
The markup and the styling are super basic, so I’ll get right to the meat of the project: my JS code. I experimented with the Wikimedia API sandbox until I found the correct set of options for the AJAX call.
Note: I used jQuery in my code for conciseness.
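
As a rough illustration of what such a set of options can look like, here is a minimal sketch. The helper name `buildRandomArticleRequest` is hypothetical, and the exact parameters are an assumption (a random-page generator combined with the `extracts` property), not necessarily the ones used in the pen.

```javascript
// Hypothetical helper: builds the options object for a jQuery $.ajax call
// that asks the MediaWiki API for the extract of one random article.
function buildRandomArticleRequest() {
  return {
    url: "https://en.wikipedia.org/w/api.php",
    dataType: "json",
    data: {
      action: "query",
      format: "json",
      origin: "*",         // allow an anonymous cross-origin request
      generator: "random", // pick a random page...
      grnnamespace: 0,     // ...from the article namespace only
      grnlimit: 1,         // one page per request
      prop: "extracts"     // return the page extract (limited HTML)
    }
  };
}

// In the browser, with jQuery loaded and a successCB function defined:
// $.ajax({ ...buildRandomArticleRequest(), success: successCB });
```

Keeping the options in a plain object makes them easy to tweak after trying different combinations in the API sandbox.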
That callback function at the end of the AJAX call handles the JSON data that is returned from the AJAX request. First, the page ID is extracted from the returned object and then used to store the page extract in a variable `html`, which is then stripped of its HTML markup via another function, `strip()`, and returned to a constant `text`.
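
A minimal sketch of that step might look like the following. The response shape (pages keyed by page ID under `query.pages`) is how the query API returns a single page; the regex-based `strip()` is a simplified stand-in, the `extractText` name is hypothetical, and the sample data is made up.

```javascript
// Simplified stand-in for strip(): replaces anything that looks like a tag
// with a space. Fine for a quick experiment, not a real HTML parser.
function strip(html) {
  return html.replace(/<[^>]*>/g, " ");
}

// Hypothetical helper: pulls the extract out of a query response.
function extractText(data) {
  const pages = data.query.pages;
  const pageId = Object.keys(pages)[0]; // only one page was requested
  const html = pages[pageId].extract;   // extract as HTML
  const text = strip(html);             // plain text, markup removed
  return text;
}

// Made-up response in the shape returned for a single page.
const sampleResponse = {
  query: {
    pages: {
      "42": { pageid: 42, title: "Example", extract: "<p>Hello <b>world</b>.</p>" }
    }
  }
};

console.log(extractText(sampleResponse));
```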
I use two separate regular expressions to strip out extra punctuation and whitespace and store the result in `finalStr`. Finally, I `.split()` the string into words and store the array. The `sortArr()` function counts the appearances of each distinct word in the text and returns a 2D array in the form `[[word, #], [word, #], ...]`.
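
Here is one way the cleaning and counting could be sketched. The two patterns, the lowercasing, and the `cleanText` name are assumptions for illustration; only `finalStr` and `sortArr()` are names from the project.

```javascript
// Hypothetical helper: normalizes text so it can be split on single spaces.
function cleanText(text) {
  // First pattern: replace everything except letters, digits, apostrophes
  // and whitespace with a space (lowercasing so "The" and "the" match).
  const noPunct = text.toLowerCase().replace(/[^\p{L}\p{N}'\s]/gu, " ");
  // Second pattern: collapse runs of whitespace into one space.
  const finalStr = noPunct.replace(/\s+/g, " ").trim();
  return finalStr;
}

// Counts each distinct word and returns [[word, count], ...],
// sorted from most to least frequent.
function sortArr(words) {
  const counts = new Map();
  for (const word of words) {
    if (word === "") continue; // skip the empty entry from an empty string
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const words = cleanText("The cat, the dog; the END.").split(" ");
console.log(sortArr(words));
```

Sorting by count up front means the array index doubles as the word's rank, which is exactly what a Zipf plot needs on the x-axis.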
After the parsing and analysis of the text, I load up Google Charts and use a callback function `drawChart()` which creates the graph itself using the sorted data in `vars.sortedObj`.
The API for this chart requires an array of the form `[[word, #], [word, #], …]`, so I use the `.forEach()` method to iterate through the sorted array and `.push()` an entry to the storage variable (with a configurable limit of `vars.max` so that I don’t get hundreds of x-axis entries).
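
The row-building step can be sketched roughly as below. The `buildChartRows` helper, the header row, the default limit of 25, the `"chart"` element ID, and the choice of a column chart are all assumptions for illustration; unlike the project's `drawChart()`, this version takes the sorted data as an argument to keep the sketch self-contained.

```javascript
const vars = { max: 25 }; // hypothetical default for the x-axis limit

// Hypothetical helper: turns [[word, count], ...] into chart rows,
// keeping at most `max` entries.
function buildChartRows(sorted, max) {
  const rows = [["Word", "Count"]]; // header row for arrayToDataTable
  sorted.forEach((entry, i) => {
    if (i < max) rows.push([entry[0], entry[1]]);
  });
  return rows;
}

// Browser only: relies on the Google Charts loader and a #chart element.
function drawChart(sorted) {
  const data = google.visualization.arrayToDataTable(
    buildChartRows(sorted, vars.max)
  );
  const chart = new google.visualization.ColumnChart(
    document.getElementById("chart")
  );
  chart.draw(data, { title: "Word frequency", legend: { position: "none" } });
}

// In the browser, after including the Google Charts loader script:
// google.charts.load("current", { packages: ["corechart"] });
// google.charts.setOnLoadCallback(() => drawChart(sortedData));
```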
Cool! Now I have the AJAX call, the callback function `successCB()` that fires when the call returns, and then a `drawChart()` that actually takes the data and feeds it to the Google Chart to display on screen.
While Zipf’s Law may hold true on a larger scale, smaller sample sizes do not necessarily follow the power-law distribution. The shorter the excerpt, the less reliably the correlation holds. With large excerpts, the correlation becomes much more prominent.
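
One quick way to eyeball how closely an article follows the law is to put each observed count next to the count an idealized Zipf curve would predict. This is only a sketch: the `compareToZipf` helper is hypothetical and not part of the pen, and it assumes the simplest form of the law, where the word at rank r appears 1/r as often as the top word.

```javascript
// Hypothetical helper: takes [[word, count], ...] sorted by descending count
// and pairs each observed count with the idealized Zipf prediction.
function compareToZipf(sorted) {
  if (sorted.length === 0) return [];
  const topCount = sorted[0][1]; // anchor the ideal curve at the top word
  return sorted.map(([word, count], i) => {
    const rank = i + 1;
    return { word, rank, observed: count, expected: topCount / rank };
  });
}

// Made-up counts, for illustration only.
console.log(compareToZipf([["the", 12], ["of", 6], ["and", 5]]));
```

On a short excerpt the observed and expected columns tend to drift apart quickly after the first few ranks.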
On the programming side, I learned that regular expressions are really difficult to get right! It was so frustrating that I did a quick Google search to find the right pattern for what I was trying to match.
I also learned that creating data visualizations can be pretty fun and intuitive if you can understand the API for the tool. I went with Google’s charts because they had really good documentation and examples I could pull from. I want to do a lot more with data visualizations in the future!
See the Pen [Word Distribution on Wikipedia Articles](https://codepen.io/acrenwelge/pen/YQbBWo/) by Andrew ([@acrenwelge](https://codepen.io/acrenwelge)) on [CodePen](https://codepen.io).