Links to my...

Categories: Graphs, Codepen

Tags: Wikipedia

A while ago on YouTube I watched a video explaining Zipf’s Law and how it appears everywhere around us. Zipf’s Law is basically a power-law distribution that applies to several types of data. In language, it can be shown that the frequency of any word is approximately inversely proportional to its rank in a frequency table. Thus the most frequent word in a given text will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

I thought it would be a fun experiment to test Zipf’s Law by observing a random article on Wikipedia and displaying the resulting distribution of words in a graph. I knew that I would need to access Wikipedia’s public API to get the text of a random article, and I knew that I would need to parse and analyze that text to get my data (I built my own functions). Finally, I needed a way to display the data in some kind of chart (I chose Google Charts).

The JavaScript

The markup and the styling are super basic, so I’ll get right to the meat of the project: my JS code. I began experimenting with the Wikimedia API sandbox until I found the correct set of options for the AJAX call:

$.ajax({
   url: 'https://en.wikipedia.org/w/api.php',
   data: {
     action: 'query',
     generator: 'random',
     indexpageids: 1,
     grnnamespace: 0,
     prop: 'extracts',
     origin: '* ',
     format: 'json'
   },
   dataType: 'json',
   success: successCB
  });

Note: I used jQuery in my code for conciseness

That callback function at the end of the AJAX call handles the JSON data that is returned from the AJAX request. First, the page ID is extracted from the returned object and then used to store the page extract in a variable html which is then stripped of its HTML markup via another function, strip(), and returned to a constant text. I use two separate regular expressions to strip out extra punctuation and whitespace and store the result in finalStr. Finally, I .split() the string into words and store the array. The sortArr() function counts the appearances of each distinct word in the text and returns a 2D array in the form:
[[word, #], [word, #], ...].

function successCB(x) {
  const id = x.query.pageids[0];
  const html = x.query.pages[id].extract;
  vars.articleTitle = x.query.pages[id].title;
  const text = strip(html);
  var whitespace = new RegExp(/\s{2,}/g);
  var punc = new RegExp(/["'.,\/#!$%\^&\*;:{}=\-\_\`~()]/g);
  var nopunc = text.replace(punc,"");
  var finalStr = nopunc.replace(whitespace," ");
  var arr = finalStr.toLowerCase().split(' ');
  $("#target").text(text);
  vars.sortedObj = sortArr(arr);
  $("#target2").text(JSON.stringify(vars.sortedObj));
  // Load the Visualization API and the corechart package.
  google.charts.load('current', {'packages':['corechart']});
  // Set a callback to run when the Google Visualization API is loaded.
  google.charts.setOnLoadCallback(drawChart);
  // Callback that creates and populates a data table,
  // instantiates the pie chart, passes in the data and
  // draws it.
}

After the parsing and analysis of the text, I load up Google Charts and use a callback function drawChart() which creates the graph itself using the sorted data in vars.sortedObj. The API for this chart requires an array of the form [[word, #], [word, #], …] so I use the .forEach() method to iterate through the sorted array and .push() an entry to the storage variable (with a configurable limit of vars.max so that I don’t get a hundreds of x-axis entries).

function drawChart() {
  // Create the data table.
  var data = new google.visualization.DataTable();
  data.addColumn('string', 'X');
  data.addColumn('number', 'Frequency Distribution');
  vars.sortedObj.forEach(function(entry,i) {
    let word = entry.word;
    let count = entry.count;
    if (i<vars.max) vars.rows.push([word,count]);
  });
  data.addRows(vars.rows);
  var options = {
    // some styling options here...
  };
  // Instantiate and draw our chart, passing in some options.
  var chart = new google.visualization.LineChart(document.getElementById('chart'));
  chart.draw(data, options);
}

Cool! Now I have the AJAX call, the callback function successCB() that fires when the call returns, and then a drawChart() that actually takes the data and feeds it to the Google Chart to display on screen.

What I Learned

While Zipf’s Law may hold true on a larger scale, smaller sample sizes do not necessarily follow the power-law distribution. The shorter the excerpt, the less reliable the correlation holds. With large excerpts, the correlation becomes much more prominent.

On the programming side, I learned that regular expressions are really difficult to get right! It was so frustrating that I did a quick google search to find the right pattern that I was trying to match.

I also learned that creating data visualizations can be pretty fun and intuitive if you can understand the API for the tool. I went with Google’s charts because they had really good documentation and examples I could pull from. I want to do a lot more with data visualizations in the future!

See the Pen [Word Distribution on Wikipedia Articles](https://codepen.io/acrenwelge/pen/YQbBWo/) by Andrew ([@acrenwelge](https://codepen.io/acrenwelge)) on [CodePen](https://codepen.io).