March 23, 2017

Data from multiple directions

A case study in pulling data from an interactive map with a little tech savvy and some regex

Note from @mtdukes: This post is part of a series of semi-regular walkthroughs and exercises connected with the Computer-Assisted Reporting Study Hall project. Want to join CAR Study Hall? Fill out our interest form and tell us what you're most interested in learning.

Our first weekly challenge over at CAR Study Hall celebrated Sunshine Week with a simple proposition: Submit a data request to a government agency in your local area.

The group shared several cool requests, including one from Rossie Izlar for the data behind this interactive map of oceanfront sandbags along the coast of North Carolina.

Interactive maps are funny sometimes, because you can often root around in the HTML/JavaScript to find where the data are coming from -- and the data are often structured in a way that will actually be useful when you get them. So while that request was churning through the bureaucracy, we also opted to grab the online version as a backup.

This guide will walk you through the process.

What you'll learn

  • How to inspect code on a Web page
  • Basic structured data file formats and how to convert them
  • How to use regular expressions to clean up data

What you'll need

  • QGIS, a free, open-source mapping and GIS tool
  • Sublime Text, a robust text editor (free to evaluate)

Experience required

  • None.

Steps

  1. Download the data file
  2. Convert the file
  3. Examine the structure
  4. Simple find and replace
  5. Pulling out the big guns with regex
  6. Applying the expression to replace

Step 1: Download the data file

Begin by using "Inspect Element" to peek at the code on the map page. In Chrome, this is available in the right-click menu.

The designer of this particular page embedded the JavaScript for the map inline, so we can easily expand the code and -- just like that -- find the location of the source data file.

Using "Inspect Element" to find the file

After navigating to that page via our browser, we'll automatically download a KMZ, or Keyhole Markup Language Zipped file, which is a fairly common format used in applications like Google Earth.

You can open it with a text editor, but there's a ton of junk in there we don't need related to the map data. So let's get rid of that.
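Incidentally, this step is easy to script: a KMZ is just a zip archive wrapped around a KML file, so Python's standard library can fetch and unpack it. A minimal sketch, with the URL and filenames as hypothetical stand-ins for whatever you find via "Inspect Element":

import urllib.request
import zipfile

# Stand-in for the real KMZ link you spotted in the page's JavaScript
url = 'http://example.com/sandbags.kmz'
urllib.request.urlretrieve(url, 'sandbags.kmz')

# A KMZ is just a zipped KML, so unpack it to see the raw markup
with zipfile.ZipFile('sandbags.kmz') as kmz:
    kml_name = next(n for n in kmz.namelist() if n.endswith('.kml'))
    kmz.extract(kml_name)  # typically writes doc.kml to the working directory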

Step 2: Convert the file

There are a few online applications that can probably help, but my go-to for stuff like this is a mapping application called QGIS. It's free, incredibly full-featured and much less buggy than it used to be. And although there's a lot it can do, we're really only going to use it to convert the KMZ file to a format called GeoJSON, or geospatial JSON (which is itself just another file type for structuring data).

Start up QGIS and click the "Add Vector Layer..." button along the left toolbar. You can then load your KMZ file.

Loading the KMZ file

With your layer loaded, right-click on it in your layer menu on the left and click "Save as..." Change the file format to "GeoJSON," but keep all the other settings the same.

So there we are.

We can use Sublime to look at the contents of the file, which is still filled with a bunch of useless garbage. A quick look at the original online map shows us why.

For each of the data elements -- in this case, sandbags -- there's an HTML table encoded into the file so it looks nice on the Web page when you click on the map.

PROTIP: Things that are designed to look nice for humans don't make good structured data.

Step 3: Examine the structure

It's important to understand what we mean when we say a JSON or GeoJSON file is "structured." If we paste the text into a simple JSON viewer like this one, we'll see something that looks like the image below. 

Examining JSON elements

You can see these expandable and collapsible subgroups, and in particular, there's a category called "features" that contains hundreds of data elements. These are our sandbags, and clicking on each will show us that junky HTML table embedded under "properties" and "description." We could certainly clean this up to turn it into a more useful JSON file.
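A few lines of Python can do the same job as the viewer, if you prefer. Assuming the converted file was saved as sandbags.geojson (a hypothetical name):

import json

with open('sandbags.geojson') as f:
    data = json.load(f)

# One feature per sandbag
print(len(data['features']))

# Each feature holds the junky HTML table and the coordinates
first = data['features'][0]
print(first['properties']['description'][:100])
print(first['geometry']['coordinates'])  # GeoJSON order is [lng, lat, altitude]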

But instead, let's try to build a comma-separated value file -- something we can easily open in Excel. What we want to do is take the individual pieces of information out of that HTML table and store them in a row, along with other useful info like the coordinates.

And for that, we can use regular expressions: strings of text characters that specify a search pattern, which programs like text editors can use to find, delete and/or replace stuff you don't need. It's like a slightly more complicated version of doing a Find-Replace in Word to change all instances of "it's" to "it is".

The beauty is that we can do this find and replace all at once, instead of hundreds of times, and we can make it pretty complex. But first we need to find a search pattern before we can translate it into regex.
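To make that concrete, here's the "it's" to "it is" swap as a one-pass regular expression in Python's re module:

import re

text = "It's sunny, and it's warm."
# One pass replaces every match, no matter how many there are
print(re.sub(r"[Ii]t's", "it is", text))  # -> "it is sunny, and it is warm."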

Step 4: Simple find and replace

We're focused on the "description" field, which looks a little something like this:

"description": "<html xmlns:fo=\"http:\/\/www.w3.org\/1999\/XSL\/Format\" xmlns:msxsl=\"urn:schemas-microsoft-com:xslt\">\n\n<head>\n\n<META http-equiv=\"Content-Type\" content=\"text\/html\">\n\n<meta http-equiv=\"content-type\" content=\"text\/html; charset=UTF-8\">\n\n<\/head>\n\n<body style=\"margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;\">\n\n<table style=\"font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px\">\n\n<tr style=\"text-align:center;font-weight:bold;background:#9CBCE2\">\n\n<td><\/td>\n\n<\/tr>\n\n<tr>\n\n<td>\n\n<table style=\"font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px\">\n\n<tr>\n\n<td>ID:<\/td>\n\n<td>2<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Status:<\/td>\n\n<td>Submerged<\/td>\n\n<\/tr>\n\n<tr>\n\n<td>Permit Issued:<\/td>\n\n<td>N\/A<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Location:<\/td>\n\n<td>Bald Head Island<\/td>\n\n<\/tr>\n\n<\/table>\n\n<\/td>\n\n<\/tr>\n\n<\/table>\n\n<\/body>\n\n<\/html>\n\n"

Yuck.

We don't need all that HTML code in there -- that we know. But if you look closely, you can see there's data we do want. In this case, it's the content in the HTML table element.

<table style=\"font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px\">\n\n<tr>\n\n<td>ID:<\/td>\n\n<td>2<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Status:<\/td>\n\n<td>Submerged<\/td>\n\n<\/tr>\n\n<tr>\n\n<td>Permit Issued:<\/td>\n\n<td>N\/A<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Location:<\/td>\n\n<td>Bald Head Island<\/td>\n\n<\/tr>\n\n<\/table>\n\n<\/td>\n\n<\/tr>\n\n<\/table>

Given this sample and the additional coordinate information, we can start to visualize what we want our data to look like once we've cleaned it up in a CSV file.

id, status, permit_issued, location, lat, lng
2, "Submerged", "N/A", "Bald Head Island", 33.858446134464273, -78.003299912361328
3, "Submerged", "N/A", "Bald Head Island", 33.858446134464273, -78.003299912361328
5, "Installed", "36859D", "Bald Head Island", 33.863370187627197, -78.007940634957222

Notice we're using simple field names (no spaces or caps), following good practice for structuring data in spreadsheets.

Open the GeoJSON file in Sublime Text, navigate to the Find menu and click on "Replace..." This will open up the find-replace dialogue at the bottom of the window.

We can use this to get rid of the header junk that precedes the "ID" field on every single one of our sandbag entries, replacing it with ... nothing! Highlight the section and copy it, then paste it into the "Find What" field. Leave the "Replace With" field blank. Click "Replace All."

Replacing the header junk

We can do the same thing to the junk between the actual data in the HTML table and the geometric coordinates, which includes some holdovers from the mapping coordinates. But instead of leaving the "Replace With" field blank, we're going to add a comma. This will separate our "location" field from our "lat" field in the spreadsheet. So before you "Replace All", your screen should look something like this.

Replacing the footer with a comma

Getting closer! Generally, stuff at the beginning and the end of data we want is easy to find and replace. We can run the find-and-replace one more time on the characters , 0.0 ] } }, that appear after our longitude coordinate and before our next ID entry.

In this case, we can leave the "Replace With" field blank again.
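These literal swaps are easy to script, too -- no regex required. A minimal sketch in Python, again with hypothetical filenames:

# str.replace swaps every occurrence at once, just like Sublime's Replace All
with open('sandbags.geojson') as f:
    text = f.read()

# Delete the trailing geometry junk after each longitude value; the header
# junk and the table-to-coordinates junk would get the same treatment with
# their own literal strings
text = text.replace(', 0.0 ] } },', '')

with open('partial_clean.txt', 'w') as f:
    f.write(text)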

Step 5: Pulling out the big guns with regex

When data is intermingled with junk, regex can really come in handy. That's because it can also capture variables between patterns and preserve them when you do the replace.

CONFESSION: I use regex too infrequently to remember all the details, so I frequently lean on a site called Rubular, which helps you test out different regex patterns on your data.

Using Rubular to test your regex

A sample of match groups

For this problem, I took a small sample of the data and pasted it right into the "Your test string" window. Then I used the top line to start building my regular expression.

If you take a closer look at my test here, you'll notice the site also shows "Match groups," which looks exactly like the data we want.

Herein lies the beauty of regex: We can capture this data and essentially strip it from the junk, giving us a clean data set.

But before we apply our regular expression, let's examine its components. Here's the expression, as I've written it:

ID.*d>\\n\\n<td>(\d*).*Status.*d>\\n\\n<td>(\w*).*Permit Issued:.*<td>(.*)<\\\/td>.*Location.*<td>([^,]*),([^,]*), (.*)

This looks complicated, I know, but let's rebuild it step by step.

I know there are six fields I want to capture here, starting with the ID. So to begin, I want my expression to match whenever it finds the character string ID.

ID

If I wanted to, I could type out all the junk in between the ID and the data I want, but that's too time-consuming. Instead, I can make my job easier by combining ., which matches any single character, with *, which tells the program to match the preceding character as many times as possible (including zero). So whenever you see .*, you know you're telling the program to look for an unlimited run of any characters.

ID.*

Because we always want our expression in this section to match the ID data, we want to make sure the next section is a distinct string of text that always precedes it, so after a little experimenting, we add d>\\n\\n<td>. Notice the extra backslashes here: they're used to "escape" characters that regex typically interprets as special instructions -- in this case, we want to match the literal \n sequences that appear in the file.

ID.*d>\\n\\n<td>

Next, we can use parentheses to "capture" our data so we can spit it out later. We know that the ID is going to be a number, so we can use \d to capture a digit, and combine it with * to capture as many digits as are present. So now, we have:

ID.*d>\\n\\n<td>(\d*)

All we need is to tell our expression when to stop matching, so we can cap it off with .*Status to signal we're moving on to the next piece of the match.

ID.*d>\\n\\n<td>(\d*).*Status

In this way, we can methodically guess and test which sequences of patterns will give us a match and capture our data. There are many ways to write a regular expression, and you should experiment a bit with what works and what doesn't (mine works, but probably isn't the most efficient).
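You can run the same guess-and-test loop in Python if Rubular isn't handy. Here, a tiny fragment of the description field serves as the test string:

import re

# A fragment of the description field; the double backslashes in this Python
# string produce the literal \/ and \n sequences that appear in the file
sample = '<td>ID:<\\/td>\\n\\n<td>2<\\/td>'

# The first chunk of our pattern: match 'ID', skip junk, capture the digits
m = re.search(r'ID.*d>\\n\\n<td>(\d*)', sample)
print(m.group(1))  # -> 2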

It helps, of course, that Rubular includes a handy regex cheat sheet right on the page!

Step 6: Applying the expression to replace

Regex button

Once you've built your regular expression, you can jump back to Sublime to apply it.

Superpowering your find-and-replace just requires that you activate the "Regular Expression" button in the bottom left.

Paste your full regular expression in the "Find What" field in Sublime. You should see the individual matches highlighted, and they'll correspond with each discrete "row" of data.

In Sublime, you can use the $ symbol with a numeral to stand in for the data you've captured. Because we had six fields, that means $1 through $6. We should separate our fields with commas, and because we also know our "status," "permit_issued" and "location" fields are text data, we should surround them with quotes. Our "Replace With" field, then, should look like this: $1,"$2","$3","$4",$5,$6.

Click "Replace All" to clean out the junk from all 364 pieces of data instantly!

Using regex to find and replace
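For the scripted route, here's a rough Python equivalent of this final replace: the same pattern, with backreferences \1 through \6 standing in for $1 through $6, picking up the hypothetical partial_clean.txt from Step 4:

import re

# The full pattern from Step 5, split across lines for readability
pattern = re.compile(
    r'ID.*d>\\n\\n<td>(\d*).*Status.*d>\\n\\n<td>(\w*)'
    r'.*Permit Issued:.*<td>(.*)<\\\/td>'
    r'.*Location.*<td>([^,]*),([^,]*), (.*)'
)

with open('partial_clean.txt') as f:
    text = f.read()

# Like Sublime, Python's . won't cross real newlines, so this assumes each
# feature sits on its own line in the file
clean = pattern.sub(r'\1,"\2","\3","\4",\5,\6', text)

with open('sandbags.csv', 'w') as f:
    f.write(clean)

# You'll still need to swap in the header row and trim leftover junk (see below)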

The resulting data is almost perfect, although you'll want to replace the first few lines of junk with the header row and delete a bit of leftover nastiness from the bottom of the file.

IMPORTANT: It's always a good idea to spot check your clean data against the original source. Make sure you have the correct number of rows (sandbags in our case) and randomly pull 10 to 15 entries to make sure your information didn't get corrupted or jumbled.
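The row-count check is easy to script, for instance:

# Hypothetical filename for the finished CSV
with open('sandbags.csv') as f:
    rows = [line for line in f if line.strip()]

print(len(rows) - 1)  # minus the header row; we expect 364 sandbags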

After these final cleanup steps, you can save the file as a CSV and open it right up in Excel. Now you're ready to do your analysis, and you no longer have to wait for a PIO to fulfill your records request to get to work.

The great thing about attacking the acquisition of data from multiple sides is that you can compare the two sets of information to each other once your request is finally fulfilled, and that can sometimes raise interesting questions. Is the data online the same or different? Is the agency providing a more up-to-date version? Are there more fields kept internally that aren't released publicly via the online map?

Answering these questions can help you better understand the data, and might even provide a more interesting avenue to explore.

Happy hunting and cleaning!

Want to join Computer-Assisted Reporting Study Hall? Fill out our interest form and tell us what you're most interested in learning.