April 19, 2017
This post is part of a series of semi-regular walkthroughs and exercises connected with the Computer-Assisted Reporting Study Hall project. Want to join CAR Study Hall? Fill out our interest form and tell us what you're most interested in learning.
Application programming interfaces, or APIs, can be a great way to access valuable data.
But sometimes we just want to be able to pull that data into a spreadsheet so we can analyze it in programs like Excel. As we dive into the second part of last week's challenge for CAR Study Hall, this guide will walk you through how to use a simple program written in a code language called Python to download data from an API and save it as a CSV.
We'll be targeting the API from the U.S. Department of Education's College Scorecard data, which contains information on institutions of higher learning across the U.S. This walkthrough assumes you've already obtained an API key from data.gov and constructed a valid query for the College Scorecard API during part I of the exercise, so if you haven't, go ahead and do that before you start.
Log in to your account in PythonAnywhere. After navigating to the Dashboard, click the "Files" tab. You should see a list of the files and directories in your account, which should be pretty bare-bones right now.
Click the button to "Upload a file" and add the collegescorecard_v2.py file you downloaded at the beginning of this exercise. This is our Python program, and it's the file we'll run to grab the data we need.
Click the "Consoles" tab in PythonAnywhere. If you're brand new to the service, you'll probably see a message that says "You have no consoles." We're going to start a new console by clicking the "Bash" link under "Start a new console."
You've now got a shiny new console you can use to execute code! Better yet, this one already has Python installed, which we can verify by typing the following command and hitting Enter/Return on your keyboard.
Let's also verify that we've uploaded our program by checking the contents of our present working directory. To do that, we'll type the following command and hit Enter/Return.
You should see two files: a text file called "README.txt" and our collegescorecard_v2.py file.
Before we run our program, let's take a peek under the hood and see what it's doing. I've included comments -- lines that begin with the # symbol -- to walk you through, step by step. Comments are ignored by the computer when it runs the program, so they're designed to be read by humans.
It's not important you understand every piece of the code. But one thing worth noticing is how good documentation can help us understand the inner workings of a program.
IMPORTANT: The vast majority of the code you write will probably be read only by you, even if you collaborate on a team. But don't assume that you'll remember what you were doing when you go to revisit the code one week, one month or one year later. I've spent many hours cursing past me for not adequately documenting code for the future and trying to reverse engineer whatever the hell I did.
Essentially, this program takes your API query, calculates the number of pages it needs to check and goes about writing each piece of data to a row in a spreadsheet. That's it!
If this were a piece of code you were planning to use over and over and over again, it would probably be a good idea to build in some error checking. But since we're just taking a brief dive into Python here, that's outside the scope of this walkthrough.
Instead, we'll do some manual checking just to ensure the API query we've built is valid enough for our crude program to handle.
APIs allow developers (like us!) to request data programmatically. Most APIs, including the federal government's data.gov portal, require an API key -- a string of numbers and letters that helps them limit access to a set number of users so their systems aren't overwhelmed.
Once you have an API key, you can use it to build a query that specifies the exact data you want. Most APIs have a data dictionary -- like this one from the College Scorecard's documentation -- which developers use to construct their queries in the proper format according to the people who designed the API.
The query looks a lot like a URL for a website, and when you enter it into your browser, you'll get a response in the form of a structured text file (typically something like JSON format). Browser extensions like JSONView can help make these text files easier to read in your browser.
If, like me, you were interested in using the College Scorecard API to get a spreadsheet of each college's share of first-generation students, you would have a query that looks like this:
If you enter this query into your browser (substituting in your API key for "MY_API_KEY_HIDDEN"), you'll get something like this.
The API responds with JSON, or structured text, that includes an ID number, school name and share of first-generation students for each college or university, along with some additional metadata that describes attributes of the data itself (for example, how many results your query returned total).
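To make that shape concrete, here's a canned response in the same structure. The values are made up and the dotted field names are illustrative (check the data dictionary for the real ones), but the metadata/results split mirrors what the API sends back:

```python
# A canned response mirroring the JSON structure described above;
# every value here is illustrative, not real College Scorecard data
response = {
    "metadata": {"total": 1843, "page": 0, "per_page": 100},
    "results": [
        {"id": 999001,
         "school.name": "Example State University",
         "2014.student.share_firstgeneration": 0.41},
    ],
}

total = response["metadata"]["total"]        # results across all pages
per_page = response["metadata"]["per_page"]  # results per page
pages = -(-total // per_page)                # ceiling division: 19 pages
```

That "total" figure in the metadata is exactly what a program needs to work out how many pages it has to request.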
If you're getting an error, check your API key or your field names to make sure they're valid. Consult the data dictionary and the documentation to find exactly what you're looking for and what to call it. And remember: Field names are case sensitive, so don't let that trip you up.
CLARIFICATION: If you read the College Scorecard documentation carefully, you might notice that the system does provide a CSV option, which seems a hell of a lot easier than learning how to run code. We're doing it this way for two reasons: 1.) We can be sure we pull all the data we want from multiple pages of returns, and 2.) Because this is a really great, concrete example of how Python and other programming languages can automate tedious tasks humans aren't great at, like clicking buttons dozens of times.
Back in your PythonAnywhere Bash console, it's time to run our program.
Type the following command, substituting your query (complete with API key) for 'YOUR_QUERY' below. This is an argument, and it allows the user to pass a variable directly to the program without altering the code. Make sure your query is surrounded by single quotes!
python collegescorecard_v2.py 'YOUR_QUERY'
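Under the hood, Python exposes command-line arguments through sys.argv. Here's a sketch of how a script like this one might read its query (not necessarily the program's exact code):

```python
import sys

def get_query(argv):
    # argv[0] is the script name; argv[1] is the first argument --
    # our API query, single-quoted on the shell so special characters
    # like & aren't interpreted by Bash
    if len(argv) < 2:
        sys.exit("usage: python collegescorecard_v2.py 'YOUR_QUERY'")
    return argv[1]

# In the real script, you'd call get_query(sys.argv)
```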
If our query is valid, you should start to see the program displaying output -- all those lines where we told the program to "print" something to let us know where it's at in the process.
My query had to cycle through about 19 pages of data, which it does quickly despite the fact that we told the program to sleep for a few seconds before hitting the API service again.
This is good API citizenship, because it spaces out our requests in a good faith effort to avoid overloading the API service. It also avoids the risk of getting our program rate limited, essentially hitting a "speed limit" imposed by some API terms of service.
It's always a good idea to check the usage limits of the APIs you're using to make sure you're staying within the bounds of the terms of service.
When the program is finished, you should see it print "...done" followed by the return of your command prompt.
Click on "Dashboard" to navigate away from your Bash console, and click the "Files" tab to check for our CSV. You should see the shiny new scorecard-data.csv listed in your files, which you can download and open right in Excel.
The code I've written here is quick and dirty, but it can make pretty quick work of gathering thousands of rows of data from this particular government website. You can also see how code like this could be tweaked slightly to work with many other APIs in exactly the same way.
Want to dig deeper into Python? Check out the documentation on my Coding with Bots workshop, which takes you through Python installation and Python basics on your own machine (Mac or PC) and includes additional resources for teaching yourself the ins and outs of the programming language.
March 31, 2017
I spend a lot of time in front of my machines -- both at work and at home -- personalizing them with tools that make my job easier and my work more effective.
The list below is an attempt to catalogue what I use in my work on a regular basis. I'm not a power user in all of these tools -- some I might only dust off every month or two. But it's helpful to get a sense of the full scope of everything I've installed for the purpose of journalism.
If you've got a brand new machine, this post on setting your Mac up for development is a great start (the NPR visuals team updates it from time to time). Another useful guide is here, although I don't use everything on either list.
I'll add new stuff here as I install it.
ffmpeg for making GIFs and working with video
PDFtk server for working with PDFs
Wget for downloading files recursively
Homebrew for installing programs for Mac
Python for running code
virtualenv & virtualenvwrapper (with pip) for creating virtual environments
Git for version control and Github interface
Tabula for converting PDFs to spreadsheets
Sequel Pro for working with large databases and MySQL files
LICEcap for creating GIFs from screen captures
QGIS for mapping
Excel for working with spreadsheets
ImageOptim for optimizing images for the Web
RStudio for working in R
Cyberduck for FTPing files
GIMP for editing photos
Audacity for editing audio
Slack for communicating with teams
SublimeText for editing text and writing code
Chrome for Web browsing
Firefox for specialized usage
Arduino IDE for working with hardware
Dropbox for file storage (paid subscription)
OpenRefine for cleaning data
Jupyter for sharing live code
Pocket for saving articles (Chrome)
Scraper for capturing simple tables on Web pages (Chrome)
JSONView for looking at JSON files (Chrome)
OneTab for storing all my tabs from a browser session in one place (Chrome)
DownThemAll to easily download all links on a page (Firefox)
VideoDownloadHelper to download videos from the Web (Firefox)
Wireservice for data analysis and tools to work with CSVs
Django for building apps quickly
Tor for browsing anonymously
HideMyAss for encrypting network traffic via VPN (paid subscription)
PANDA for storing data
Klaxon for alerting users to changes to Web pages (forthcoming)
DocumentCloud for storing, annotating and sharing documents
Overview for exploring huge document dumps
Mapshaper for optimizing and converting shapefiles
ColorBrewer for selecting map colors
DiffChecker for checking the differences between two texts
Mr. Data Converter for converting spreadsheet data into JSON
March 23, 2017
Note from @mtdukes: This post is part of a series of semi-regular walkthroughs and exercises connected with the Computer-Assisted Reporting Study Hall project. Want to join CAR Study Hall? Fill out our interest form and tell us what you're most interested in learning.
The group shared several cool requests, including one from Rossie Izlar for the data behind this interactive map of oceanfront sandbags along the coast of North Carolina.
This guide will walk you through the process.
Begin by using "Inspect Element" to peek at the code on the map page. In Chrome, this is available in the right-click menu.
After navigating to that page via our browser, we'll automatically download a KMZ, or Keyhole Markup Language Zipped file, which is a fairly common format used in applications like Google Earth.
You can open it with a text editor, but there's a ton of junk in there we don't need related to the map data. So let's get rid of that.
There are a few online applications that can probably help, but my go-to for stuff like this is a mapping application called QGIS. It's free, incredibly full-featured and much less buggy than it used to be. And although there's a lot it can do, we're really only going to use it to convert the KMZ file to a format called GeoJSON, or geospatial JSON (which is itself just another filetype for structuring data).
Start up QGIS and click the "Add Vector Layer..." button along the left toolbar. You can then upload your KMZ file.
With your layer loaded, right-click on it in your layer menu on the left and click "Save as..." Change the file format to "GeoJSON," but keep all the other settings the same.
So there we are.
We can use Sublime to look at the contents of the file, which is still filled with a bunch of useless garbage. A quick look at the original online map shows us why.
For each of the data elements -- in this case, sandbags -- there's an HTML table encoded into the file so it looks nice on the Web page when you click on the map.
PROTIP: Things that are designed to look nice for humans don't make good structured data.
It's important to understand what we mean when we say a JSON or GeoJSON file is "structured." If we paste the text into a simple JSON viewer like this one, we'll see something that looks like the image below.
You can see these expandable and collapsible subgroups, and in particular, there's a category called "features" that contains hundreds of data elements. These are our sandbags, and clicking on each will show us that junky HTML table embedded under "properties" and "description." We could certainly clean this up to turn it into a more useful JSON file.
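That structure also means the file can be walked programmatically. Here's a minimal sketch using Python's built-in json module, with a tiny made-up FeatureCollection standing in for the real file (the coordinates and description are illustrative):

```python
import json

# A tiny GeoJSON sample mirroring the structure described above;
# the real file has hundreds of features in its "features" list
text = '''{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"description": "<html>...junky table...</html>"},
     "geometry": {"type": "Point",
                  "coordinates": [-78.0033, 33.8584, 0.0]}}
  ]
}'''

geo = json.loads(text)
for feature in geo["features"]:
    desc = feature["properties"]["description"]        # the embedded HTML
    lng, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order: lng, lat
```

Note that GeoJSON lists longitude before latitude, which trips people up constantly.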
But instead, let's try to build a comma-separated value file -- something we can easily open in Excel. What we want to do is take the individual pieces of information out of that HTML table and store them in a row, along with other useful info like the coordinates.
And for that, we can use regular expressions: strings of text characters that specify a search pattern, which programs like text editors can use to find, delete and/or replace stuff you don't need. It's like a slightly more complicated version of doing a Find-Replace in Word to change all instances of "it's" to "it is".
The beauty is that we can do this find and replace all at once, instead of hundreds of times, and we can make it pretty complex. But first we need to find a search pattern before we can translate it into regex.
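For a flavor of what that looks like in practice, here's the Word-style find-and-replace written as a regular expression in Python (just a sketch -- the actual cleanup in this walkthrough happens in Sublime):

```python
import re

# The regex version of Word's Find-Replace:
# change every "it's" (any capitalization) to "it is"
text = "It's true: it's everywhere."
fixed = re.sub(r"it's", "it is", text, flags=re.IGNORECASE)
# fixed is now "it is true: it is everywhere."
```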
We're focused on the "description" field, which looks a little something like this:
"description": "<html xmlns:fo=\"http:\/\/www.w3.org\/1999\/XSL\/Format\" xmlns:msxsl=\"urn:schemas-microsoft-com:xslt\">\n\n<head>\n\n<META http-equiv=\"Content-Type\" content=\"text\/html\">\n\n<meta http-equiv=\"content-type\" content=\"text\/html; charset=UTF-8\">\n\n<\/head>\n\n<body style=\"margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;\">\n\n<table style=\"font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px\">\n\n<tr style=\"text-align:center;font-weight:bold;background:#9CBCE2\">\n\n<td><\/td>\n\n<\/tr>\n\n<tr>\n\n<td>\n\n<table style=\"font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px\">\n\n<tr>\n\n<td>ID:<\/td>\n\n<td>2<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Status:<\/td>\n\n<td>Submerged<\/td>\n\n<\/tr>\n\n<tr>\n\n<td>Permit Issued:<\/td>\n\n<td>N\/A<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Location:<\/td>\n\n<td>Bald Head Island<\/td>\n\n<\/tr>\n\n<\/table>\n\n<\/td>\n\n<\/tr>\n\n<\/table>\n\n<\/body>\n\n<\/html>\n\n"
We don't need all that HTML code in there -- that we know. But if you look closely, you can see there's data we do want. In this case, it's the content in the HTML table element.
<table style=\"font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px\">\n\n<tr>\n\n<td>ID:<\/td>\n\n<td>2<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Status:<\/td>\n\n<td>Submerged<\/td>\n\n<\/tr>\n\n<tr>\n\n<td>Permit Issued:<\/td>\n\n<td>N\/A<\/td>\n\n<\/tr>\n\n<tr bgcolor=\"#D4E4F3\">\n\n<td>Location:<\/td>\n\n<td>Bald Head Island<\/td>\n\n<\/tr>\n\n<\/table>\n\n<\/td>\n\n<\/tr>\n\n<\/table>
Given this sample and the additional coordinate information, we can start to visualize what we want our data to look like once we've cleaned it up in a CSV file.
id, status, permit_issued, location, lat, lng
2, "Submerged", "N/A", "Bald Head Island", 33.858446134464273, -78.003299912361328
3, "Submerged", "N/A", "Bald Head Island", 33.858446134464273, -78.003299912361328
5, "Installed", "36859D", "Bald Head Island", 33.863370187627197, -78.007940634957222
Notice we're using simple field names (no spaces or caps), following good practice for structuring data in spreadsheets.
Open the GeoJSON file in Sublime Text, navigate to the Find menu and click on "Replace..." This will open up the find-replace dialogue at the bottom of the window.
We can use this to get rid of the header junk that precedes the "ID" field on every single one of our sandbag entries, replacing it with ... nothing! Highlight the section and copy it, then paste it into the "Find What" field. Leave the "Replace With" field blank. Click "Replace All."
We can do the same thing to the junk between the actual data in the HTML table and the geometric coordinates, which includes some holdovers from the mapping coordinates. But instead of leaving the "Replace With" field blank, we're going to add a comma. This will separate our "location" field from our "lat" field in the spreadsheet. So before you "Replace All", your screen should look something like this.
Getting closer! Generally, stuff at the beginning and the end of the data we want is easy to find and replace. We can do the find-and-replace again with these characters -- , 0.0 ] } }, -- that appear after our longitude coordinate and before our next ID entry. In this case, we can leave the "Replace With" field blank again.
When data is intermingled with junk, regex can really come in handy. That's because it can also capture variables between patterns and preserve them when you do the replace.
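Here's a tiny example of that capture-and-preserve behavior in Python; Sublime's find-and-replace works the same way, though it writes backreferences as $1 rather than \1:

```python
import re

# Capture the number between the tags and keep it through the replace:
# \1 in the replacement refers back to the first captured group
messy = "<td>ID:</td><td>42</td>"
clean = re.sub(r"<td>ID:</td><td>(\d+)</td>", r"id,\1", messy)
# clean is now "id,42"
```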
CONFESSION: I use regex too infrequently to remember all the details, so I frequently lean on a site called Rubular, which helps you test out different regex patterns on your data.
For this problem, I took a small sample of the data and pasted it right into the "Your test string" window. Then I used the top line to start building my regular expression.
If you take a closer look at my test here, you'll notice the site also shows "Match groups," which looks exactly like the data we want.
Herein lies the beauty of regex: We can capture this data and essentially strip it from the junk, giving us a clean data set.
But before we apply our regular expression, let's examine its components. Here's the expression, as I've written it:
ID.*d>\\n\\n<td>(\d*).*Status.*d>\\n\\n<td>(\w*).*Permit Issued:.*<td>(.*)<\\\/td>.*Location.*<td>([^,]*),([^,]*), (.*)
This looks complicated, I know, but let's rebuild it step by step.
I know there are six fields I want to capture here, starting with the ID. So to begin, I want my expression to match whenever it finds the character string ID. If I wanted to, I could type out all the junk in between the ID and the data I want, but that's too time-consuming. Instead, I can make my job easier by combining ., which matches any single character, with *, which tells the program to match as many of the preceding character as possible. So whenever you see .*, you know you're telling the program to look for an unlimited number of any character.
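A quick illustration of that wildcard behavior (the strings here are made up for the demo):

```python
import re

# "." matches any one character and "*" repeats the preceding token,
# so "ID.*2" matches ID followed by any amount of junk, then a 2
assert re.search(r"ID.*2", "ID:</td>assorted junk<td>2") is not None
assert re.search(r"ID.*2", "ID with junk before a 2!") is not None
assert re.search(r"ID.*2", "no match here") is None
```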
Because we always want our expression in this section to match with the ID data, we want to make sure the next section is a distinct string of text that always precedes it, so after a little experimenting, we add d>\\n\\n<td>. Notice the extra backslashes here: they're used to "escape" characters that regex typically interprets as special instructions -- in this case, we want to match the literal backslashes that appear in the text itself.
Next, we can use parentheses to "capture" our data so we can spit it out later. We know that the ID is going to be a number, so we can use \d to capture a digit, and combine it with * to capture as many digits as are present. So now, we have:
All we need is to tell our expression when to stop matching, so we can cap it off with .*Status to signal we're moving on to the next piece of the match.
In this way, we can methodically guess and test which sequences of patterns will give us a match and capture our data. There are many ways to write a regular expression, and you should experiment a bit with what works and what doesn't (mine works, but probably isn't the most efficient).
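You can run the same guess-and-test loop outside Rubular, too. Here's a simplified Python version against a stripped-down sample of one entry -- note this pattern is simplified from the one above (Python doesn't need the \/ escapes, the sample omits the literal \n\n runs, and only four of the six fields appear):

```python
import re

# A stripped-down slice of one "description" entry; whitespace and
# the \n\n runs from the real file are omitted for readability
sample = ('<td>ID:</td><td>2</td><td>Status:</td><td>Submerged</td>'
          '<td>Permit Issued:</td><td>N/A</td>'
          '<td>Location:</td><td>Bald Head Island</td>')

# Same idea as the expression above: junk-skipping wildcards between
# capture groups (lat/lng would be two more groups in the real file)
pattern = (r'ID:</td><td>(\d+).*?Status:</td><td>(\w+)'
           r'.*?Permit Issued:</td><td>(.*?)</td>'
           r'.*?Location:</td><td>([^<]*)')

m = re.search(pattern, sample)
row = '{},"{}","{}","{}"'.format(*m.groups())
# row is now: 2,"Submerged","N/A","Bald Head Island"
```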
It helps, of course, that Rubular includes a handy regex cheat sheet right on the page!
Once you've built your regular expression, you can jump back to Sublime to apply it.
Superpowering your find-and-replace just requires that you activate the "Regular Expression" button in the bottom left.
Paste your full regular expression in the "Find What" field in Sublime. You should see the individual matches highlighted, and they'll correspond with each discrete "row" of data.
In Sublime, you can use the $ symbol with a numeral to stand in for the data you've captured: $1 refers to the first captured group, $2 to the second and so on, up to $6 for our six fields. We should separate our fields with commas, and because we also know our "status," "permit_issued" and "location" fields are text data, we should surround them with quotes. Our "Replace With" field, then, should look like this:
Click "Replace All" to clean out the junk from all 364 pieces of data instantly!
The resulting data is almost perfect, although you'll want to replace the first few lines of junk with the header row and delete a bit of leftover nastiness from the bottom of the file.
IMPORTANT: It's always a good idea to spot check your clean data against the original source. Make sure you have the correct number of rows (sandbags in our case) and randomly pull 10 to 15 entries to make sure your information didn't get corrupted or jumbled.
After these final cleanup steps, you can save the file as a CSV and open it right up in Excel. Now you're ready to do your analysis, and you no longer have to wait for a PIO to fulfill your records request to get to work.
The great thing about attacking the acquisition of data from multiple sides is that you can compare the two sets of information to each other once your request is finally fulfilled, and that can sometimes raise interesting questions. Is the data online the same or different? Is the agency providing a more up-to-date version? Are there more fields kept internally that aren't released publicly via the online map?
Answering these questions can help you better understand the data, and might even provide a more interesting avenue to explore.
Happy hunting and cleaning!
Want to join Computer-Assisted Reporting Study Hall? Fill out our interest form and tell us what you're most interested in learning.
March 2, 2017
One of the things I knew I wanted to do when I came to Harvard this year for the Nieman Fellowship was to get a better handle on a few topics I often struggle to understand and apply, namely statistics and topics in artificial intelligence.
I took courses in both last semester. But it helps to have good reference guides, especially when they're clear and concise. So one lesson I've learned this week: When it comes to creating crib sheets, never underestimate the abilities of undergraduates.
Over at Brown University, senior Daniel Kunin created a site called Seeing Theory, a beautiful visual introduction to probability and statistics. The site includes the exploration of topics from basic probability to linear regression (all using the D3 library). The visualizations are also interactive, which really allows you to get a sense of how the theories of statistics apply in practice.
Here at Harvard, senior Noah Yonack wrote up a great primer on machine learning for those of us without an extensive background in computer science and mathematics. It's 13 pages long, but does a great job introducing the basic vocabulary and techniques.
Right up front, it gets at a really fundamental question I had: What's the difference between machine learning and artificial intelligence?
The full document is below, shared with permission from Noah (thanks Noah!).
by Noah Yonack