April 19, 2017
This post is part of a series of semi-regular walkthroughs and exercises connected with the Computer-Assisted Reporting Study Hall project. Want to join CAR Study Hall? Fill out our interest form and tell us what you're most interested in learning.
Application programming interfaces, or APIs, can be a great way to access valuable data.
But sometimes we just want to be able to pull that data into a spreadsheet so we can analyze it in programs like Excel. As we dive into the second part of last week's challenge for CAR Study Hall, this guide will walk you through how to use a simple program written in a code language called Python to download data from an API and save it as a CSV.
We'll be targeting the API from the U.S. Department of Education's College Scorecard data, which contains information on institutions of higher learning across the U.S. This walkthrough assumes you've already obtained an API key from data.gov and constructed a valid query for the College Scorecard API during part I of the exercise, so if you haven't go ahead and do that before you start.
Log in to your account in PythonAnywhere. After navigating to the Dashboard, click the "Files" tab. You should see a list of the files and directories in your account, which should be pretty bare-bones right now.
Click the button to "Upload a file" and add our collegescorecard_v2.py file you downloaded at the beginning of this exercise. This is our Python program, and it's the file we'll run to grab the data we need.
Click the "Consoles" tab in PythonAnywhere. If you're brand new to the service, you'll probably see a message that says "You have no consoles." We're going to start a new console by clicking the "Bash" link under "Start a new console."
You've now got a shiny new console you can use to execute code! Better yet, this one already has Python installed, which we can verify by typing the following command and hitting Enter/Return on your keyboard.
Let's also verify that we've uploaded our program by checking the contents of our present working directory. To do that, we'll type the following command and hit Enter/Return.
You should see two files, one a text file called "README.txt" and our college_scorecard_v2.py file.
Before we run our program, let's take a peek under the hood and see what it's doing. I've included comments -- lines that begin with the
# symbol -- to walk you through, step-by-step. Comments are ignored by the computer when it runs the program, so they're designed to be read by humans.
It's not important you understand every piece of the code. But one thing worth noticing is how good documentation can help us understand the inner workings of a program.
IMPORTANT: The vast majority of the code you write will probably be read only by you, even if you collaborate on a team. But don't assume that you'll remember what you were doing when you go to revisit the code one week, one month or one year later. I've spent many hours cursing past me for not adequately documenting code for the future and trying to reverse engineer whatever the hell I did.
Essentially, this program takes your API query, calculates the number of pages it needs to check and goes about writing each piece of data to a row in a spreadsheet. Thats it!
If this was a piece of code you were planning to use over and over and over again, it would probably be a good idea to build in some error checking. But since we're just taking a brief dive into Python here, that's outside the scope of this walkthrough.
Instead, we'll do some manual checking just to ensure the API query we've built is valid enough for our crude program to handle.
APIs allow developers (like us!) to request data programmatically. Most APIs, including the federal government's data.gov portal, require an API key -- a string of numbers and letters that helps them limit access to a set number of users so their systems aren't overwhelmed.
Once you have an API key, you can use it to build a query that specifies the exact data you want. Most APIs have a data dictionary -- like this one from the College Scorecard's documentation -- which developers use to construct their queries in the proper format according to the people who designed the API.
The query looks a lot like a URL for a website, and when you enter it into your browser, you'll get a response in the form of a structured text file (typically something like JSON format). Browser extensions like JSONView can help make these text files easier to read in your browser.
If, like me, you were interested in using the College Scorecard API to get a spreadsheet of each college's share of first-generation students, you would have a query that looks like this:
If you enter this query into your browser (substituting in your API key for "MY_API_KEY_HIDDEN"), you'll get something like this.
The API responds with JSON, or structured text, that includes an ID number, school name and share of first-generation students for each college or university, along with some additional metadata that describes attributes of the data itself (for example, how many results your query returned total).
If you're getting an error, check your API key or your field names to make sure they're valid. Consult the data dictionary and the documentation to find exactly what you're looking for and what to call it. And remember: Field names are case sensitive, so don't let that trip you up.
CLARIFICATION: If you read the College Scorecard documentation carefully, you might notice that the system does provide a CSV option, which seems a hell of a lot easier than learning how to run code. We're doing it this way for two reasons: 1.) We can be sure we pull all the data we want from multiple pages of returns, and 2.) Because this is a really great, concrete example of how Python and other programming languages can automate tedious tasks humans aren't great at, like clicking buttons dozens of times.
Back in your PythonAnywhere Bash console, it's time to run our program.
Type the following command, substituting your query (complete with API key) for
'YOUR_QUERY' below. This is an argument, and it allows the user to pass a variable directly to the program without altering the code. Make sure your query is surrounded by single quotes!
python collegescorecard_v2.py 'YOUR_QUERY'
If our query is valid, you should start to see the program displaying output -- all those lines where we told the program to "print" something to let us know where it's at in the process.
My query had to cycle through about 19 pages of data, which it does quickly despite the fact that we told the program to sleep for a few seconds before hitting the API service again.
This is good API citizenship, because it spaces out our requests in a good faith effort to avoid overloading the API service. It also avoids the risk of getting our program rate limited, essentially hitting a "speed limit" imposed by some API terms of service.
It's always a good idea to check the usage limits of the APIs you're using to make sure you're staying within the bounds of the terms of service.
When the program is finished, you should see it print "...done" followed by the return of your
Click on "Dashboard" to navigate away from your Bash console, and click the "Files" tab to check for our CSV. You should see the shiny new scorecard-data.csv listed in your files, which you can download and open right in Excel.
The code I've written here is quick and dirty, but it can make pretty quick work of gathering thousands of rows of data from this particular government website. You can also see how code like this could be tweaked slightly to work with many other APIs in exactly the same way.
Want to dig deeper into Python? Check out the documentation on my Coding with Bots workshop, which takes you through Python installation and Python basics on your own machine (Mac or PC) and includes additional resources for teaching yourself the ends and outs of the programming language.