a walkthrough for extracting and manipulating data from opencontext.org
Search for something interesting. I put ‘poggio’ in the search box, and then clicked on the various options to get the architectural fragments. Look at the URL:
See all that stuff after the word ‘Poggio’? That’s to generate the map view. We don’t need it.
We’re going to ask for the search results without all of the website extras: no maps, no shiny interface. To do that, we take advantage of the API. With Open Context, if you have a search with a ‘?’ in the URL, you can put .json in front of the question mark, and delete all of the stuff from the # sign on, like so:
Put that in the address bar. Boom! Lots of stuff! But only one page’s worth, which isn’t lots of data. To get a lot more data, we have to add another parameter, the number of rows: rows=100&. Slot that in right after the ‘?’, just before the p in prop=, and see what happens.
Now, that isn’t all of the records though. Remove the .json and see what happens when you click on the arrows to page through the NEXT 100 rows. You get a URL like this:
So, to recap: the URL is searching for 100 rows at a time, in the general object category, starting from row 100, and grabbing materials from Poggio. We now know enough about how Open Context’s API works to grab material.
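To make the recap concrete, here is a minimal Python sketch that assembles the same kind of URL from its parts. The helper name build_search_url is my own invention for illustration, not something Open Context provides; the parameter names (rows, prop, start, q) are simply the ones visible in the URLs in this walkthrough.

```python
from urllib.parse import urlencode

# Base address of the JSON search endpoint, as used in this walkthrough.
BASE = "https://opencontext.org/subjects-search/.json"

def build_search_url(start=0, rows=100, query="Poggio",
                     prop="oc-gen-cat-object---oc-gen-cat-arch-element"):
    """Assemble the URL for one page of results (hypothetical helper)."""
    params = {"rows": rows, "prop": prop}
    if start:
        # The first page needs no start parameter at all.
        params["start"] = start
    params["q"] = query
    return BASE + "?" + urlencode(params)

# The second page of 100 architectural fragments from Poggio.
print(build_search_url(start=100))
```

Changing start moves you through the pages; changing query or prop would point the same sketch at a different search.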
- You could copy n’ paste, but that will only get you one page’s worth of data (and if you tried to put, say, 10791 into the ‘rows’ parameter, you’ll just get a time-out error). You’d have to go back to the search page, hit the ‘next’ button, reinsert the .json etc. over and over again.
- Automatically. We’ll use a program called wget to do this. (To install wget on your machine, see the Programming Historian.) Wget will interact with the Open Context site to retrieve the data. We feed wget a file that contains all of the URLs that we wish to grab, and it saves the data it retrieves into your folder. So, open a new text file and paste our search URL in there like so:
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=100&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=200&q=Poggio
…and so on until we’ve covered the full 4000 objects. Tedious? You bet. So we’ll get the computer to generate those URLs for us. Open a new text file, and copy the following in:
#URL-Generator.py
urls = ''
f = open('urls.txt', 'w')
# start=0 is the first page; each pass through the loop moves on by 100 rows
for x in range(0, 4000, 100):
    urls = 'https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=%d&q=Poggio\n' % (x)
    f.write(urls)
f.close()  # the parentheses matter: without them the file is never actually closed
and save it as url-generator.py. This program is in the Python language. If you’re on a Mac, it’s already installed. If you’re on a Windows machine, you’ll have to download and install it. To run the program, open your terminal (Mac) or command prompt (Windows) and make sure you’re in the same folder where you saved the program. Then type at the prompt:
This little program defines an empty container called ‘urls’; it then creates a new file called ‘urls.txt’; then we tell it to write the address of our search into the urls container. See the %d in there? The program fills in a starting row there, beginning at 0 and stopping short of 4000; it counts by 100, so that each time it goes through the loop, it adds a new address with the correct starting point! Each address is written into the file urls.txt as it is generated. Go ahead, open it up, and you’ll see.
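If your own search returns a different number of records, the counting-by-100 logic can be worked out rather than hard-coded. This is only a sketch, assuming the first page sits at offset 0; page_starts is a made-up helper name, and you would read the total off your own search results page.

```python
import math

def page_starts(total, rows=100):
    """Return the start offset of every page needed to cover `total` records."""
    pages = math.ceil(total / rows)  # a partial final page still needs a request
    return [page * rows for page in range(pages)]

starts = page_starts(4000)
print(len(starts), starts[:3], starts[-1])
# → 40 [0, 100, 200] 3900
```

So 4000 objects at 100 rows a page means 40 requests, the last one starting at row 3900.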
Now we’ll feed it to wget like so. At the prompt in your terminal or command line, type:
wget -i urls.txt -r --no-parent -nd -w 2 --limit-rate=10k
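If you can’t get wget installed, roughly the same polite download can be sketched in Python with only the standard library. To be clear, this swaps in a different tool and is not a translation of the wget command: the function names and the page-000.json naming scheme are my own assumptions, it pauses two seconds between requests in the spirit of -w 2, and it makes no attempt to imitate --limit-rate.

```python
import os
import time
import urllib.request

def fetch(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def download_all(url_file="urls.txt", out_dir=".", fetcher=fetch, pause=2):
    """Save each URL listed in url_file as page-000.json, page-001.json, ..."""
    with open(url_file) as f:
        # One URL per line; skip any blank lines.
        urls = [line.strip() for line in f if line.strip()]
    saved = []
    for i, url in enumerate(urls):
        name = os.path.join(out_dir, "page-%03d.json" % i)
        with open(name, "wb") as out:
            out.write(fetcher(url))
        saved.append(name)
        time.sleep(pause)  # be gentle with the server between requests
    return saved
```

Calling download_all() in the folder that holds urls.txt would fetch every page in turn. Files saved this way already end in .json, so the renaming step doesn’t apply to them. If you’re sticking with wget, carry on as follows.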
You’ll end up with a lot of files that have no file extension in your folder, eg,
Select all of these and rename them in your Finder (instructions) or Windows Explorer (instructions), such that they have a sensible file name and the extension .json. We are now going to concatenate these files into a single, properly formatted .json file. (Note that it is possible for wget to push all of the downloaded information into a single json file, but it won’t be a properly formatted json file; it’ll just be a bunch of lumps of different json hanging out together, which we don’t want.)
npm install -g json-concat (Mac users, you might need sudo npm install -g json-concat).
This installs the json-concat tool. We’ll now join our files together:
# As simple as this. Output file should be last
$ json-concat file1.json file2.json file3.json file4.json output.json
… for however many json files you have.
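If you’d rather not install anything through npm, the same idea can be sketched in Python: load each file as real JSON, then write everything back out as one valid JSON document. I haven’t checked that this matches json-concat’s output structure; here each page simply becomes one item in a list, and combine_json_files is a name I made up for the example.

```python
import json

def combine_json_files(paths, out_path):
    """Load each JSON file and write them all out as a single JSON list."""
    pages = []
    for path in sorted(paths):  # sort so the pages stay in a predictable order
        with open(path, encoding="utf-8") as f:
            pages.append(json.load(f))  # fails loudly if a file isn't valid JSON
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(pages, out)
    return len(pages)
```

Something like combine_json_files(["file1.json", "file2.json"], "output.json") would do the job for two files. Because every file goes through json.load first, the result is one properly formatted JSON file and not a heap of separate lumps.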
You have now downloaded data from Open Context as json, and you’ve compiled that data into a single json file. This ability for data to be called and retrieved programmatically also enables things like the Open Context package for the R statistical software environment. If you’re feeling adventurous, take a look at that.
In Part Two I’ll walk you through using jq to massage the json into a csv file that can be explored in common spreadsheet software. (For a detailed lesson on jq, see the Programming Historian, which also explains why json in the first place.) Of course, lots of the more interesting data viz packages can deal with json itself, but more on that later.
And of course, if you’re looking for some quick and dirty data export, Open Context has recently implemented a ‘cloud download’ button that will export a simplified version of the data direct to csv on your desktop. Look for a little cloud icon with a down arrow at the bottom of your search results page. Now, you might wonder why I didn’t mention that at the outset, but look at it this way: now you know how to get the complete data, and with this knowledge, you could even begin building far more complicated visualizations or websites. It was good for you, right? Right? Right.