I got 99 reps to scrape so I learned me some Python
Published 2011-12-30
Updated: I've added some excellent information and knowledge from Ryan Pitts, who took the time to walk me through a couple things on a Saturday morning. Much thanks.
In getting ready for the 2012 elections, and politics in general, it makes sense to grab data and information about Wisconsin’s state senators and state representatives.
With last summer’s senate recall elections, I had already gone through and grabbed names, photo URLs, website links and contact information for the state’s 33 senators.
On each senator’s state webpage, state capitol and district contact information and biographies are also listed.
To add those I bit the bullet (there are only 33 after all, and speed was a factor) and grabbed that information using ImportHtml in a Google spreadsheet, combined with some old-fashioned find and replace and copy and paste.
Here’s an example:
ImportHtml("http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx?house=Senate&district=1", "table", "7")
But the state assembly. Now there’s a daunting task. The state has 99 representatives, and the thought of the ImportHtml method and Ctrl-C, Ctrl-V … repeat … was not something I looked forward to.
So through Kevin Schaul’s Web scraping with Django tutorial I had played around a bit with Requests, a Python library. And through some of the really basic tutorials on ScraperWiki I had figured out a bit about lxml.
And today seemed like a good time to put it all together? And then some.
The basic scraper came together fairly easily… But scraping a URL, copying the content from the terminal into a spreadsheet, changing the URL … repeat … wasn’t much of an answer. So why not take the time to LEARN SOME PYTHON instead of just schlepping my way through a task?
Each state representative has a webpage that can easily be determined by the district number.
This URL belongs to Garey Bies of the 1st Assembly District:
http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx?house=Assembly&district=1
And to grab a bio for Garey Bies, the URL has “&display=bio” appended to the end.
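As a minimal sketch of that URL pattern (not the script from this post; the names `BASE_URL`, `district_url` and `all_bio_urls` are made up for illustration, and it assumes Python 3), building the bio URL for each of the 99 districts could look something like this:

```python
# Illustrative sketch only: these helper names are hypothetical.
BASE_URL = "http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx"


def district_url(district, house="Assembly", bio=False):
    """Build the page URL for one district, optionally the bio view."""
    url = "{0}?house={1}&district={2}".format(BASE_URL, house, district)
    if bio:
        # The bio page is the same URL with this suffix appended.
        url += "&display=bio"
    return url


def all_bio_urls(count=99):
    """Return bio URLs for districts 1 through count."""
    return [district_url(n, bio=True) for n in range(1, count + 1)]


if __name__ == "__main__":
    # Print the first few URLs as a sanity check before scraping anything.
    for url in all_bio_urls()[:3]:
        print(url)
```

Each URL in that list could then be handed to whatever does the fetching and parsing.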
The JavaScript I have learned over the past year gave me an idea on how I could combine a URL and a variable together.
And I remembered this little Python tutorial from last spring, so I knew a bit about looping through results. So all I had to do was figure out how to make all of this happen… And then write the output to a CSV?
Well I nearly pulled all that off, save for a couple things that didn’t quite work the way I had hoped — once I ran into encoding issues, I knew I was veering off the path — but to say this was a “foundational” learning experience is an understatement, and I accomplished the task I had at hand.
The code is below, and at this late hour it all looks like mush… but the comments — I comment everything and likely will until shamed into doing otherwise, but it's something that the reporter and editor in me believes in — walk through what is happening.
So I just let the bios of 99 reps output to the terminal, copied that to the text editor, did some find and replace, pasted into the spreadsheet and I was done… And am left feeling I learned a lot in the process. And if allowed to get a bit sappy... It was a really encouraging way to close out what has been a tremendous year for personal learning.
I did run into a handful of obstacles, and am not equipped to figure out a solution…
- It would have been slick to figure out how to add some regular expressions to format the output.
- My attempt to write to a CSV was successful, though every space in the output was replaced with a comma. There’s sure to be some formatting that could be used there.
- Even the attempt to write to a text file wasn’t truly successful, as I ran into the following error after the first pass: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 269: ordinal not in range(128). So I threw that plan away, but the code remains and is commented out.
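For the first two items, here is a rough sketch of one way it could go, assuming Python 3 and the standard `re` and `csv` modules. The `clean` and `rows_to_csv` helpers and the sample row are invented for illustration. One guess about the comma problem, and it is only a guess: `csv.writer.writerow` expects a list of fields, so anything that splits the text on spaces before it gets there would turn each word into its own column.

```python
import csv
import io
import re


def clean(text):
    """Collapse runs of whitespace (tabs, newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def rows_to_csv(rows):
    """Write (district, bio) pairs as CSV text, one row per representative."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["district", "bio"])
    for district, bio in rows:
        # Pass a list of fields so the whole bio stays in a single cell.
        writer.writerow([district, clean(bio)])
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder text, not a real bio.
    sample = [(1, "Placeholder bio,\n\twith   odd spacing.")]
    print(rows_to_csv(sample))
```

The writer quotes any field that contains a comma, so commas inside a bio don't break the columns.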
Thanks to Ryan Pitts, I've learned that adding .encode('utf-8') to data = el.text_content().strip() gets me past the error when trying to write the output to a text file. But as he points out, there will be other things to deal with, so I'll keep working on that...
@ChrisLKeller Adding that .encode() will make your write work fine, but you'll have stuff like 'Sheriff’s Dept.' to handle later.
— Ryan Pitts (@ryanpitts) December 31, 2011
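To make the encoding fix a little more concrete, here is a small sketch. It assumes Python 3, where strings are Unicode by default and the usual approach is to name the encoding when opening the file instead of calling `.encode()` on each string (the error above is from Python 2). The sample string, the helper names and the file path are placeholders.

```python
import os
import tempfile

# Placeholder text containing U+2019, the curly apostrophe from the error.
SAMPLE = "Sheriff\u2019s Dept."


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_text(path, text):
    """Write text to path as UTF-8, then read it back to confirm."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    print(can_encode(SAMPLE, "ascii"))  # the codec named in the error
    print(can_encode(SAMPLE, "utf-8"))
    path = os.path.join(tempfile.mkdtemp(), "bios.txt")
    print(write_text(path, SAMPLE) == SAMPLE)
```

Reading the file back with the same encoding is what keeps the apostrophe intact; opening it with a different one is where garbled characters tend to show up.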
And through Ryan I also learned about Beautiful Soup -- will need more time to look through that -- and some great resources on handling text in Python:
- Unicode How-To
- "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
Here's the updated code... As always, pointers, tips or links to learning resources are most welcome.