Why Political Finance Data Needs Open Source
19 Jan 2019

Did you know that information on the personal finances of anyone running for US Congress, the Presidency, and any number of other federal offices is available to the public? It’s true, and an organization has collected all this information, processed it, and put it into a single format and location for free! A perfect treasure trove for data scientists and researchers!
The only catch is… it sucks.
Well, OpenSecrets.org doesn’t exactly “suck”, but it has some relatively fatal flaws. In this post I describe my frustration with OpenSecrets and why their goals would be better served by moving their project to full open source.
Background
A few years ago, an event happened that made me realize that anyone can be successful in politics.
In the following years, a near-constant string of events has made me realize that I could probably make better decisions than many of these elected officials. But when I compare myself to successful politicians, I notice there’s one thing they have that I don’t: cash.
As a researcher, scientist, and data scientist, I wanted to ask a question: how much money do you need to get elected? I came into this area knowing literally nothing about political finances, how to get the data, or what I should have expected; I just wanted to ask a few questions and see what the data suggested. What followed was a series of yo-yo-like ups and downs, swinging from hopeful and impressed to disappointed and jaded.
OpenSecrets.org Data
I’ll preface this with the following: if you know a better place to get personal financial disclosure (PFD) data, please leave a comment below. I could have very well spent a lot of time looking in the wrong places. But here’s a very shortened summary of what I went through:
OpenSecrets.org, the website of the Center for Responsive Politics (CRP), is dedicated to tracking money in politics. If you’re looking for free, comprehensive data on politicians’ personal or campaign finances, they’re just about the only shop in town. The idea behind the organization’s work, helping reporters, researchers, and policy-makers sort through lobbying money, is brilliant and much needed. However, as with so many amazing projects for the public good, I’m pretty sure they’re also understaffed and underfunded.
Although their website is well organized and very user-friendly, it really only seems useful for higher-level summaries of politicians’ funding and assets. To really do anything right, whether it’s investigative journalism or a thorough analysis, you need the actual data. And let me tell you, their actual data is anything but well organized and user-friendly.
Criticism #1: No effort is spent making data readable
Here’s the thing: I know OpenSecrets has their data stored in a usable form for the backend of their website, given that they offer an API. I also know that, from pretty much any database format, it should be relatively trivial to output that data into pretty much any flavor of delimited file. So why do they publish it in the weirdest, jankiest format possible? It’s technically a comma-separated value file, but strings are quoted with “|”, nothing inside those strings is escaped, and some rows span multiple lines.
The choice of quoting character is odd, but not “sinful”. The fact that they don’t escape anything is. It just so happens that if you break things up by “|,”,1 it will “mostly” work. But the second a politician uses “|,” in a description of their assets, all their data fails to be machine-readable. And there are hundreds of no-effort libraries that can escape special characters, so the choice not to is baffling.
Although it probably would have been easier to read via Python, I wanted to use R. So I used Python to convert the files into something R could read, which ultimately came down to a single regex.2 And even then, I think there were still a few cases I had to correct by hand.
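For the curious, here’s a rough sketch of what that conversion can look like, built around the regex from footnote 2. The file name, the encoding, and the choice to finish the parsing with Python’s csv module (rather than handing the cleaned text to R, as I did) are all illustrative assumptions:

```python
import csv
import io
import re

def read_opensecrets_file(path):
    """Best-effort reader for OpenSecrets' pipe-quoted files (a sketch,
    not a guarantee; a few rows may still need fixing by hand)."""
    with open(path, encoding="latin-1") as f:  # encoding is a guess
        raw = f.read()
    # Field-boundary pipes sit next to commas or line ends; backslash-escape
    # every other pipe so it reads as data rather than as a quote character.
    cleaned = re.sub(r"(?<!,)(?<!^)(?<!\\)(?<!\n)\|(?!(,|$|\n))", "\\|", raw)
    # The csv module now treats "\|" inside a field as a literal pipe, and it
    # copes with the rows that span multiple lines.
    return list(csv.reader(io.StringIO(cleaned), quotechar="|", escapechar="\\"))
```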
Criticism #2: No “unit” testing, no standardization, no checking
It first dawned on me that maybe the CRP was underfunded when I was reading their documentation and kept finding spelling errors and typos: “Rerfers” for “refers”, “subsiquent” for “subsequent”, “is filer jointly with spouse” for “is file[d] jointly with spouse”, etc. These are mistakes that basic spell-checking should catch; what were they doing in so many “official” docs?
But the data has many problems that should have been caught by even semi-diligent sanity checks. There are small, relatively unimportant things, like inconsistent capitalization of coded values and whether NAs are represented by “ ” or “”, etc. But there are also bigger issues.
For example, OpenSecrets doesn’t list a winner for these three races in 2014, and for a bunch of races in other years. Or, according to the docs, there are only supposed to be four types of PFD reports, represented by “Y”, “A”, “N”, or “T”. So what the heck do the 124 entries marked “C” or “O” in the income data mean?
It would be trivial to write code that ensures the definitions outlined in the documentation actually apply to the data. Keeping the definitions and these tests together would also ensure that whenever the definitions were updated, the docs would be too. These basic sanity checks could also be easily expanded into other sorts of tests. For example, we know that each race needs to have exactly one winner; it took me about three lines of code to check for that. If, by some act of God, this weren’t always true, it’s so overwhelmingly true most of the time that we’d want to know what the exceptions are anyway. Or you could easily check that any candidate who won also ran in a primary,3 or that no one runs twice in the same primary for the same position. These problems could in theory be caught very easily, but there are other issues with the data that seem more insidious.
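To illustrate, here’s a minimal sketch of such checks using pandas. The column names (cycle, race_id, winner, report_type) are hypothetical stand-ins, not the actual OpenSecrets field names:

```python
import pandas as pd

# The documented PFD report types; "C" and "O" appear nowhere in the docs.
DOCUMENTED_REPORT_TYPES = {"Y", "A", "N", "T"}

def races_without_one_winner(races: pd.DataFrame) -> pd.Series:
    """Every race should have exactly one winner; return the exceptions."""
    winners = races.groupby(["cycle", "race_id"])["winner"].sum()
    return winners[winners != 1]

def undocumented_report_types(pfd: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose report type the documentation never defines."""
    return pfd[~pfd["report_type"].isin(DOCUMENTED_REPORT_TYPES)]
```

Run against every data release, checks like these would have flagged the missing winners and the mystery “C” and “O” reports before anyone ever downloaded them.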
Criticism #3: Lack of transparency – weird data
Although the Center for Responsive Politics aims to increase transparency in politics, its own methods are relatively opaque. In theory, this doesn’t have to be a problem if the organization always did everything right, but we all know that big projects like these invariably have screw-ups, and without transparency, these possible mistakes become quagmires for anyone using the data.
When trying to decipher the “C” and “O” PFD report types, I stumbled across data that, at first, just seemed wrong. Take, for example, the row with ID Z140277592 in the income data file, PFDincome.txt. It is purportedly from a 2014 filing by Trent Kelly, a Representative from Mississippi, and says his wife was paid $15,117 by “Grammer Inc.” The problem is, as far as I can tell, Kelly never filed in 2014, and none of his later filings mention Grammer Inc.
However, I did some sleuthing and found that his wife, Sheila Kelly, does work for a Grammer Inc. Further sleuthing (i.e., last-minute fact-checking) revealed that Kelly actually filed twice in 2015: once as a sitting member of the House of Representatives, and once as a candidate. His candidate filing matches the data OpenSecrets evidently used, but that data is clearly meant for 2015. Perhaps the “C” report type stands for “candidate” reports? So that… almost makes sense; it would seem like they just got the year wrong!
Sadly, no. First, there are essentially only 33 unique “C” reports in the assets data, dating back to 2009. Well over 200 candidates were defeated in House races in 2016 alone, so there should be many more candidate entries than that. Second, OpenSecrets has no PFD data on Rep. Kelly after “2014”, despite him still being a sitting House member! This connects back to the need for testing: it seems insane that OpenSecrets doesn’t check for current members of Congress with no information. Perhaps you can see why I stopped trusting their data.
Criticism #4: Lack of transparency – literally giving up
Did you know that since 2014, the Center for Responsive Politics no longer keeps track of PFDs that are filed via paper, despite the fact that they are available online? Neither did I! I only found out accidentally!
I learned that the CRP seems to have stopped paying people to manually enter PFD data in 2014, after I noticed that Dianne Feinstein doesn’t have any data past that point. It would be odd for an established member of Congress to just forget about filing, so I checked the Senate website: she has been filing, just non-electronically. The scans of her paper forms are available online as .pdf files.
However, most other Senators and candidates file electronically, which means their PFDs are available on the Senate website as easily parsable HTML data (which I have since scraped myself). Following a hunch that the reason the CRP doesn’t have any of her data is the format, I decided to lean hard into the stereotype that old people don’t know how the internet works. After all, Dianne Feinstein is the oldest current Senator, so if she resisted the new electronic filing, maybe some of the other elderly Senators had too.
My stereotype wasn’t totally correct, but I discovered similar gaps for James Inhofe, Chris Van Hollen, and Tom Cotton, who all file non-electronically as well. It would seem that the CRP has set up some sort of web scraper that automatically grabs PFD data from the Senate website, but has given up on keeping anything that isn’t already machine-readable. That would be bonkers unless the organization is in its death throes funding-wise: the personal assets of some of the most powerful members of Congress effectively become invisible!
But it would be even more bonkers to make that policy change without telling anybody, especially the people trusting your data. And as far as I can tell, that’s exactly what they did.
All these problems stem, in my mind, from one single issue that is crippling OpenSecrets. Luckily, there’s a single fix for everything I’ve mentioned.
Solution: Go open source
For OpenSecrets to be viable—or for any organization that wants to do something similar—it’s clear to me that the way forward is to go full open source, at least for the data collection and curation. I can imagine this happening in a GitHub-style collaborative environment very easily—comparable to RStudio’s open source projects.
Not only does open source better fit with CRP’s mission of “champion[ing] transparency”, but going open source will let it actually achieve its goal to “produce and disseminate peerless data […] on money in politics”.
Making do with less funding
Let’s face it: most non-profits are not exactly rolling in cash. The CRP has a limited budget, and I’m guessing a lot of that has to go to other things.
Open source projects have a history of getting things done with much less funding than similar non-open projects—they let people contribute for free! I’ve always thought of open source projects as being fueled by “nerd passion”, but imagine combining that passion with people’s desire to hold their government accountable. We’re talking rocket fuel here!
The use-contribute cycle
No one knows the problems of a system better than the people trying to use it, and they’re also the ones who most want to improve it. It’s in a project’s best interest to reduce the barrier to contributing as much as possible. Imagine if the OpenSecrets project were on GitHub! You’d have pull requests correcting the typos, improving the docs, and standardizing the data in the blink of an eye!4
In fact, people would probably be willing to manually input (and double check) any data that couldn’t be scraped, such as Senator Feinstein’s PFDs, for free. People would also be able to see and propose new sanity checks to be run automatically so that data collection continually grows more robust, making the project more self-sufficient as time goes on!
Real transparency
But perhaps the most important aspect of going open source is that it would make it possible for other people to actually understand and trust the data. A GitHub-like platform would automatically keep track of changes, discussions, and explanations for why things are the way they are, but you could also envision major changes being explained in something like “releases”, similar to the NEWS file in R packages. Claims of partisanship would be hard to make when all decisions about the data are visible and above-board!
Even if the data collection was somehow limited, users should know those limitations—if there were any errors in the data, researchers could track them down and isolate their source, rather than having to treat all the data as dubious.
Conclusion
I hope OpenSecrets.org can improve, and if they aren’t willing to, I think it’s high time for a real open source project to step in and take up their mantle. I’ve learned a lot working with this data, but I ultimately decided that the OpenSecrets data couldn’t fit my needs regardless of the flaws I’ve outlined above: I wanted to compare the finances of those who lost and those who won their races, and the CRP doesn’t store candidates’ finances past the election unless they win, which makes sense given limited space.
I ended up scraping the data myself from the Senate’s website. I wanted to try out pandas, so I made some Python classes that store the report data in pandas DataFrames; it worked pretty well! I’ve saved the information, but I’ve since realized that using Senate race data for my comparisons doesn’t exactly make sense: the incumbency advantage seems strong enough to wipe out any effects of personal finance, and there were only 25 open seats in 2018 anyway. If I get the time, I’ll try to get data from the House, but that requires some .pdf conversion that I’m not really looking forward to…
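For a sense of what those classes looked like, here’s a stripped-down sketch. The assumption that each report section comes through as its own HTML table, the table order, and the “Who” column are all illustrative; my actual classes (and the Senate’s actual pages) are messier:

```python
import pandas as pd

class SenatePFDReport:
    """Minimal sketch: one electronically filed PFD report, with one
    pandas DataFrame per section of the report."""

    def __init__(self, html: str, filer: str, year: int):
        self.filer = filer
        self.year = year
        # read_html returns a DataFrame for every <table> in the page;
        # which table holds which section is an assumption here.
        tables = pd.read_html(html)
        self.assets = tables[0]
        self.income = tables[1]

    def spouse_income(self) -> pd.DataFrame:
        """Rows reporting the spouse's earned income."""
        return self.income[self.income["Who"] == "Spouse"]
```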
Source Code:
This is the code I used to do higher-level cleaning and organizing for some of the OpenSecrets PFD data. You can download the data yourself here for free if you’re not using the data for commercial purposes.
Footnotes:
- It’s actually more complicated than that, but that solves most of the problems. ↩
- It ended up being this whopper: re.sub(r"(?<!,)(?<!^)(?<!\\)(?<!\n)\|(?!(,|$|\n))", "\\|", ...) ↩
- OpenSecrets has Michele Bachmann in 2012 as the only candidate to do so in the years 2012–2018, so I’m assuming being able to do so is a mistake in how they coded the data. ↩
- I actually reported these problems to them via email, hoping to improve the project, but I was met with an auto-reply basically saying “we’re too busy, we might get to it eventually”. Open source management systems like those on GitHub would make handling these requests a breeze, comparatively. ↩