Feb. 6, 2013
“What is a datafest?” I was asked.
“We’re going to be taking data and creating projects that look at the influence of money in politics. They’re also calling it a hackathon,” I said.
“You’re a computer guy?”
Photo by Teresa Bouza
A team of journalists and programmers won "Best of Show" for combining federal financial data and music.
I was struggling to explain to a friend what I would be doing at Columbia University last weekend — although I didn’t know what I would be doing. The Bicoastal Datafest, put together by a coalition of Google and data journalism groups, including the Sunlight Foundation, and underwritten with a grant from the MacArthur Foundation, brought together a wide range of professionals, including journalists, coders and Ph.D. students in mathematics and physics.
While I didn’t always understand what was going on, my nervousness was eventually overtaken by the idea that this is all pretty cool. Teams of varying sizes, at Columbia on the East Coast and Stanford on the West Coast, sought to use data to examine the topic of money in politics. Teams looked at congressional attendance, political donations by owners of immigration detention centers and even world leaders’ salaries compared to the GDP of their respective countries. Prize money was at stake, with a deadline of 3 p.m. Sunday to make it all work.
Our own team set out to find whether fraud in county financial reports could be detected using Benford’s Law, also called the “first-digit rule.” The rule holds that in many naturally occurring sets of numbers, smaller digits appear in the first position more often than larger ones: about 30 percent of values begin with a 1, while fewer than 5 percent begin with a 9. When potential fraudsters “make up” numbers on balance sheets, their attempts to look random tend to break that pattern. The rule is used in financial fraud investigations for businesses, but no programming model for scouring county-level data using the rule existed. I wondered how helpful I could be as the only one without any experience in finance or programming. But I had no shortage of work searching for states that had their financial reports available online in an easy-to-use format. More than 20 hours of work later, we had 25 years of county financial data from California, Iowa and North Carolina and started coding in a programming language known as R to scan the data for adherence to Benford’s Law.
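For readers curious what a first-digit check can look like, here is a minimal sketch. It is written in Python rather than R, and it is not our team’s code: the function names are hypothetical, and the chi-square comparison is one common way to measure the gap between observed and expected digit counts, assumed here for illustration.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5 percent level.
# A statistic above this suggests the digits stray from Benford's Law.
CHI_SQUARE_CRITICAL_5PCT = 15.507


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the first nonzero digit of a number, or None if there is none."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None


def benford_chi_square(values):
    """Chi-square statistic comparing leading digits with Benford's Law."""
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to test")
    observed = Counter(digits)
    statistic = 0.0
    for digit, share in benford_expected().items():
        expected = n * share
        statistic += (observed[digit] - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Illustrative inputs only, not real county data.
    powers_of_two = [2 ** k for k in range(1000)]   # known to track Benford closely
    every_three_digit_number = range(100, 1000)     # each leading digit equally common
    for label, data in [("powers of two", powers_of_two),
                        ("evenly spread", every_three_digit_number)]:
        stat = benford_chi_square(data)
        flag = "irregular" if stat > CHI_SQUARE_CRITICAL_5PCT else "consistent"
        print(f"{label}: chi-square = {stat:.2f} ({flag})")
```

A high statistic is only a flag for a closer look. Like our own project, a check of this kind calls attention to irregularities but does not prove fraud.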
Another team married house music and Treasury Department financial data, two things not typically thought of as going together. The project won Best of Show for both coasts and Best Innovation in New York. The “FMS Symphony” took the U.S. Treasury Department’s daily reports of spending and borrowing, scraped and parsed them from difficult-to-read text files and created code to make all of this easily readable. The information is also communicated through music, with different chords in the “song” representing different data points.
“We liberated the data for the people,” said one team member as he accepted their award.
My own team, “Team B” (Team A was already taken), won an award as well: Best Potential. While our still-unnamed code does not detect actual fraud, it does call attention to irregularities in county spending and could be used for further exploration by people like me just looking for a story.
“I have seen the future of journalism and it’s big data,” said Columbia keynote speaker Steve Engelberg, editor-in-chief of ProPublica.
This weekend turned out to be an encouraging one for an industry often referred to as “in decline.” While journalism as we know it may change and require different skills, the people in our potential audiences will always want to know. And the possibilities of what we can find out may be boundless with the help and collaboration of colleagues across disciplines.
Greg Jarmuska of CensusReporter.org conducts a workshop via Google Hangout. Video posted by Kathy Kiely of the Sunlight Foundation.