On May 23-24 the Aspen Institute’s Program on Philanthropy and Social Innovation (PSI) hosted the first-of-its-kind Form 990 Datathon. The Datathon, a unique collaboration between nonprofit researchers and data scientists, brought together individuals from the Aspen Institute, Charity Navigator, GuideStar, Urban Institute, Syracuse University, Johns Hopkins University, George Washington University, and American University. The Datathon’s goal was to begin the labor-intensive process of cleaning and converting Form 990 data into more accessible, public spreadsheets.
The Datathon, led by David Borenstein (Charity Navigator) and Jesse Lecy (Syracuse University), took place close to the one-year anniversary of the IRS’s historic release of millions of electronically filed nonprofit tax forms on Amazon Web Services (AWS). The Form 990, a tax form that most tax-exempt organizations are required to file, is a key tool in promoting transparency and accountability in the nonprofit sector: it discloses vital data on the expenses, activities, and operations of tax-exempt organizations. PSI’s Nonprofit Data Project has been leading the effort to democratize information on the nonprofit sector in the United States. Its landmark 2013 Form 990 report and subsequent education and policy activities, along with a key lawsuit by open data expert Carl Malamud, have helped make this data available.
While the June 2016 release of the Form 990 e-filed data by the IRS was a huge step toward free and open data on the nonprofit sector, researchers, scholars, journalists, and watchdog organizations have encountered difficulties working with the complex dataset, which is published as XML. Complexities include dozens of versions of the forms (the 990 itself comes in multiple versions), seventeen Schedules, and many IRS revisions over time. As a result, all of the versions need to be reconciled and mapped onto a Master Concordance File for the data to be meaningful.
Interested organizations and scholars were working independently on this problem, but through regular conversations among data users (now coordinated by the Aspen Institute), it became apparent that it would be beneficial to generate a unified set of standards for how the 990 e-filed data should be processed and documented. Thus the Nonprofit Open Data Collective was born: a loose collaboration among leading Form 990 players like Charity Navigator, GuideStar, Urban Institute, and Aspen Institute, along with nonprofit scholars and independent professionals.
The Datathon was an opportunity for this team to work together in person for the first time. Datathon participants quickly found that proximity was their ally: many of their individual frustrations with the data were shared by others. Participants traded tips and knowledge about the dataset, reinforcing the collaborative nature of the event.
Because the 990 data released by the IRS on AWS lacks a complete explanation of variable names and organization, thorough documentation is a necessary first step to making the data usable. During the two-day Datathon, participants processed 2,900 XML paths to define 573 variables from the 990 and 990-EZ forms.
A Master Concordance File, a combined XML path directory and data dictionary, will soon be released on the Nonprofit Open Data Collective GitHub account, followed by open datasets. The Concordance provides the means to translate between different versions of the forms and to combine 990 and 990-EZ data. It will also provide common standards for normalizing variable definitions when creating spreadsheets and searchable databases for public use from the IRS datasets. By working from a common Concordance file on GitHub, we will be able to keep improving the data by contributing bug fixes and adding code.
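To make the idea concrete, here is a minimal sketch of how a concordance can map several version-specific XML paths onto one variable name. It is an illustration only, not the Collective's actual file or code: the paths, the variable name `total_revenue`, the sample filings, and the `extract_variables` helper are all hypothetical, and the sketch ignores the XML namespaces that real filings carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical concordance rows: (form type, XML path, variable name).
# Several paths can point at the same variable, one per form version.
CONCORDANCE = [
    ("990", "ReturnData/IRS990/TotalRevenue", "total_revenue"),
    ("990", "ReturnData/IRS990/TotalRevenueAmt", "total_revenue"),
    ("990EZ", "ReturnData/IRS990EZ/TotalRevenueAmt", "total_revenue"),
]

def extract_variables(xml_text, form_type, concordance=CONCORDANCE):
    """Return {variable: value} for one filing, using the first matching path."""
    root = ET.fromstring(xml_text)
    record = {}
    for row_form, path, variable in concordance:
        if row_form != form_type or variable in record:
            continue
        node = root.find(path)  # path is relative to the root element
        if node is not None and node.text:
            record[variable] = node.text.strip()
    return record

# Made-up filings: the same fact stored under different paths.
OLD_990 = ("<Return><ReturnData><IRS990><TotalRevenue>1000</TotalRevenue>"
           "</IRS990></ReturnData></Return>")
NEW_990 = ("<Return><ReturnData><IRS990><TotalRevenueAmt>2500</TotalRevenueAmt>"
           "</IRS990></ReturnData></Return>")
EZ_990 = ("<Return><ReturnData><IRS990EZ><TotalRevenueAmt>300</TotalRevenueAmt>"
          "</IRS990EZ></ReturnData></Return>")

rows = [
    extract_variables(OLD_990, "990"),
    extract_variables(NEW_990, "990"),
    extract_variables(EZ_990, "990EZ"),
]
print(rows)
```

Each filing yields a record keyed by the same variable name, so the three records can be stacked into one spreadsheet column even though the underlying XML differs.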
The group has committed to continue collaborating to finish organizing the remaining 990 Schedules. Next steps include posting a public Master Concordance File on GitHub once it is ready for release, distributing open datasets and appropriate documentation generated with the Concordance, and expanding the Concordance to include Schedules. In addition, the group is considering a hackathon to explore creative uses of the data and help define problems with the dataset. Before the hackathon, the group would engage data users to create a list of issues and priorities.
By hosting the Datathon, the Nonprofit Data Project reaffirmed its commitment to facilitating the public use of open data on the nonprofit sector.