Using Project Documents to Simplify Location Data Publication
Location data matters. Despite being one of the least published pieces of data on the International Aid Transparency Initiative (IATI) Standard, it is consistently highlighted as one of the most important. Without information on where donors are spending their money, aid practitioners are less able to avoid project duplication or identify gaps in funding. As part of the recent Tool Accelerator Workshop, organized by the Initiative for Open Ag Funding, a team of staff from GlobalGiving, Development Gateway and Publish What You Fund worked together to see if we could enhance and simplify the publication of this vital, yet lacking, data.
The idea was simple: identify location names based on unstructured project documents and turn them into geocodes so that they could, for example, be plotted on a map. Like every data science project, our first step was to find some reliable input data upon which to build. We found a treasure trove of potential sources embedded in the document links included in IATI activity files. Within these, evaluation documents seemed to be the best potential source of location data, so we created a library of these files to quickly sample from.
Documents in hand, we now had to try to extract the relevant location names and other useful information. Unfortunately, the types of documents that organisations attach to IATI activities are extremely diverse and come with a wide variety of structures and file formats. To begin to make sense of them, we first converted them into raw, uniform text. We then used a technique called named entity recognition to distill the messy raw text into the clean lists of organisations and locations hidden inside.
Thanks to Python’s natural language processing tools, we were able to begin iterating quickly on our corpus of documents and quickly discovered what data points could be easily extracted. The excerpts of our code on Github describe this process in more detail. In the end, our result looked something like this (emphasis added):
We now had a list of sub-national locations; some relevant, some not. Our next step was to convert these names into geocodes. Much of our exploration at this stage focused on whether these results could be integrated with Development Gateway’s Open Aid Data Geocoder, which can assign precise geospatial data based on the location names in accordance with the IATI schema.
Several practical hurdles remain before an approach like this could be used in production, but by the end of this short exercise we had discovered a simple way of automating IATI data creation by scraping text from project documents, extracting location data and then converting these results into geocodes.
If we are to take this idea further, our next step would be to identify the earliest instance in which subnational locations appear in available project documentation. Since evaluation documents are only available at the end of a project’s lifecycle, they are not an ideal choice for completing forward-looking IATI data, although they could be used to enhance data already on the IATI registry. We will also need to test ways to implement this approach at scale before integrating it into donor publication processes.
In any case, what we have learned so far is that location data is central to improved coordination and, with this automated extraction and geolocation process, it is well worth the effort to explore how we can improve its publication and availability.