By Luiz Fernando Toledo
A few years ago, I worked on a project for a large Brazilian television channel whose objective was to analyze the profiles of more than 250 guardianship counselors in the city of São Paulo. These elected professionals have the mission of protecting the rights of children and adolescents in Brazil.
Critics had pointed out that some counselors did not have any expertise or prior experience working with young people and were only elected with the support of religious communities. The investigation sought to verify whether these elected counselors had professional training in working with children and adolescents or had any relationships with churches.
After requesting the counselors’ resumes through Brazil’s access to information law, a small team combed through each resume in depth—a laborious and time-consuming task. But today, this project might have required far less time and labor. Rapid developments in generative AI hold potential to significantly scale access and analysis of data needed for investigative journalism.
Many articles address the potential risks of generative AI for journalism and democracy, such as the threats AI poses to journalism's business model and its power to facilitate the creation and spread of mis- and disinformation. No doubt there is cause for concern. But technology will continue to evolve, and it is up to journalists and researchers to understand how to use it in favor of the public interest.
I wanted to test how generative AI can help journalists, especially those who work with public documents and data. I tested several tools, including Ask Your PDF (ask questions of documents on your computer), Chatbase (create your own chatbot), and DocumentCloud (upload documents and ask GPT-like questions of hundreds of them simultaneously).
These tools are built on the same mechanism that powers OpenAI's famous ChatGPT: large language models that generate human-like text. But they analyze the user's own documents rather than information on the internet, which can produce more accurate answers because the model draws on specific, user-provided sources.
In one of these tests, I retraced the 2019 investigation on guardianship counselors. I obtained CVs of newly elected representatives and uploaded the files to DocumentCloud, which is maintained by the MuckRock Foundation, a nonprofit collaborative news site that specializes in sharing and analyzing government documents. Next, I used the GPT 3.5 Turbo add-on, which lets the user ask questions of a set of documents uploaded to the platform and get answers like those from ChatGPT. It's as if a trained team were working for you.
I asked two questions, just like in the 2019 project: Does this person have the experience to act as a guardianship counselor? And, does this person mention any relationship with churches on their resume? In a few minutes, the program returned a spreadsheet with the answers. It worked!
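The workflow above, asking the same questions of every resume and collecting the answers in a spreadsheet, can be sketched in a few lines of Python. This is an illustrative outline, not DocumentCloud's actual implementation: the model call is stubbed out as a placeholder function (`ask_document`), since the real add-on runs the language model inside the platform. The two questions are the ones from the project; the file names are hypothetical.

```python
import csv

# The two questions from the 2019 project.
QUESTIONS = [
    "Does this person have the experience to act as a guardianship counselor?",
    "Does this person mention any relationship with churches on their resume?",
]

def ask_document(document_text: str, question: str) -> str:
    """Placeholder for the language-model call (e.g., the GPT 3.5 Turbo
    add-on). A real implementation would send the document text and the
    question to the model and return its answer; here we simply flag
    the cell for manual review."""
    return "NEEDS REVIEW"

def answers_to_spreadsheet(documents: dict, path: str) -> None:
    """Run every question against every resume and write a CSV with one
    row per document -- the same shape as the spreadsheet the platform
    returned."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["document"] + QUESTIONS)
        for name, text in documents.items():
            writer.writerow([name] + [ask_document(text, q) for q in QUESTIONS])

# Hypothetical input: one resume file and its extracted text.
answers_to_spreadsheet({"counselor_01.pdf": "...resume text..."}, "answers.csv")
```

The point of the sketch is the shape of the task: a fixed question list applied uniformly to a document set, with every answer landing in a table that a reporter can then verify cell by cell.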
I also experimented with other types of AI. I developed a chatbot that answers questions based on public documents provided by the Brazilian government in response to previous requests filed under the access to information law. If someone asks, for example, where to obtain data on gun registrations in Brazil, the bot will provide a link to the official website of the Federal Police. If someone asks whether public servant resumes are considered public information, they will be told yes, and will receive precedents from other people who have previously asked about the topic.
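A minimal sketch of that kind of bot, under stated assumptions: the two question-and-answer pairs below are illustrative stand-ins for the archive of government responses, and the crude word-overlap matching is a stand-in for the semantic search a real chatbot service would use.

```python
# Tiny FAQ "archive" of past freedom-of-information answers (illustrative).
FAQ = {
    "Where can I find data on gun registrations in Brazil?":
        "The Federal Police publishes this data on its official website.",
    "Are public servant resumes considered public information?":
        "Yes. Previous requests under the access to information law were "
        "granted, and those precedents can be cited in a new request.",
}

def answer(question: str) -> str:
    """Return the stored answer whose question shares the most words with
    the user's question -- a crude stand-in for semantic search."""
    words = set(question.lower().split())
    best = max(FAQ, key=lambda q: len(words & set(q.lower().split())))
    return FAQ[best]
```

Scaled up to thousands of real government responses, the same retrieve-then-answer pattern is what lets the bot point a requester to the Federal Police website or to precedents on public resumes.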
In both cases, the technology produced errors that had to be corrected manually. This is a problem, especially since journalism must be accurate. Yet, this is similar to the verification required in traditional journalism. Any data analysis task, whether manual or automated, needs to undergo thorough fact-checking. Using artificial intelligence doesn’t change this fundamental principle of investigative journalism.
To support investigative, data-driven journalism in the age of AI, journalists around the world need access to generative AI tools and training on how to use them ethically. These new technologies are powerful tools for investigative reporting, but to make effective use of them, journalists need information on the types of tools available, training on how to use them and fact-check their outputs, and financial resources to purchase tools that aren’t freely available.
Try it yourself! Below are some generative AI platforms you can use in your reporting.
DocumentCloud: DocumentCloud is a platform for managing primary source documents that can summarize documents, highlight key passages, and answer questions about them. There are free and premium versions, and the platform requires no prior coding experience.
Google Pinpoint: This free platform uses AI to identify locations, people, and organizations in a document and count how many times they appear. It can also identify patterns in a PDF and turn them into a spreadsheet.
Aleph (OCCRP): Aleph is a useful starting point for any investigation, especially those involving cross-border crimes. Aleph is a platform that allows users to check mentions of a person in millions of records from official documents, leaks, public records, and other sources. Aleph uses AI in many different ways to categorize and read documents from different sources and can even combine records related to the same person.
Ask Your PDF: This tool allows users to upload a document and ask it questions in a conversational interface.
GPTs: The most famous AI tool, ChatGPT, now offers the option to create and train your own chatbot with a paid premium account. No coding experience required.
Chatbase: Chatbase allows users to create a chatbot for their own website and requires no coding experience.
Luiz Fernando Toledo is an investigative journalist specializing in data and public documents, with more than a decade of experience in some of Brazil's largest media outlets (CNN, Estadão, TV Globo, Revista Piauí, UOL). He is a former Reagan-Fascell fellow at the National Endowment for Democracy (NED), director of the Brazilian Association of Investigative Journalism (Abraji), and founder of the DataFixers.org project, a data analysis consultancy. He holds a master's degree in data journalism from Columbia University and one in public administration from Fundação Getulio Vargas (EAESP-FGV).