Data journalism is a new journalistic discipline that focuses on data-driven research and presentation formats. A fundamental problem of data journalism, and of classical journalism as well, is that much of the data of journalistic interest is available only in unstructured form: as texts, tables and graphics in documents of various types (Word, PDF, e-mail, etc.) or on websites.
The project Journalistic Information Extraction (JoIE) aims to address the problem of extracting information from unstructured sources that are relevant for (data) journalism. Building on the two state-of-the-art tools Workbench and Fonduer, the project will develop a solution that can handle the data sources mentioned above and make them usable for journalism by converting them into a structured and thus analyzable form. Workbench is a web-based platform for preparing and analyzing data that supports, among other things, the extraction of web data. Fonduer is a toolkit that uses current methods of artificial intelligence to automatically learn extraction patterns, e.g. for recognizing tables. Both applicants, the Science Media Center (SMC) and the working group led by Professor Schaer at TH Köln - University of Applied Sciences, have already worked together successfully in the field of information extraction and bring the corresponding experience and expertise.
Within a 36-month research and development phase, a software system will be developed in JoIE as part of a PhD project. The system will integrate the two components Workbench and Fonduer and will have an interface based on the principles of Learnable Programming. This software system is intended to address the problem of information extraction, which is a central challenge for (data) journalism. The requirements of data journalists, derived from a requirements engineering phase, will serve as the basis for the system design, which will be examined in corresponding user tests and evaluations. The work combines theoretical approaches from Human-Computer Interaction (HCI) with practical implementations derived from real-world applications and thus addresses a research desideratum.
The project was announced in a press release at idw-online (in German).
Dr. Meik Bittkowski (Science Media Center)
Björn Engelmann, M.Sc. (Technische Hochschule Köln)
External Resources
Publications
2024
In: ECIR 2024. 2024.
Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer and Norbert Fuhr.
In: L.-W. Ku, A. Martins and V. Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 11968-11989. Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024.
Christin Kreutz, Fabian Haak, Björn Engelmann and Philipp Schaer.
SIGIR Forum, 58(1):1-11, 2024.
Sean MacAvaney, Adam Roegiest, Aldo Lipani, Andrew Parry, Björn Engelmann, Christin Katharina Kreutz, Chuan Meng, Erlend Frayling, Eugene Yang, Ferdinand Schlatt, Guglielmo Faggioli, Harrisen Scells, Iana Atanassova, Jana Friese, Janek Bevendorff, Javier Sanz-Cruzado, Johanne Trippas, Kanaad Pathak, Kaustubh D. Dhole, Leif Azzopardi, Maik Fröbe, Marc Bertin, Nishchal Prasad, Saber Zerhoudi, Shuai Wang, Shubham Chatterjee, Thomas Jänich, Udo Kruschwitz, Xi Wang and Zijun Long.
2023
In: B. König-Ries, S. Scherzinger, W. Lehner and G. Vossen, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 06.-10. März 2023, Dresden, Germany, Proceedings, volume P-331, series LNI, pages 1009-1021. Gesellschaft für Informatik e.V., 2023.
Björn Engelmann and Philipp Schaer.
In: SimpleText@CLEF-2023, volume abs/2307.03569, series CEUR Workshop Proceedings. 2023.
Björn Engelmann, Fabian Haak, Christin Katharina Kreutz, Narjes Nikzad-Khasmakhi and Philipp Schaer.
In: B. König-Ries, S. Scherzinger, W. Lehner and G. Vossen, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 06.-10. März 2023, Dresden, Germany, Proceedings, volume P-331, series LNI, pages 1049-1057. Gesellschaft für Informatik e.V., 2023.
Jüri Keller, Meik Bittkowski and Philipp Schaer.