JoIE - Journalistic Information Extraction

Profile & Description

Data journalism is a new journalistic discipline that focuses on data-driven research and presentation formats. A fundamental problem of data journalism, and of classical journalism as well, is that much of the data of journalistic interest is available only in unstructured form: as texts, tables and graphics in documents of various types (Word, PDF, e-mail, etc.) or on websites.

The Journalistic Information Extraction (JoIE) project addresses the problem of extracting information from unstructured sources that are relevant to (data) journalism. Building on two state-of-the-art tools, Workbench and Fonduer, the project will develop a solution that handles the data sources mentioned above and makes them usable for journalism by converting them into a structured, and thus analyzable, form. Workbench is a web-based platform for preparing and analyzing data that supports, among other things, the extraction of web data. Fonduer is a toolkit that uses current artificial intelligence methods to automatically learn extraction patterns, e.g. for recognizing tables. Both applicants, the Science Media Center (SMC) and the working group of Professor Schaer at TH Köln - University of Applied Sciences, have already worked together successfully in the field of information extraction and bring the corresponding experience and expertise.

Over a 36-month research and development phase, a software system will be developed in JoIE as part of a PhD project. The system will integrate the two components Workbench and Fonduer and will provide an interface based on the principles of Learnable Programming. This software system is intended to address the problem of information extraction, which is a pressing one for (data) journalism. The requirements of data journalists, derived from a requirements engineering phase, will serve as the basis for the system design, which will be examined in corresponding user tests and evaluations. The work combines theoretical approaches from Human-Computer Interaction (HCI) with practical implementations derived from real-world applications and thus addresses a research desideratum.

The project was announced in a press release at idw-online (in German).

Funding Agency
Klaus Tschira Stiftung
Partner Institution
Science Media Center
People Involved
Prof. Dr. Philipp Schaer (Technische Hochschule Köln)
Dr. Meik Bittkowski (Science Media Center)
Björn Engelmann, M.Sc. (Technische Hochschule Köln)

Duration
2020 - 2023

Publications

2024

Context-Driven Interactive Query Simulations Based on Generative Large Language Models.
In: ECIR 2024. 2024.
Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer and Norbert Fuhr.
[pdf]  [BibTeX] 
BATS: BenchmArking Text Simplicity.
In: L.-W. Ku, A. Martins and V. Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 11968-11989. Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024.
Christin Kreutz, Fabian Haak, Björn Engelmann and Philipp Schaer.
[doi]  [abstract]  [BibTeX] 
Report on the Collab-a-Thon at ECIR 2024.
SIGIR Forum, 58(1):1-11, 2024.
Sean MacAvaney, Adam Roegiest, Aldo Lipani, Andrew Parry, Björn Engelmann, Christin Katharina Kreutz, Chuan Meng, Erlend Frayling, Eugene Yang, Ferdinand Schlatt, Guglielmo Faggioli, Harrisen Scells, Iana Atanassova, Jana Friese, Janek Bevendorff, Javier Sanz-Cruzado, Johanne Trippas, Kanaad Pathak, Kaustubh D. Dhole, Leif Azzopardi, Maik Fröbe, Marc Bertin, Nishchal Prasad, Saber Zerhoudi, Shuai Wang, Shubham Chatterjee, Thomas Jänich, Udo Kruschwitz, Xi Wang and Zijun Long.
[doi] [pdf]  [BibTeX] 

2023

Reliable Rules for Relation Extraction in a Multimodal Setting.
In: B. König-Ries, S. Scherzinger, W. Lehner and G. Vossen, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 06.-10. März 2023, Dresden, Germany, Proceedings, volume P-331, series LNI, pages 1009-1021. Gesellschaft für Informatik e.V., 2023.
Björn Engelmann and Philipp Schaer.
[doi] [pdf]  [BibTeX] 
Text Simplification of Scientific Texts for Non-Expert Readers.
In: SimpleText@CLEF-2023, volume abs/2307.03569, series CEUR Workshop Proceedings. 2023.
Björn Engelmann, Fabian Haak, Christin Katharina Kreutz, Narjes Nikzad-Khasmakhi and Philipp Schaer.
[doi] [pdf]  [BibTeX] 
Automated Statement Extraction from Press Briefings.
In: B. König-Ries, S. Scherzinger, W. Lehner and G. Vossen, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 06.-10. März 2023, Dresden, Germany, Proceedings, volume P-331, series LNI, pages 1049-1057. Gesellschaft für Informatik e.V., 2023.
Jüri Keller, Meik Bittkowski and Philipp Schaer.
[doi] [pdf]  [BibTeX] 

2021

IRCologne at TREC 2021 News Track - Relation-based re-ranking for background linking.
In: TREC. National Institute of Standards and Technology (NIST), 2021.
Björn Engelmann and Philipp Schaer.
[pdf]  [BibTeX]