Data journalism is a new journalistic discipline that focuses on data-driven research and presentation formats. A fundamental problem of data journalism, and of classical journalism as well, is that much of the data of journalistic interest is available only in unstructured form: as texts, tables and graphics in documents of various types (Word, PDF, e-mail, etc.) or on websites.
The project Journalistic Information Extraction (JoIE) aims to address the problem of extracting information from unstructured sources that are relevant for (data) journalism. Building on the two state-of-the-art tools Workbench and Fonduer, a solution will be developed that can handle the above-mentioned data sources and make them usable for journalism by converting them into a structured and thus analyzable form. Workbench is a web-based platform for the preparation and analysis of data, which allows, among other things, the extraction of web data. Fonduer is a toolkit that uses current methods of artificial intelligence to automatically learn extraction patterns, e.g. for the recognition of tables. Both applicants, the Science Media Center (SMC) and the working group around Professor Schaer at TH Köln - University of Applied Sciences, have already worked together successfully in the field of information extraction and have the corresponding experience and expertise.
Within a 36-month research and development phase, a software system will be developed in JoIE as part of a PhD project. The system will integrate the two components Workbench and Fonduer and will have an interface based on the principles of Learnable Programming. With the help of this software system, the problem of information extraction, which is central to (data) journalism, will be addressed. The requirements of data journalists, derived from a requirements engineering phase, will serve as the basis for the system design, which will be examined in corresponding user tests and evaluations. The work combines theoretical approaches from Human-Computer Interaction (HCI) with practical implementations derived from real-world applications and thus addresses a research gap.