Smart Harvesting II

Profile & Description

Utilizing Web Scraper Technology for Bibliography Building

The project “Smart Harvesting II” develops software-based solutions for collecting and processing bibliographic data from the web. Many institutions still do this work manually, which makes it labour-intensive and time-consuming. Where technical support is already in place, it relies on specialized programs known as wrappers, which expert software developers have to create and maintain. Our project therefore focuses on developing low-maintenance wrappers that staff without a computer science background can operate easily and adapt continuously to new website structures. For this we rely on the open source solution OXPath, an extension of XPath that declaratively imitates user interaction with a website and extracts data in a targeted way in the process. We are convinced that OXPath makes it possible to involve archivists and librarians in creating and maintaining wrappers, since they are often already familiar with the basics of XML and XPath.

Funding Agency

DFG - Deutsche Forschungsgemeinschaft

Partner Institutions

dblp - Computer Science Bibliography @ University of Trier
GESIS - Leibniz Institute for the Social Sciences

People Involved

Prof. Dr. Philipp Schaer (Technische Hochschule Köln)
Mandy Neumann (Technische Hochschule Köln)

Project

Smart Harvesting II

Duration

2016 - 2019

External Resources

GitHub Organization

Fork us on GitHub.

https://github.com/smart-harvesting

Project Website

Visit the project's home.

https://smart-harvesting.github.io

Publications

2020

Conference Indexing in Digital Libraries: A Ranking Model and Case Study on dblp.
In: Proceedings of the 10th International Workshop on Bibliometric-enhanced Information Retrieval co-located with 42nd European Conference on Information Retrieval, BIR@ECIR 2020, Lisbon, Portugal, April 14th, 2020 [online only], pages 30-41. 2020.
Christopher Michels, Mandy Neumann, Philipp Schaer and Ralf Schenkel.

2019

Information Extraction for Semi-structured Email Corpora.
In: R. Jäschke and M. Weidlich, editors, LWDA, volume 2454, series CEUR Workshop Proceedings, pages 322-330. CEUR-WS.org, 2019.
Hendrik Adam and Philipp Schaer.

2018

Introduction to OXPath.
2018. arXiv:1806.10899. 63 pages.
Ruslan R. Fayzrakhmanov, Christopher Michels and Mandy Neumann.

Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp.
In: JCDL '18 Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pages 45-48. ACM, New York, NY, USA, 2018.
Mandy Neumann, Christopher Michels, Philipp Schaer and Ralf Schenkel.

2017

Web-Scraping for Non-Programmers: Introducing OXPath for Digital Library Metadata Harvesting.
Code4Lib Journal, 38, 2017.
Mandy Neumann, Jan Steinberg and Philipp Schaer.

Enriching Existing Test Collections with OXPath.
In: G. J. F. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato and N. Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings, volume 10456, series Lecture Notes in Computer Science. 2017.
Philipp Schaer and Mandy Neumann.

Information Retrieval Research Group

Technische Hochschule Köln