Utilizing Web Scraper Technology for Bibliography Building
The project “Smart Harvesting II” develops software-based solutions for collecting and processing bibliographic data from the web. In many institutions this work is still done manually, which makes it labour-intensive and time-consuming. Where technical support is already in place, it usually takes the form of specialized programs known as wrappers, which have to be created and maintained by expert software developers. Our project therefore focuses on low-maintenance wrappers that can be operated by users without a computer science background and continuously adapted to new website structures. To this end we rely on the open source solution OXPath, an extension of XPath that allows interaction with a website to be imitated declaratively and data to be extracted in a targeted way in that context. We are convinced that OXPath makes it possible to involve archivists and librarians in creating and maintaining wrappers, since they are often already familiar with the basics of XML and XPath.
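To make the idea of a wrapper more concrete, the following minimal sketch shows how plain XPath-style selectors can pull bibliographic fields out of a page. It is not OXPath and not the project's code: it uses Python's standard library `xml.etree.ElementTree`, which supports only a subset of XPath, and it leaves out the simulated interaction (clicking, form filling) that OXPath adds on top. The sample page, its class names, and the function `extract_records` are hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment standing in for a publisher's
# table-of-contents page. Real pages are rarely this clean.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="article">
      <h3 class="title">First Example Paper</h3>
      <span class="author">A. Author</span>
      <span class="author">B. Author</span>
      <span class="pages">1-10</span>
    </div>
    <div class="article">
      <h3 class="title">Second Example Paper</h3>
      <span class="author">C. Author</span>
      <span class="pages">11-20</span>
    </div>
  </body>
</html>
"""


def extract_records(page_source):
    """Return one dict per article block found in the page (illustrative only)."""
    root = ET.fromstring(page_source.strip())
    records = []
    # Select every article container, then read its fields with relative paths.
    for article in root.findall(".//div[@class='article']"):
        records.append({
            "title": article.findtext("h3[@class='title']", default="").strip(),
            "authors": [
                a.text.strip()
                for a in article.findall("span[@class='author']")
                if a.text
            ],
            "pages": article.findtext("span[@class='pages']", default="").strip(),
        })
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE):
        print(record)
```

The selectors are the only part that depends on the website's structure, so when a site changes its layout only those expressions need to be adapted. This is the kind of maintenance task that someone familiar with XML and XPath could plausibly take over.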
GESIS - Leibniz Institute for the Social Sciences
Dr. Michael Ley (dblp)
Prof. Dr. Brigitte Mathiak (GESIS)
Christopher Michels, M.A. (dblp)
Mandy Neumann, M.A. (TH Köln)
Prof. Dr. Philipp Schaer (TH Köln)
Further Reading
If you would like more background on the Smart Harvesting II project and its progress, the following resources provide additional information.
General
Code and Datasets
Publications
In: Code4Lib Journal, 38, 2017.
Mandy Neumann, Jan Steinberg and Philipp Schaer.
In: G. J. F. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato and N. Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings, volume 10456 of Lecture Notes in Computer Science. 2017.
Philipp Schaer and Mandy Neumann.