Publications
Conference Indexing in Digital Libraries: A Ranking Model and Case Study on dblp.
In:
Proceedings of the 10th International Workshop on Bibliometric-enhanced Information Retrieval co-located with 42nd European Conference on Information Retrieval, BIR@ECIR 2020, Lisbon, Portugal, April 14th, 2020 [online only], pages 30-41.
2020.
Christopher Michels, Mandy Neumann, Philipp Schaer and Ralf Schenkel.
Information Extraction for Semi-structured Email Corpora.
In: R. Jäschke and M. Weidlich, editors,
LWDA, volume 2454, series CEUR Workshop Proceedings, pages 322-330.
CEUR-WS.org, 2019.
Hendrik Adam and Philipp Schaer.
Introduction to OXPath.
2018. arXiv:1806.10899. Comment: 63 pages.
Ruslan R. Fayzrakhmanov, Christopher Michels and Mandy Neumann.
Contemporary web pages with increasingly sophisticated interfaces rival
traditional desktop applications for interface complexity and are often called
web applications or RIA (Rich Internet Applications). They often require the
execution of JavaScript in a web browser and can call AJAX requests to
dynamically generate the content, reacting to user interaction. From the
automatic data acquisition point of view, thus, it is essential to be able to
correctly render web pages and mimic user actions to obtain relevant data from
the web page content. Briefly, to obtain data through existing Web interfaces
and transform it into structured form, contemporary wrappers should be able to:
1) interact with sophisticated interfaces of web applications; 2) precisely
acquire relevant data; 3) scale with the number of crawled web pages or states
of web application; 4) have an embeddable programming API for integration with
existing web technologies. OXPath is a state-of-the-art technology, which is
compliant with these requirements and demonstrated its efficiency in
comprehensive experiments. OXPath integrates Firefox for correct rendering of
web pages and extends XPath 1.0 for the DOM node selection, interaction, and
extraction. It provides means for converting extracted data into different
formats, such as XML, JSON, CSV, and saving data into relational databases.
This tutorial explains main features of the OXPath language and the setup of
a suitable working environment. The guidelines for using OXPath are provided in
the form of prototypical examples.
Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp.
In:
JCDL '18 Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pages 45-48.
ACM, New York, NY, USA, 2018.
Mandy Neumann, Christopher Michels, Philipp Schaer and Ralf Schenkel.
Maintaining literature databases and online bibliographies is a core responsibility of metadata aggregators such as digital libraries. In the process of monitoring all the available data sources the question arises which data source should be prioritized. Based on a broad definition of information quality we are looking for different ways to find the best fitting and most promising conference candidates to harvest next. We evaluate different conference ranking features by using a pseudo-relevance assessment and a component-based evaluation of our approach.
Web-Scraping for Non-Programmers: Introducing OXPath for Digital Library Metadata Harvesting.
Code4Lib Journal, 38, 2017.
Mandy Neumann, Jan Steinberg and Philipp Schaer.
Building up new collections for digital libraries is a demanding task. Available data sets have to be extracted which is usually done with the help of software developers as it involves custom data handlers or conversion scripts. In cases where the desired data is only available on the data provider’s website custom web scrapers are needed. This may be the case for small to medium-size publishers, research institutes or funding agencies. As data curation is a typical task that is done by people with a library and information science background, these people are usually proficient with XML technologies but are not full-stack programmers. Therefore we would like to present a web scraping tool that does not demand the digital library curators to program custom web scrapers from scratch. We present the open-source tool OXPath, an extension of XPath, that allows the user to define data to be extracted from websites in a declarative way. By taking one of our own use cases as an example, we guide you in more detail through the process of creating an OXPath wrapper for metadata harvesting. We also point out some practical things to consider when creating a web scraper (with OXPath). On top of that, we also present a syntax highlighting plugin for the popular text editor Atom that we developed to further support OXPath users and to simplify the authoring process.
Enriching Existing Test Collections with OXPath.
In: G. J. F. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato and N. Ferro, editors,
Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings, volume 10456, series Lecture Notes in Computer Science.
2017.
Philipp Schaer and Mandy Neumann.
Extending TREC-style test collections by incorporating external resources is
a time consuming and challenging task. Making use of freely available web data
requires technical skills to work with APIs or to create a web scraping program
specifically tailored to the task at hand. We present a light-weight
alternative that employs the web data extraction language OXPath to harvest
data to be added to an existing test collection from web resources. We
demonstrate this by creating an extended version of GIRT4 called GIRT4-XT with
additional metadata fields harvested via OXPath from the social sciences portal
Sowiport. This allows the re-use of this collection for other evaluation
purposes like bibliometrics-enhanced retrieval. The demonstrated method can be
applied to a variety of similar scenarios and is not limited to extending
existing collections but can also be used to create completely new ones with
little effort.