FleXPath for HTML

This thesis applies an existing approach, which improves retrieval results for searches on XML data by combining database-style querying (such as XPath and XQuery) with IR-style querying, to web information extraction from HTML data. To this end, a framework was developed that reuses techniques from the original approach and adds methods suited to HTML data and web information extraction processes. The framework contains modules that use XPath expressions as templates for DOM tree queries and evaluate the text they find against a user-defined rule set describing its structural features. It can therefore support existing web information extraction frameworks that rely on generated XPath-based wrappers. The framework is evaluated on a sample set of heterogeneous HTML source files by simulating a web extraction process.
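The idea of treating an XPath expression as a template rather than a strict query can be illustrated with a minimal sketch. It is not the framework's actual code: the function names (`relaxations`, `score`, `extract`), the relaxation strategy (dropping leading path steps), the scoring formula, and the threshold are all assumptions made for illustration. It also assumes well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it; real-world HTML would need a lenient parser.

```python
import xml.etree.ElementTree as ET


def relaxations(steps):
    """Yield progressively looser ElementTree paths for a list of tag steps."""
    # Strict form: every step must be a direct child of the previous one.
    yield "./" + "/".join(steps)
    # Relaxed forms (illustrative assumption): drop leading steps and look
    # for the remaining steps anywhere below the root.
    for i in range(len(steps)):
        yield ".//" + "/".join(steps[i:])


def score(text, rules):
    """Return the fraction of user-defined rules that the text satisfies."""
    return sum(1 for rule in rules if rule(text)) / len(rules)


def extract(root, steps, rules, threshold=0.5):
    """Return the first path that yields acceptable texts, with their scores."""
    for path in relaxations(steps):
        hits = []
        for element in root.findall(path):
            text = "".join(element.itertext()).strip()
            s = score(text, rules)
            if s >= threshold:
                hits.append((text, s))
        if hits:
            return path, hits
    return None, []


# Hypothetical page: the wrapper expected body/div/p, but the page layout
# changed and the paragraph now sits inside an extra <section>.
DOC = (
    "<html><body>"
    "<div><section><p>Price: 42 EUR</p></section></div>"
    "<p>x</p>"
    "</body></html>"
)

# Hypothetical structural rules for the text being looked for.
RULES = [
    lambda t: len(t) >= 5,                  # long enough to be content
    lambda t: any(c.isdigit() for c in t),  # contains a number
]

path, hits = extract(ET.fromstring(DOC), ["body", "div", "p"], RULES)
print(path, hits)
```

In this sketch the strict path finds nothing, so the query is loosened step by step until candidates appear, and the rule set then filters out candidates whose text does not look like the expected content.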

Thesis

Degree

B.Sc.

Author

Hendrik Adam

Supervisors

Philipp Schaer, Mandy Neumann

Information Retrieval Research Group


Technische Hochschule Köln
