Locating faulty data in an harvested database - Extending a Metadata language with support for semantic rules to find erroneous data in a vast and incomplete database

Per Gundberg ; Joel Steen Timle
Göteborg: Chalmers tekniska högskola, 2012. 43 pp.
[Degree project, advanced level]

This thesis deals with the task of finding erroneous entries in a large database whose content has been collected automatically by scanning different sources on the World Wide Web. The information is divided into events, which are organized into event classes.

As part of the thesis work, a language for describing semantic and structural rules on the information has been designed as an extension to the existing Metadata language of the database. A set of rules describing these extended requirements has been written in this language. A tool that tests the information in the database against rules written in the extended language has also been implemented. The evaluation reports not only whether an entry violates a rule, but also which part of the entry violates it. This information is stored in a database for further analysis and use. Subsets of the database have been checked, and in these tests about five percent of the events failed at least one of the rules defined for their event class.
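
The kind of check described above can be illustrated with a minimal Python sketch. It is not the thesis's rule language or tool: the event class "Acquisition", the field names, the two rules, and the names Rule, Violation, RULES_BY_CLASS and check_event are all hypothetical, chosen only to show how a checker can report which part of an entry breaks which rule.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    """A semantic rule tied to the field it is considered to constrain."""
    name: str
    field: str  # the part of the entry blamed when the rule fails
    check: Callable[[dict[str, Any]], bool]


@dataclass(frozen=True)
class Violation:
    """One broken rule, recorded together with the offending part of the entry."""
    event_id: str
    rule: str
    field: str


# Hypothetical rules for a hypothetical event class (assumption, not from the thesis).
RULES_BY_CLASS: dict[str, list[Rule]] = {
    "Acquisition": [
        Rule("acquirer_present", "acquirer",
             lambda e: bool(e.get("acquirer"))),
        Rule("distinct_parties", "target",
             lambda e: e.get("acquirer") != e.get("target")),
    ],
}


def check_event(event: dict[str, Any]) -> list[Violation]:
    """Return one Violation per rule of the event's class that the event breaks."""
    rules = RULES_BY_CLASS.get(event["class"], [])
    return [
        Violation(event["id"], rule.name, rule.field)
        for rule in rules
        if not rule.check(event)
    ]


if __name__ == "__main__":
    events = [
        {"id": "e1", "class": "Acquisition", "acquirer": "A", "target": "B"},
        {"id": "e2", "class": "Acquisition", "acquirer": "A", "target": "A"},
    ]
    for event in events:
        for violation in check_event(event):
            print(violation)
```

In this sketch the list returned by check_event plays the role of the stored evaluation result: an empty list means the event fulfils every rule of its class, and each Violation names both the rule and the field that caused the failure.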

