Management of uncertain data: towards unattended integration

A De Keijzer

doi:10.3990/1.9789036526197

Abstract

In recent years, the need to support uncertain data has increased. Sensor applications, for example, must deal with the inherent uncertainty of sensor readings. Current database management systems (DBMSs) are not equipped to deal with this uncertainty, other than as a user-defined attribute. This forces the user of the DBMS to take responsibility for managing the uncertainty associated with the data. In this thesis, we present a new data model, based on XML, that is capable of storing uncertainty about elements and subtrees. The XML data model is extended in such a way that probabilities can be associated with elements and subtrees, dependence and independence of elements can be expressed, and even the existence of entire elements or subtrees can be uncertain. We give a sound semantic foundation for dealing with the uncertainty associated with the data, and show how querying works under this semantics. The probabilistic XML data model is used in an information integration application. Decisions about equality are postponed whenever the integration system is uncertain about them. This uncertainty is stored using the probabilistic XML data model, which allows the integration process itself to run unattended. The amount of uncertainty arising from this integration can be large. We therefore introduce knowledge rules that help decide on equality during the integration phase. Using these rules, integrated documents contain less uncertainty and are therefore smaller in size. We also introduce two measures with which the amount of uncertainty in a document can be quantified. The first measure, uncertainty density, quantifies the amount of uncertainty in the database. The second measure, answer decisiveness, quantifies the ease with which the most likely possibilities in query results can be chosen. At a later stage, when the user is querying the information source and is therefore already actively using the system, feedback can be provided on query results.
This feedback is explained in the same semantic setting as querying. A feedback statement is either positive, i.e. the query result can be observed in the real world, or negative, i.e. the query result cannot be observed in the real world. We show that this feedback technique, if used with caution, reduces the amount of uncertainty and lets the information source converge to a correctly integrated document. To measure the quality of query results, we adapt precision and recall to probabilistic data so that, for example, incorrect answers with a low probability do not have the same negative impact as incorrect answers with a high probability.
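
The idea behind probability-aware precision and recall can be illustrated with a minimal sketch. The formulation below is an assumption made for illustration and is not necessarily the definition used in the thesis: each answer contributes its probability instead of a count of one, so that precision is the probability mass of the correct answers divided by the total probability mass returned, and recall is the probability mass of the correct answers divided by the number of correct answers. The function name `weighted_precision_recall` and the example data are hypothetical.

```python
from typing import Dict, Set, Tuple


def weighted_precision_recall(
    answers: Dict[str, float], truth: Set[str]
) -> Tuple[float, float]:
    """Probability-weighted precision and recall (illustrative assumption).

    answers: maps each returned answer to its probability in the query result.
    truth:   the set of answers that are correct in the real world.
    """
    # Total probability mass of everything the query returned.
    total_mass = sum(answers.values())
    # Probability mass assigned to answers that are actually correct.
    correct_mass = sum(p for a, p in answers.items() if a in truth)

    # Guard against empty results and an empty ground truth.
    precision = correct_mass / total_mass if total_mass > 0 else 0.0
    recall = correct_mass / len(truth) if truth else 0.0
    return precision, recall


# The same incorrect answer ("Bob") with a low and with a high probability.
truth = {"Alice"}
p_low, r_low = weighted_precision_recall({"Alice": 0.9, "Bob": 0.1}, truth)
p_high, r_high = weighted_precision_recall({"Alice": 0.9, "Bob": 0.8}, truth)
```

Under this assumed formulation, the incorrect answer lowers precision much less when its probability is 0.1 than when it is 0.8, while recall depends only on the probability given to the correct answer.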

Full Text