Summary of the first ACM SIGKDD workshop on knowledge discovery from uncertain data (U'09)

Jian Pei, Ander De Keijzer, Lise Getoor

doi:10.1145/1809400.1809419

Abstract

The importance of uncertain data is growing quickly in many essential applications such as environmental monitoring, mobile object tracking and data integration. Recently, storing, collecting, processing, and analyzing uncertain data has attracted increasing attention from both academia and industry. Analyzing and mining uncertain data requires collaboration and joint effort from multiple research communities. Based on this motivation, we ran the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U’09) in conjunction with the 2009 SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09) in Paris. The focus of this workshop was to bring together and bridge research in reasoning under uncertainty, probabilistic databases and mining uncertain data. Work in statistics and probabilistic reasoning can provide models for representing uncertainty, work in the probabilistic database community can provide methods for storing and managing uncertain data, and work on mining uncertain data can define data analysis tasks and methods. It is important to build connections among those communities to tackle the overall problem of analyzing and mining uncertain data. The communities share many challenges. One is understanding the different modeling assumptions made and how they affect the methods, in terms of both accuracy and efficiency. Different researchers hold different assumptions about the semantics of probabilistic models and uncertainty, which is one of the major obstacles in research on mining uncertain data. Another challenge is the scalability of the proposed management and analysis methods. Finally, to make analysis and mining useful and practical, we need real data sets for testing. Unfortunately, uncertain data sets are often hard to obtain and hard to share.
The theme of this workshop was to make connections among the research areas of probabilistic databases, probabilistic reasoning, and data mining, as well as to build bridges among the aspects of models, data, applications, novel mining tasks and effective solutions. By making connections among different communities, we aimed to understand each other in terms of scientific foundations as well as commonalities and differences in research methodology. Although the workshop was allocated only half a day, we had a very dynamic and exciting program. The workshop was one of the best attended held in conjunction with the conference. There were about 40 attendees when the workshop started.
We were lucky to have two excellent invited talks in the workshop. Professor Christopher Jermaine at Rice University gave a talk on “Managing and Mining Uncertain Data: What Might We Do Better?”. In this talk, he expressed a few of his strongly held opinions on the management and mining of uncertain data. He argued that those who work in the field should listen very carefully to complaints from machine learning experts, who often say, “but all of our methods were already designed to work with uncertain data, so you are wasting your time!” Furthermore, he contended that too much work aimed at managing uncertainty is tightly coupled to first-order logic and related ideas. He also argued that Bayesian approaches and Monte Carlo methods should be much more widely employed in this area. Finally, he argued that too much work in this area neglects the application domains where uncertainty is most important: “what if” analysis, risk assessment, and prediction.
In his invited talk titled “Querying and Mining Uncertain Data: Methods, Applications, and Challenges”, Dr. Matthias Renz at Ludwig-Maximilians-Universität (LMU) München summarized several very interesting projects in his group exploring various aspects of mining uncertain data, particularly from the point of view of efficiency.
The efficiency concern is particularly important for modern databases, since they allow users to incorporate the uncertainty of data in the hope of increasing the quality of query results. Dr. Renz gave an overview of modeling uncertain data in feature spaces and illustrated diverse probabilistic similarity search methods, which are important tools for many mining applications. In this context, he discussed some current methods as well as the challenges in clustering uncertain data and mining probabilistic rules.
The two invited talks were very successful: they led to interesting discussions among the audience and the invited speakers, and they helped to highlight the interdisciplinary nature of the workshop.
The program committee accepted eight papers; four of them were given 15-minute presentations and the other four 10-minute presentations. In the paper titled “Efficient Algorithms for Mining Constrained Frequent Patterns from Uncertain Data”, Leung and Brajczuk argue that constrained frequent pattern mining from uncertain data is important, since constrained frequent pattern mining and mining frequent patterns from uncertain data often arise together in common applications such as analyzing medical laboratory data. They developed
