Large Uncertain Databases Research Articles

In the current era of information, communication, and technology, data is being generated at an exponential rate. This gives machine learning and data mining algorithms an opportunity to learn from huge data repositories. At the same time, however, big data poses many challenges, with data uncertainty being a key concern for modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to help decision makers understand non-trivial data trends. The usual technique for finding frequent itemsets in uncertain databases is known as Possible World Semantics (PWS). However, as the database size increases, PWS suffers from performance issues, so more efficient frequent pattern mining algorithms are needed. This work presents three techniques to address the issue: a 3D linked array-based strategy, a connected tree technique, and an average probability-based setup supported by a tree data structure. The objective is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database once and stores the support information of each item and its associations with other items within the 3D array. In the tree-based method, a 1D array is associated with each node of the tree, comprising support information for the database items and their associations with other items. The average probability-based approach computes an average probability factor and uses it to map the uncertain database to a tree. The proposal addresses attribute uncertainty as well as tuple uncertainty when mapping large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets. The proposal is compared with four recent works in this domain for uncertain data, namely the mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), PrePost+, and uncertain Apriori (UApriori). Experiments are performed on four benchmark datasets. The results suggest better performance for the three techniques presented here, while consuming 60% less execution time.
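To make the cost argument concrete, the following is a minimal sketch of why enumerating possible worlds scales poorly and how a single pass can avoid it. It is not the paper's 3D array or tree structure. It assumes attribute-level uncertainty with independent item probabilities and uses expected support as the frequency measure, as UApriori-style algorithms commonly do; the abstract does not state which measure the authors use. The toy database `DB`, the function names, and the threshold are hypothetical.

```python
from itertools import combinations, product
from math import prod

# Hypothetical toy database: each transaction maps item -> existence probability.
DB = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6, "c": 0.8},
    {"a": 1.0, "b": 0.4, "c": 0.5},
]

def expected_support(db, itemset):
    """One pass: sum over transactions of the product of item probabilities."""
    return sum(prod(t.get(i, 0.0) for i in itemset) for t in db)

def expected_support_pws(db, itemset):
    """Brute force over every possible world (2^n worlds for n uncertain entries)."""
    slots = [(tid, item, p) for tid, t in enumerate(db) for item, p in t.items()]
    total = 0.0
    for bits in product([0, 1], repeat=len(slots)):
        # Probability of this world under the independence assumption.
        weight = prod(p if b else 1.0 - p for (_, _, p), b in zip(slots, bits))
        world = [set() for _ in db]
        for (tid, item, _), b in zip(slots, bits):
            if b:
                world[tid].add(item)
        support = sum(1 for t in world if set(itemset) <= t)
        total += weight * support
    return total

def frequent_itemsets(db, min_esup, max_size=2):
    """Naive candidate enumeration up to max_size, filtered by expected support."""
    items = sorted({i for t in db for i in t})
    result = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            s = expected_support(db, cand)
            if s >= min_esup:
                result[cand] = s
    return result

if __name__ == "__main__":
    print(frequent_itemsets(DB, min_esup=0.95))
```

On this toy input the two support functions agree, but the brute-force version already visits 128 worlds for seven uncertain entries, which illustrates the kind of blow-up that single-scan structures are meant to avoid.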

Uncertain data management has received growing attention from industry and academia. Many efforts have been made to optimize uncertain databases, including the development of special index data structures. However, none of these efforts have explored primary (clustered) indexes for uncertain databases, despite the fact that clustering has the potential to offer substantial speedups for non-selective analytic queries on large uncertain databases. In this paper, we propose a new index called a UPI (Uncertain Primary Index) that clusters heap files according to uncertain attributes with both discrete and continuous uncertainty distributions. Because uncertain attributes may have several possible values, a UPI on an uncertain attribute duplicates tuple data once for each possible value. To prevent the UPI from becoming unmanageably large, low-probability tuples are placed in a special Cutoff Index that is consulted only when queries for low-probability values are run. We also propose several other optimizations, including techniques to improve secondary index performance and techniques to reduce maintenance costs and fragmentation by buffering changes to the table and writing updates in sequential batches. Finally, we develop cost models for UPIs to estimate query performance in various settings to help automatically select tuning parameters of a UPI. We have implemented a prototype UPI and experimented on two real datasets. Our results show that UPIs can significantly (up to two orders of magnitude) improve the performance of uncertain queries both over clustered and unclustered attributes. We also show that our buffering techniques mitigate table fragmentation and keep the maintenance cost as low as or even lower than using an unclustered heap file.
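The duplication and cutoff ideas described above can be illustrated with a small in-memory sketch. This is only a toy under stated assumptions, not the prototype from the paper: it handles discrete distributions only, uses Python dictionaries in place of clustered heap files, and the class name `ToyUPI`, its methods, the `heap` lookup table, and the sample data are all hypothetical.

```python
from collections import defaultdict

class ToyUPI:
    """Toy model of a clustered index on an uncertain attribute (hypothetical)."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        # value -> [(prob, tuple_id, payload)]: tuple data duplicated per possible value.
        self.clustered = defaultdict(list)
        # value -> [(prob, tuple_id)]: low-probability alternatives, pointers only.
        self.cutoff_index = defaultdict(list)
        # Stand-in for fetching a tuple by pointer (an assumption of this sketch).
        self.heap = {}

    def insert(self, tuple_id, payload, alternatives):
        """alternatives maps each possible attribute value to its probability."""
        self.heap[tuple_id] = payload
        for value, prob in alternatives.items():
            if prob >= self.cutoff:
                self.clustered[value].append((prob, tuple_id, payload))
            else:
                self.cutoff_index[value].append((prob, tuple_id))

    def query(self, value, min_prob):
        """Return (tuple_id, prob, payload) for tuples equal to value with prob >= min_prob."""
        hits = [(tid, p, payload)
                for p, tid, payload in self.clustered.get(value, [])
                if p >= min_prob]
        # The cutoff index is consulted only when the query asks for low probabilities.
        if min_prob < self.cutoff:
            for p, tid in self.cutoff_index.get(value, []):
                if p >= min_prob:
                    hits.append((tid, p, self.heap[tid]))
        return sorted(hits)

if __name__ == "__main__":
    upi = ToyUPI(cutoff=0.1)
    upi.insert(1, "Alice", {"MIT": 0.95, "Brown": 0.05})
    upi.insert(2, "Bob", {"Brown": 0.7, "MIT": 0.3})
    print(upi.query("Brown", min_prob=0.5))
    print(upi.query("Brown", min_prob=0.01))
```

In this sketch a query with a high probability threshold touches only the clustered entries, while a low threshold also pays for the extra cutoff lookups, which mirrors the trade-off the abstract describes.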

Articles published on Large Uncertain Databases

Probabilistic Reasoning at Scale: Trigger Graphs to the Rescue

Approximation of Probabilistic Maximal Frequent Itemset Mining Over Uncertain Sensed Data

On Efficient Mining of Frequent Itemsets from Big Uncertain Databases

FiDoop: Parallel Mining of Frequent Item Sets Using MapReduce

Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases

An Improved Algorithm for Efficient Mining of Frequent Item Sets on Large Uncertain Databases

Efficient Mining of Frequent Item Sets on Large Uncertain Databases

Model-based probabilistic frequent itemset mining

UPI
