Kind Of Data Distribution Research Articles

Classifier learning with datasets suffering from imbalance class distribution is a challenging task and it hinders the performance of machine learning algorithms. This imbalance occurs when a particular class is highly outnumbered than that of another class. Such kind of data distribution in the real world applications caught the attention of many researchers. This paper presents the review of various state of the art sampling techniques and ensemble techniques to resolve class imbalance. This paper also investigates the other factors such as threshold of distribution, inter or within class imbalance, etc., that make class imbalance a more complex issue. Comparisons of various approaches viz. data sampling, cost sensitive methods, bagging, boosting which alleviate the class imbalance problem are investigated in detail for their effects on class imbalance problem. Different parameters have been reviewed for measuring and evaluating the performance of the model. Accuracy is majorly used as evaluation parameter in machine learning problems, but from reviews it is found that there are different parameters such as precision, recall and AU-ROC which provide statistical measures for evaluating the model. The paper gives research directions in the domain of class imbalance problems.

Read full abstract

Performance benchmarking has played an important role in the research and development in relational DBMS, object-relational DBMS, data warehouse systems, etc. We believe that benchmarking data mining algorithms is a long overdue task, and it will play an important role in the research and development of data mining systems as well. Frequent pattern mining forms a core component in mining associations, correlations, sequential patterns, partial periodicity, etc., which are of great potential value in applications. There have been a lot of methods proposed and developed for efficient frequent pattern mining in various kinds of databases, including transaction databases, time-series databases, etc. However, so far there is no serious performance benchmarking study of different frequent pattern mining methods. To facilitate an analytical comparison of different frequent mining methods, we have constructed an open test bed for performance study of a set of recently developed, popularly used methods for mining frequent patterns in transaction databases and mining sequential patterns in sequence databases, with different data characteristics. The testbed consists of the following components. A synthetic data generator, which can generate large sets of synthetic data in various kinds of data distributions. A few large data sets from real world applications will also be provided. A good set of typical frequent pattern mining methods, ranging from classical algorithms to recent studies. The method are grouped into three classes: frequent pattern mining, max-pattern mining , and sequential pattern mining . For frequent pattern mining, we will demonstrate Apriori, hashing, partitioning, sampling, TreeProjection, and FP-growth. For maximal pattern mining, we will demonstrate MaxMiner, TreeProjection, and FP-growth-max. For sequential pattern mining, we will demonstrate GSP and FreeSpan. A set of performance curves. These algorithms their running speeds, scalabilities, bottlenecks, and performance on different data distributions, will be compared and demonstrated upon request. Some performance curves from our pre-conference experimental evaluations will also be shown. An open testbed. Our goal is to construct an extensible test bed which integrates the above components and supports an open-ended testing service. Researchers can upload the object codes of their mining algorithms, and run them in the test bed using these data sets. The architecture is shown in Figure 1. This testbed is our first step towards benchmarking data mining algorithms. By doing so, performance of different algorithms can be reported consistently, on the same platform, and in the same environment. After the demo, we plan to make the testbed available on the WWW so that it may, hopefully, benefit further research and development of efficient data mining methods.

Read full abstract

Kind Of Data Distribution Research Articles

Articles published on Kind Of Data Distribution

FLEE: A Hierarchical Federated Learning Framework for Distributed Deep Neural Network over Cloud, Edge, and End Device

Instability results for Euclidean distance, nearest neighbor search on high dimensional Gaussian data

A survey on effects of class imbalance in data pre-processing stage of classification problem

Comparative Investigations and Performance Evaluation for Multiple-Level Association Rules Mining Algorithm

Towards data mining benchmarking

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Kind Of Data Distribution Research Articles

Articles published on Kind Of Data Distribution

FLEE: A Hierarchical Federated Learning Framework for Distributed Deep Neural Network over Cloud, Edge, and End Device

Instability results for Euclidean distance, nearest neighbor search on high dimensional Gaussian data

A survey on effects of class imbalance in data pre-processing stage of classification problem

Comparative Investigations and Performance Evaluation for Multiple-Level Association Rules Mining Algorithm

Towards data mining benchmarking