Abstract

Anomalies are records that behave differently from, and do not conform to, the remaining records in a dataset. Outlier analysis is the task of finding such anomalies. Detecting outliers efficiently is an important problem in many fields of science, medicine and technology. Many methods are available to detect anomalies in numerical datasets, but only a limited number are available for categorical datasets. In this work, a novel entropy-based method to detect outliers in categorical data is proposed. The algorithm finds anomalies based on a score computed for each record, called the BAD score, and has great intuitive appeal. It utilizes the frequency of each value in the dataset. The Greedy method needs k scans of the dataset to find k outliers, whereas the proposed method needs only one scan and calculates the BAD score of each record directly. It avoids the need to supply k as an input and can find any number of outliers directly from the dataset. The AVF method has lower time complexity than other methods such as Greedy, FPOF and FDOD. Greedy has better accuracy than other methods such as AVF, and FPOF and FDOD (which are based on frequent patterns of all combinations of values in each record). Our algorithm shows better accuracy than both AVF and Greedy, and its time complexity comes close to that of AVF. The algorithm has been applied to the Nursery and Bank datasets taken from the UCI Machine Learning Repository. In this work, it is proposed to extend the Normal distribution [11] and the Fuzzy concept [12] to the BAD score [13]; that is, NAVF combined with Fuzzy AVF is applied to the BAD score. Numerical attributes are excluded from the datasets for our analysis. The experimental results show that the method is efficient for outlier detection in categorical datasets.
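
To illustrate why the Greedy method needs k scans, the following is a naive sketch of an entropy-based greedy selection. It assumes that the dataset entropy is the sum of the per-attribute Shannon entropies and that each scan removes the record whose removal lowers that entropy the most. The function names `dataset_entropy` and `greedy_outliers` are hypothetical, and this is not the BAD score proposed in the paper, which the abstract does not define.

```python
from collections import Counter
from math import log2


def dataset_entropy(records):
    """Sum of per-attribute Shannon entropies (attributes treated as independent)."""
    n = len(records)
    if n == 0:
        return 0.0
    total = 0.0
    for column in zip(*records):  # iterate over attributes
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def greedy_outliers(records, k):
    """Return the indices of k records chosen greedily, one per scan.

    Assumption: in each scan, the record whose removal yields the lowest
    entropy of the remaining records is marked as an outlier.
    This naive version recomputes the entropy from scratch for every candidate.
    """
    remaining = list(range(len(records)))
    outliers = []
    for _ in range(k):  # one full scan per requested outlier
        best_idx, best_entropy = None, float("inf")
        for i in remaining:
            rest = [records[j] for j in remaining if j != i]
            entropy = dataset_entropy(rest)
            if entropy < best_entropy:
                best_idx, best_entropy = i, entropy
        outliers.append(best_idx)
        remaining.remove(best_idx)
    return outliers
```

The outer loop runs once per requested outlier, which is the k-scan cost that the abstract contrasts with a single-pass, per-record score.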

Highlights

  • Outlier analysis is an important research area in many fields, such as networks, medicine and business decisions

  • Most existing systems concentrate on numerical or ordinal attributes, and sometimes categorical attribute values are converted into ordinal values

  • The Attribute Value Frequency (AVF) method is one of the most efficient methods for detecting outliers in categorical data in terms of time complexity, while Greedy is among the best in accuracy. AVF calculates the frequency of each value in each attribute and derives its probability, computes the AVF score of each record by averaging these probabilities, and selects the top k outliers as the records with the lowest AVF scores
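
The AVF steps described in the last highlight can be sketched as follows. This is a minimal illustration, assuming each record is a tuple of categorical values with the same number of attributes; the function names `avf_scores` and `avf_outliers` are hypothetical and not taken from the paper.

```python
from collections import Counter


def avf_scores(records):
    """AVF score of each record: the average relative frequency of its values."""
    n = len(records)
    m = len(records[0])
    # One frequency table per attribute, built in a single pass over each column.
    freq = [Counter(column) for column in zip(*records)]
    return [
        sum(freq[j][record[j]] / n for j in range(m)) / m
        for record in records
    ]


def avf_outliers(records, k):
    """Indices of the k records with the lowest AVF scores."""
    scores = avf_scores(records)
    return sorted(range(len(records)), key=lambda i: scores[i])[:k]
```

For example, among the records `("red", "small")`, `("red", "small")`, `("red", "large")` and `("blue", "large")`, the last one contains the rarest value and therefore receives the lowest score.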


Summary

Introduction

Outlier analysis is an important research area in many fields, such as networks, medicine and business decisions. The parameters used in FPOF and FDOD are σ, a threshold value for deciding the frequent itemsets in each data object, and k, the number of outliers. This method has many drawbacks, such as the difficulty of finding a correct model for different datasets, and the efficiency of these models decreases as the number of dimensions increases [4]. The remedy for this problem is to apply Principal Component Analysis. Knorr et al. [5] achieved some improvements in the distance-based algorithms. They explained that the part of the dataset's records belonging to each outlier must be less than some threshold value. These density-based methods have the advantage that they can detect outliers missed by techniques that use a single, global criterion. These methods find characteristics of objects instead of finding distances, densities and statistical parameters.

Terminology
Experimental results
Sample Method
Conclusion and Future work
