Abstract

Analysis of extreme-scale data is an emerging research topic; the explosion in available data raises the need for suitable content verification methods and tools that reduce the analysis and processing time of various applications. Personal data, for example, are a very valuable source of information for purposes such as marketing, billing and forensics. However, the extraction of such data (referred to as person instances in this study) often yields duplicate or similar entries about persons that are not easily detectable by end users. In this light, the authors of this study present a machine learning- and deep learning-based approach to mitigate the problem of duplicate person instances. The main concept of this approach is to gather different types of information referring to persons, compare different person instances and predict whether they are similar or not. By using the Jaro algorithm to calculate person attribute similarity and by cross-examining the information available for person instances, recommendations can be provided to users on whether two person instances are similar. The degree of importance of each attribute was also examined, in order to gain better insight into which of the declared features play a more important role.
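As an illustration of the attribute comparison described in the abstract, the following is a minimal sketch of the standard Jaro similarity applied attribute by attribute to two person instances. It is not the authors' implementation: the attribute names in `ATTRIBUTES` and the helper `attribute_similarities` are hypothetical, chosen only to show how per-attribute scores could form the feature vector that a classifier receives.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # transpositions

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


# Hypothetical attribute names; the study's actual five features are not listed here.
ATTRIBUTES = ("first_name", "last_name", "address", "email", "phone")


def attribute_similarities(a: dict, b: dict) -> dict:
    """Return one Jaro score per attribute for a pair of person instances."""
    return {
        attr: jaro(a.get(attr, "").lower(), b.get(attr, "").lower())
        for attr in ATTRIBUTES
    }
```

For the classic pair `jaro("MARTHA", "MARHTA")` this gives approximately 0.944. The dictionary returned by `attribute_similarities` can then serve as the input row for whichever classifier decides whether the two instances are similar.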

Highlights

  • This section is divided into three subsections: the first one presents the results for the machine learning (ML) algorithms that take the imbalanced dataset as input; the second one presents the results for the ML algorithms which use the up-sampled balanced dataset as input; and the last one presents the importance of the five features calculated using the coefficient calculation

  • The algorithms that took the imbalanced dataset of 4950 records as input were logistic regression, the neural network, the random forest and the penalized support vector classifier

  • Did not improve the performance of the neural networks (NNs)
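The highlights mention an up-sampled, balanced version of the dataset. A minimal sketch of one common way to produce it, random up-sampling of the minority class with replacement, is given below. The function name `upsample_minority`, the `label` key and the fixed seed are assumptions made for illustration; the source does not state which up-sampling procedure was actually used.

```python
import random


def upsample_minority(records, label_key="label", seed=0):
    """Duplicate randomly chosen minority-class records until all classes
    have as many records as the largest class (sampling with replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Group the records by class label.
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_key], []).append(record)

    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing number of records from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Up-sampling should be applied to the training split only, so that duplicated records do not leak into the evaluation data.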


Summary

Introduction

The ongoing Fourth Industrial Revolution has shifted everyday human activities toward a more digitized form. Societies across the world are becoming increasingly digitized across a wide spectrum of their activities, e.g., financial transactions, communication, social interactions and work. This continuous digitalization of our hyper-connected society has enabled the generation of vast volumes of data. Global internet traffic has increased dramatically over the last 30 years and continues its upward trend. According to Cisco [1], the annual network traffic for 2020 was 2.3 zettabytes or 61,386 GB per second. The projection of the global internet traffic according to the World Data Bank is that, in

