First Steps towards Data-Driven Adversarial Deduplication

Jose N Paredes, Marcelo A Falappa, Maria Vanina Martinez, Gerardo I Simari

doi:10.3390/info9080189

Abstract

In traditional databases, the entity resolution problem (which is also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. When addressing this problem, in both theory and practice, it is widely assumed that such sets of virtual objects appear as the result of clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address this problem under the assumption that this situation is caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets in which the participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to identify their products and services). We are therefore in the presence of a different, and even more challenging, problem that we refer to as adversarial deduplication. In this paper, we study this problem via examples that arise from real-world data on malicious hacker forums and markets arising from collaborations with a cyber threat intelligence company focusing on understanding this kind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data on which to build solutions to this problem, and develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities towards fighting cyber threats.

Highlights

The classical problem of entity resolution (or deduplication) in databases seeks to address situations in which seemingly distinct records are stored that refer to the same entity in the real world.
We argue that this problem is fundamentally different from, but closely related to, the traditional entity resolution or deduplication problem in databases, since the traditional problem is assumed to arise as a consequence of unintentional errors.
We focus on the cyber-security setting of malicious hacker forums and marketplaces on the dark web, where such intentional obfuscation is the norm.

Summary

Introduction

The classical problem of entity resolution (or deduplication) in databases seeks to address situations in which seemingly distinct records are stored that refer to the same entity (object, person, place, etc.) in the real world. The characteristic that is overwhelmingly shared among traditional approaches is the assumption that multiple records for the same real entity are the product of involuntary situations, such as simple typos during data entry procedures or ambiguity in attribute values. In the adversarial setting, the same actor typically operates using different profiles while keeping certain characteristics constant. Perhaps most importantly, such actors leave involuntary traces behind that can be analyzed and leveraged by deduplication tools. Consider the problem of trying to determine clues that point to the conclusion that a given pair of faces might correspond to the same real-world user (or perhaps to the opposite conclusion).
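One way to picture such involuntary traces is to compare the writing style of two profiles. The sketch below is only an illustration and is not the method used in the paper: it assumes that each profile is represented by a list of its posts, and it scores a pair of profiles by the cosine similarity of their character trigram counts. The function names (`char_ngrams`, `cosine_similarity`, `profile_similarity`) and the sample posts are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def profile_similarity(posts_a: list[str], posts_b: list[str], n: int = 3) -> float:
    """Score a pair of profiles by comparing the pooled n-gram counts of their posts."""
    profile_a = sum((char_ngrams(post, n) for post in posts_a), Counter())
    profile_b = sum((char_ngrams(post, n) for post in posts_b), Counter())
    return cosine_similarity(profile_a, profile_b)


# Hypothetical posts from three profiles (illustrative data only).
user_x = ["fresh accounts 4 sale, escrow accepted, pm me 4 details"]
user_y = ["accounts 4 sale, escrow accepted, pm me"]
user_z = ["Does anyone know a good tutorial about kernel debugging?"]

score_xy = profile_similarity(user_x, user_y)
score_xz = profile_similarity(user_x, user_z)
print(f"x vs y: {score_xy:.3f}, x vs z: {score_xz:.3f}")
```

A high score is at most a clue for a human analyst to follow up on, not evidence that two profiles belong to the same actor. In practice such a score would be one feature among many fed to a classifier.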

Journal: Information	Publication Date: Jul 27, 2018
Citations: 2	License type: CC BY 4.0
