Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora

Kyle Porter, Slobodan Petrovic

doi:10.1007/978-3-319-99277-8_5

Abstract

Fuzzy search is often used in digital forensic investigations to find words that are stringologically similar to a chosen keyword. However, a common complaint is the high rate of false positives in big data environments. This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches. The unique flexibility of cedas facilitates fine-tuned control of precision-recall trade-offs. Specifically, searches can be constrained to the union of matches resulting from any exact edit combination of insertion, deletion and substitution operations performed on the search term. The flexibility is leveraged in experiments involving fuzzy searches of an inverted index of the Enron corpus, a large English email dataset, which reveal the specific edit operation constraints that should be applied to achieve valuable precision-recall trade-offs. The constraints that produce relatively high combinations of precision and recall are identified, along with the combinations of edit operations that cause precision to drop sharply and the combination of edit operation constraints that maximize recall without sacrificing precision substantially. These edit operation constraints are potentially valuable during the middle stages of a digital forensic investigation because precision has greater value in the early stages of an investigation while recall becomes more valuable in the later stages.
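The idea of constraining a match to exact combinations of edit operations can be illustrated with a small sketch. This is not the cedas implementation, which the abstract does not detail; it is a plain memoized recursion, and the function names, the (insertions, deletions, substitutions) tuple layout, and the alignment-based definition of an "exact" combination are all assumptions made for illustration.

```python
from functools import lru_cache


def matches_exact_ops(keyword, word, ins, dele, sub):
    """Return True if `word` can be aligned with `keyword` using exactly
    `ins` insertions, `dele` deletions and `sub` substitutions.

    Assumption: a substitution always replaces a character with a
    different one, and operations are counted along a single alignment.
    """

    @lru_cache(maxsize=None)
    def go(a, b, i, d, s):
        # a = keyword characters consumed, b = word characters consumed;
        # i, d, s = operations still to be spent.
        if a == len(keyword) and b == len(word):
            return i == 0 and d == 0 and s == 0
        if a < len(keyword) and b < len(word):
            if keyword[a] == word[b]:
                if go(a + 1, b + 1, i, d, s):  # characters agree, no cost
                    return True
            elif s > 0 and go(a + 1, b + 1, i, d, s - 1):  # substitution
                return True
        if i > 0 and b < len(word) and go(a, b + 1, i - 1, d, s):  # insertion
            return True
        if d > 0 and a < len(keyword) and go(a + 1, b, i, d - 1, s):  # deletion
            return True
        return False

    return go(0, 0, ins, dele, sub)


def constrained_match(keyword, word, allowed):
    """Union of matches over a set of exact (ins, dele, sub) combinations."""
    return any(matches_exact_ops(keyword, word, *combo) for combo in allowed)


# Hypothetical constraint set: one substitution, or one insertion.
ALLOWED = {(0, 0, 1), (1, 0, 0)}
```

With the hypothetical set `ALLOWED`, `constrained_match("enron", "enrom", ALLOWED)` succeeds through a single substitution, while `constrained_match("enron", "enro", ALLOWED)` fails because a deletion is not among the permitted combinations.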

Highlights

Keyword search has been a staple in digital forensics since its beginnings, and a number of forensic tools incorporate fuzzy search algorithms that match text against keywords with typographical errors or keywords that are stringologically similar.
Data points labeled k = 1 and k = 2 represent the results for unconstrained fuzzy searches with edit distance thresholds set to one and two, respectively.
The application of constraints to fuzzy searches of the Enron inverted index resulted in higher recall than an unconstrained fuzzy search with an edit distance threshold of k = 1, and better precision than an unconstrained fuzzy search with an edit distance threshold of k = 2.
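For readers who want the two measures spelled out, the sketch below computes precision (the fraction of retrieved terms that are relevant) and recall (the fraction of relevant terms that are retrieved) from two sets. The function name and the toy term sets are illustrative assumptions and are not data from the Enron experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items
    against a set of relevant items."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: four terms returned by a fuzzy search, three truly relevant.
retrieved_terms = {"enron", "enrom", "apron", "heron"}
relevant_terms = {"enron", "enrom", "enronn"}
precision, recall = precision_recall(retrieved_terms, relevant_terms)
```

In this toy case two of the four retrieved terms are relevant, so precision is 0.5, and two of the three relevant terms are retrieved, so recall is 2/3. Loosening the search tends to raise recall at the cost of precision, which is the trade-off the highlights describe.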

Summary

Introduction

Advances in Digital Forensics XIV

Keyword search has been a staple in digital forensics since its beginnings, and a number of forensic tools incorporate fuzzy search (approximate string matching) algorithms that match text against keywords with typographical errors or keywords that are stringologically similar. These algorithms may be used to search inverted indexes, where every approximate match is linked to a list of documents that contain the match. Great discretion must be used when employing these forensic tools to search large datasets because many strings that match approximately may be similar in a stringological sense but completely unrelated semantically.

Since cedas implements an extension of this automaton, it is useful to discuss some key components of automata theory. The set of strings that result in a match is considered to be accepted by the automaton; this set is the language L recognized by the automaton.
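The inverted-index search described in the introduction can be sketched as follows. This is a minimal illustration of an unconstrained fuzzy search with an edit distance threshold k, using a textbook Levenshtein computation rather than the automaton-based approach of cedas; the function names and the three toy documents are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each whitespace-separated term to the set of document ids
    that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_search(index, keyword, k):
    """Return every indexed term within edit distance k of the keyword,
    linked to the sorted list of documents that contain it."""
    return {term: sorted(ids)
            for term, ids in index.items()
            if levenshtein(keyword, term) <= k}


# Toy corpus (illustrative only).
docs = {1: "enron energy trading", 2: "enrom memo", 3: "apron sale"}
index = build_inverted_index(docs)
hits_k1 = fuzzy_search(index, "enron", 1)
hits_k2 = fuzzy_search(index, "enron", 2)
```

With k = 1 the search returns the exact term and the misspelling "enrom". Raising the threshold to k = 2 also returns "apron", a term that is close to the keyword as a string but unrelated in meaning, which is the kind of false positive the introduction warns about.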

Methods

Results

Conclusion