Abstract

Information extraction (IE) is an important problem in the Natural Language Processing (NLP) and Web Mining communities. Recently, IE has been applied to online sex advertisements with the goal of powering search and analytics systems that can help law enforcement investigate human trafficking (HT). Extracting key attributes such as names, phone numbers and addresses from online sex ads is extremely challenging, since such webpages contain boilerplate, obfuscation, and extraneous text in unusual language models. Assessing the quality of an IE system is important, and it is particularly difficult in this domain because gold-standard datasets are lacking. Furthermore, building a robust ground truth from scratch is an expensive and time-consuming task for social scientists and law enforcement to undertake. In this article, we take on the empirical challenge of analyzing the quality of IE outputs in the HT domain without laboriously annotated ground truths. Specifically, we use concepts from network science to construct and study an extraction graph from IE outputs collected over a corpus of online sex ads. Our studies show that network metrics, which require no labeled ground truth, share interesting and consistent correlations with IE accuracy metrics (e.g., precision and recall) that do require a ground truth. Our methods can potentially be applied to compare the quality of different IE systems in the HT domain without access to a ground truth.
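
To make the idea of an extraction graph concrete, the following is a minimal sketch, not the authors' construction. It assumes a bipartite graph in which each ad is linked to the (attribute, value) pairs an IE system extracted from it, and it computes a few structural metrics that need no labeled ground truth. The function names, the toy data, and the choice of metrics are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_extraction_graph(extractions):
    """Build an undirected bipartite graph from IE outputs.

    `extractions` maps an ad id to a list of (attribute, value) pairs.
    This graph design is an illustrative assumption, not the paper's.
    """
    graph = defaultdict(set)
    for ad_id, pairs in extractions.items():
        ad_node = ("ad", ad_id)
        graph[ad_node]  # touch the key so ads with no extractions still appear
        for attribute, value in pairs:
            value_node = (attribute, value)
            graph[ad_node].add(value_node)
            graph[value_node].add(ad_node)
    return dict(graph)


def connected_components(graph):
    """Return the connected components as a list of node sets (iterative DFS)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def structural_metrics(graph):
    """Compute simple ground-truth-free metrics over the extraction graph."""
    components = connected_components(graph)
    n_nodes = len(graph)
    # Each undirected edge is stored twice, once per endpoint.
    n_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "components": len(components),
        "largest_component_fraction": (
            max(len(c) for c in components) / n_nodes if n_nodes else 0.0
        ),
        "mean_degree": 2 * n_edges / n_nodes if n_nodes else 0.0,
    }


# Hypothetical toy IE output: two ads share a phone number, one ad is empty.
toy_extractions = {
    "ad1": [("phone", "555-0100"), ("name", "alice")],
    "ad2": [("phone", "555-0100")],
    "ad3": [("name", "bella")],
    "ad4": [],
}
metrics = structural_metrics(build_extraction_graph(toy_extractions))
print(metrics)
```

Running the same metrics over the outputs of two IE systems on the same corpus gives a ground-truth-free basis for comparison; whether any particular metric tracks precision or recall is the empirical question the article studies.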

Highlights

  • Information extraction (IE) is a broad area in both the Natural Language Processing (NLP) and the Web communities (Chang et al. 2006a, b).

  • Our results suggest that structural metrics, which can be computed in an unsupervised manner without access to a ground truth, can be used to study whether a given IE system is deviating from the ground truth on a quality metric such as precision or F-measure.

  • Because we cannot assume a ground truth, the evaluation is conducted using network-theoretic techniques. Each of these fields of study has received much research attention, as we describe in the sub-sections below.

Summary

Introduction

Information extraction (IE) is a broad area in both the Natural Language Processing (NLP) and the Web communities (Chang et al. 2006a, b). The main goal of IE is to extract useful information from raw documents and webpages. Traditional IE, which this article adopts, assumes a particular schema according to which information must be extracted and typed. Domain-specific applications, such as human trafficking, generally require the schema to be specific and fine-grained, supporting attributes of interest to investigators, including phone number, address and physical features such as hair color and eye color (Fig. 1). Some attributes (e.g., phone number) may occur as ‘links’ and are not directly visible in the text on the page. There is considerable heterogeneity, both across webpages in the same Web domain (e.g., two individual webpages from backpage.com) and across Web domains.
