Apollo: Near-Duplicate Detection for Job Ads in the Online Recruitment Domain

Hunter Burk, Faizan Javed, Janani Balaji

doi:10.1109/icdmw.2017.29

Abstract

Job ad data has become an essential part of the recruiting world, helping recruiters construct views of the labor market to identify emerging skills, their closest competitors, and where to get the most value for each recruiting dollar spent. Collecting this data, however, can be problematic, as job ads are posted redundantly at numerous online locations. In this paper, we detail a domain-specific near-duplicate detection methodology aimed at tackling this problem. More specifically, we discuss Apollo, a near-duplicate detection system for job ads. Apollo is in production at CareerBuilder, a large online recruitment company, and powers many downstream analytics applications. Its effectiveness, measured by precision, recall, F-score, and run time, is compared against other industry-standard deduplication methods to demonstrate its viability relative to existing approaches.

Full Text