Microblog-genre noise and impact on semantic annotation accuracy

Leon Derczynski, Niraj Aswani, Diana Maynard, Kalina Bontcheva

doi:10.1145/2481492.2481495

Abstract

Using semantic technologies for mining and intelligent information access to microblogs is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Semantic annotation of tweets is typically performed in a pipeline, comprising successive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). Consequently, errors are cumulative, and earlier-stage problems can severely reduce the performance of final stages. This paper presents a characterisation of genre-specific problems at each semantic annotation stage and the impact on subsequent stages. Critically, we evaluate impact on two high-level semantic annotation tasks: named entity detection and disambiguation. Our results demonstrate the importance of making approaches specific to the genre, and indicate a diminishing returns effect that reduces the effectiveness of complex text normalisation.
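The staged design described in the abstract can be sketched in a few lines. The following is a toy illustration only, not the authors' system: every stage is a deliberately naive stand-in (whitespace tokenisation, capitalisation-based tagging, a one-entry lookup table), and all names (`Doc`, `annotate`, `TOY_KB`) are hypothetical. It is meant to show how a problem in an early stage can silently empty the output of later ones.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Carries the text plus whatever each stage has added so far."""
    text: str
    lang: str = ""
    tokens: List[str] = field(default_factory=list)
    pos_tags: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    links: Dict[str, str] = field(default_factory=dict)


# Hypothetical one-entry knowledge base standing in for DBpedia.
TOY_KB = {"Sheffield": "http://dbpedia.org/resource/Sheffield"}


def identify_language(doc: Doc) -> Doc:
    doc.lang = "en"  # toy stand-in: always guesses English
    return doc


def tokenise(doc: Doc) -> Doc:
    doc.tokens = doc.text.split()  # toy stand-in: whitespace only
    return doc


def tag_pos(doc: Doc) -> Doc:
    # Toy stand-in: capitalised tokens are proper nouns, the rest unknown.
    doc.pos_tags = ["NNP" if tok[:1].isupper() else "X" for tok in doc.tokens]
    return doc


def recognise_entities(doc: Doc) -> Doc:
    # Depends entirely on the POS tags produced upstream.
    doc.entities = [t for t, tag in zip(doc.tokens, doc.pos_tags) if tag == "NNP"]
    return doc


def disambiguate(doc: Doc) -> Doc:
    # Can only link entities that the previous stage actually found.
    doc.links = {e: TOY_KB[e] for e in doc.entities if e in TOY_KB}
    return doc


PIPELINE: List[Callable[[Doc], Doc]] = [
    identify_language, tokenise, tag_pos, recognise_entities, disambiguate,
]


def annotate(text: str) -> Doc:
    """Run each stage in order; each one sees only what came before it."""
    doc = Doc(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


clean = annotate("flying to Sheffield tomorrow")
noisy = annotate("flying to sheffield tomorrow")  # lower-cased, as tweets often are
```

In this sketch `clean.links` contains the Sheffield URI while `noisy.links` is empty: the toy tagger misses the lower-cased name, so neither recognition nor disambiguation has anything to work with.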

Highlights

Semantic annotation is the process of tying machine-tractable semantic models to natural language text
Semantic annotation marks every mention in a text of a concept from the ontology, attaching metadata that refers to the concept's URI
Reliable semantic annotation of user-generated content is an enabler for other semantic technologies [4], including opinion mining [28], summarisation [38], semantic-based search, recommendation, visual analytics, and user and community modelling [41]
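As a rough illustration of what "metadata referring to their URIs" can look like in practice, the sketch below stores each mention as a character span paired with a concept URI. The names `Annotation`, `annotate_mentions` and `TOY_LABELS` are hypothetical, the two-entry label table stands in for a real ontology, and exact string matching is a simplification rather than the method evaluated in the paper.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A standoff annotation: a character span plus the concept it denotes."""
    start: int
    end: int
    uri: str


# Hypothetical label-to-URI table standing in for an ontology.
TOY_LABELS: Dict[str, str] = {
    "Sheffield": "http://dbpedia.org/resource/Sheffield",
    "London": "http://dbpedia.org/resource/London",
}


def annotate_mentions(text: str, labels: Dict[str, str]) -> List[Annotation]:
    """Find exact, whole-word matches of each label and attach its URI."""
    found = []
    for label, uri in labels.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text):
            found.append(Annotation(match.start(), match.end(), uri))
    return sorted(found, key=lambda a: a.start)


text = "From London to Sheffield"
annotations = annotate_mentions(text, TOY_LABELS)
```

Keeping the annotations separate from the text (rather than rewriting the text) means the original tweet stays untouched and several annotation layers can refer to the same offsets.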

Summary

Introduction

Semantic annotation is the process of tying machine-tractable semantic models to natural language text. In recent years, social media – and microblogging in particular – have established themselves as high-value, high-volume content, which organisations increasingly wish to analyse automatically. Reliable semantic annotation of user-generated content is an enabler for other semantic technologies [4], including opinion mining [28], summarisation [38], semantic-based search, recommendation, visual analytics, and user and community modelling [41]. It is relevant in many application contexts [12], including knowledge management, competitor intelligence, customer relationship management, eBusiness, eScience, eHealth, and eGovernment.

Objectives

Methods

Findings

Conclusion