Abstract

Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only to modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically valid, highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories.
In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.
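The time-sensitive behaviour described above — using a text's publication date to restrict which thesaurus senses are considered — can be sketched as follows. This is an illustrative outline only: the `Sense` structure, the `slack` tolerance, and the example senses and category codes are assumptions, not the HTST's actual data model or implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sense:
    """One candidate thesaurus sense for a word (illustrative structure)."""
    category: str            # thesaurus category label (hypothetical codes below)
    first_use: int           # earliest attested year for this sense
    last_use: Optional[int]  # latest attested year, or None if still current

def date_valid_senses(senses, pub_year, slack=25):
    """Keep only senses attested around the text's publication year.

    `slack` widens the attested range to allow for dating uncertainty;
    it is an assumed parameter, not taken from the paper.
    """
    return [s for s in senses
            if s.first_use - slack <= pub_year
            and (s.last_use is None or pub_year <= s.last_use + slack)]

# Hypothetical candidate senses for the word "gay" in a text published in 1750.
candidates = [
    Sense("Light-heartedness", 1325, None),
    Sense("Bright colour", 1300, 1900),
    Sense("Sexual orientation", 1935, None),
]
valid = date_valid_senses(candidates, pub_year=1750)
print([s.category for s in valid])  # the 1935 sense is excluded as anachronistic
```

After this date filter, a contextual disambiguation step would still be needed to choose among the remaining historically plausible senses.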

Highlights

  • Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science

  • Over recent years, various semantic lexical resources and semantic annotation tools have been developed, such as EuroWordNet (Vossen, 1998) and the UCREL (University Centre for Computer Corpus Research on Language) Semantic Analysis System (USAS) (Rayson et al., 2004), and they have played an important role in developing intelligent natural language processing (NLP) and human language technology (HLT) systems

  • In this section, we describe our evaluation of the HTST, including test data preparation and evaluation criteria, statistical results of the HTST performance and the impacts of the main disambiguation methods implemented in the HTST (Section 6.2), and software design to improve the runtime speed of the HTST software (Section 6.3)


Summary

Introduction

Semantic analysis of natural language data is a relevant task for a wide range of research areas and practical applications, such as natural language processing, text mining, corpus linguistics and data science. Some tools are designed to identify the topic or themes of given texts (Allan, 2012), and some are designed to extract specific partial information, such as types of named entities, categories of relations between the specific named entities, and/or types of events (Miwa et al., 2012; Rizzo and Troncy, 2012; Weston et al., 2013). Another group of semantic annotation tools is designed to identify semantic categories of all lexical units based on a given classification scheme, which can support a deep, comprehensive semantic information analysis and extraction from language data.

In this paper, we present our work in designing, developing and evaluating the accuracy of a new semantic tagger: the “Historical-Thesaurus-based Semantic Tagger” (HTST). The purpose of this tool is to annotate all lexical units of texts with a fine-grained semantic categorization scheme provided by a very large-scale and high-quality English historical thesaurus (Kay et al., 2016 [2009]) (detailed further)

Related work
Abbreviations
Structure of Historical Thesaurus entries
Architecture of the HTST system
[Table of major semantic categories, partially recoverable: A: General and abstract terms; B: The body and the individual; C: Arts and crafts; F: Food and farming]
Disambiguation of HT semantic categories for words
Evaluation
Test data preparation
Impacts of disambiguation methods
Overview of main error types
Issue of speed as a resource-intensive software
Conclusion
