Abstract

Background: The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. The dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine or the newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and high degree of term variation.

Results: In this paper, we describe FlexiTerm, a method for automatic term recognition from a domain-specific corpus, and evaluate its performance against five manually annotated corpora. FlexiTerm performs term recognition in two steps: linguistic filtering is used to select term candidates followed by calculation of termhood, a frequency-based measure used as evidence to qualify a candidate as a term. In order to improve the quality of termhood calculation, which may be affected by the term variation phenomena, FlexiTerm uses a range of methods to neutralise the main sources of variation in biomedical terms. It manages syntactic variation by processing candidates using a bag-of-words approach. Orthographic and morphological variations are dealt with using stemming in combination with lexical and phonetic similarity measures. The method was evaluated on five biomedical corpora. The highest values for precision (94.56%), recall (71.31%) and F-measure (81.31%) were achieved on a corpus of clinical notes.

Conclusions: FlexiTerm is an open-source software tool for automatic term recognition. It incorporates a simple term variant normalisation method. The method proved to be more robust than the baseline against less formally structured texts, such as those found in patient blogs or medical notes. The software can be downloaded freely at http://www.cs.cf.ac.uk/flexiterm.
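The bag-of-words idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the FlexiTerm implementation. The stop-word list, the `crude_stem` helper and the sample candidates are all hypothetical. A real system would use a proper stemmer together with the lexical and phonetic similarity measures the abstract describes, and its termhood measure would be more than the raw count used here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; an assumption made for this illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "for"}


def crude_stem(token: str) -> str:
    """Strip a plural-like trailing 's'. A stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalise(candidate: str) -> frozenset:
    """Map a term candidate to an order-independent set of stemmed tokens."""
    tokens = re.findall(r"[a-z0-9]+", candidate.lower())
    return frozenset(crude_stem(t) for t in tokens if t not in STOPWORDS)


# Hypothetical candidates, as if produced by a linguistic filter.
candidates = [
    "knee pain",
    "pain in the knee",
    "knee pains",
    "Knee-pain",
    "blood pressure",
    "pressure of blood",
]

# Variants that share a normalised form pool their frequencies.
counts = Counter(normalise(c) for c in candidates)

for bag, frequency in counts.most_common():
    print(sorted(bag), frequency)
```

Because word order, hyphenation, case and simple inflection are discarded, the four "knee pain" variants are counted as one candidate rather than four rare ones. This pooling is the effect that variant normalisation is meant to have on a frequency-based termhood score.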

Highlights

  • The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation

  • FlexiTerm is a domain-independent automatic term recognition (ATR) method; that is, it does not rely on any domain-specific knowledge to recognise terms in a domain-specific corpus

  • In order to demonstrate the portability of our method across sublanguages, i.e. languages confined to specialised domains [40], we used multiple data sets from different biomedical subdomains as well as text written by different types of authors and/or aimed at different audiences


Summary

Introduction

The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. Terms are linguistic representations of domain-specific concepts [2]. For practical purposes, they are often defined as phrases (typically nominal [3,4]) that occur frequently in texts restricted to a specific domain and have a special meaning in that domain. Terms are distinguished from other salient phrases by the measures of their unithood and termhood [4]. Termhood implies that terms carry a heavier information load than other phrases used in a sublanguage, and as such they

