Abstract

Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Markyt also supports the description of entity and relation annotation guidelines on a per-project basis, and it accommodates partial word tagging and overlapping annotations. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as a means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, tweets or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use at http://markyt.org. Database URL: http://markyt.org

Highlights

  • The availability of large, manually annotated text corpora is highly desirable for the development of text mining methods and the robust evaluation and comparison of alternative approaches

  • Demos of various annotation projects are available as part of Markyt documentation

  • The corpora were selected based on their nature and specifics, aiming to show the annotation features and analysis capabilities of Markyt at different levels of complexity


Introduction

The availability of large, manually annotated text corpora is highly desirable for the development of text mining methods and the robust evaluation and comparison of alternative approaches. The corpus production workflow may be adapted to the specifics of a given application domain, but there are common issues to attend to, such as conversion into and out of different formats and the execution of multiple annotation rounds by multiple annotators, with consistency evaluations at several points. Markyt's underlying design principles include (i) general purpose application, i.e. domain specifications are considered only in project configuration and do not affect the general behaviour of the software, (ii) a modular and flexible architecture, which enables seamless component extension, (iii) a user-friendly and continuously improved interface for human curators, and (iv) powerful analytical abilities that enable corpus quality assessment throughout the whole production cycle. Markyt enables the creation of multi-user and multi-round annotation projects and implements analytical functionalities for comprehensively assessing both the consistency of individual annotators over time and inter-annotator agreement (IAA). Regardless of the specifics of each project, the main objectives are to reach a harmonized interpretation of the annotation guidelines among human curators and to achieve an annotator consensus, i.e. produce a final, high-quality version of the corpus.
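
To make the notions of inter-annotator agreement and annotator consensus more concrete, the following minimal Python sketch computes a pairwise exact-match F-score between annotators and derives a consensus set by vote counting. It is an illustration only and does not reproduce Markyt's implementation: the tuple layout (document id, start offset, end offset, entity type), the function names `pairwise_f1` and `majority_consensus`, the `min_votes` threshold and the sample data are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data model: an annotation is the tuple
# (document id, start offset, end offset, entity type).
annotations = {
    "A": [("d1", 0, 7, "DRUG"), ("d1", 20, 29, "DISEASE")],
    "B": [("d1", 0, 7, "DRUG"), ("d1", 20, 25, "DISEASE")],
    "C": [("d1", 0, 7, "DRUG"), ("d1", 20, 29, "DISEASE")],
}


def pairwise_f1(first, second):
    """Exact-match F-score between two annotators' annotation sets."""
    first, second = set(first), set(second)
    if not first and not second:
        return 1.0  # two empty sets agree trivially
    matches = len(first & second)
    if matches == 0:
        return 0.0
    precision = matches / len(second)
    recall = matches / len(first)
    return 2 * precision * recall / (precision + recall)


def majority_consensus(by_annotator, min_votes):
    """Keep every annotation proposed by at least min_votes annotators."""
    votes = Counter()
    for annotator_annotations in by_annotator.values():
        # set() ensures one vote per annotator per annotation
        votes.update(set(annotator_annotations))
    return {ann for ann, count in votes.items() if count >= min_votes}


# Agreement for every pair of annotators, then a 2-of-3 consensus.
for left, right in combinations(sorted(annotations), 2):
    score = pairwise_f1(annotations[left], annotations[right])
    print(left, right, round(score, 2))

consensus = majority_consensus(annotations, min_votes=2)
print(sorted(consensus))
```

Exact matching is the strictest choice: annotator B's shorter DISEASE span counts as a disagreement even though it overlaps the span chosen by A and C. A relaxed, overlap-based matching criterion would be a natural variation when partial word tagging is allowed.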
