Annotated Corpus Research Articles

As databases make Corpus Linguistics a common tool for most linguists, corpus annotation becomes an increasingly important process. Corpus users need not only raw data but also annotated data that has been tagged or parsed according to annotation protocols. One problem with corpus annotation lies in its reliability, that is, in the probability that its results can be replicated by independent researchers. Inter-annotator agreement (IAA) evaluates the probability that different annotators applying the same protocol reach similar results. To measure agreement, different statistical metrics are used. This study applies IAA for the first time to the Valencia Español Coloquial (Val.Es.Co.) discourse segmentation model, designed for segmenting and labelling spoken language into discourse units. Whereas most IAA studies merely label a set of predefined units, this study applies IAA to the Val.Es.Co. protocol, which involves a more complex two-fold process: first, the speech continuum needs to be divided into units; second, the units have to be labelled. Krippendorff's uα-family statistical metrics (Krippendorff et al. 2016) allow measuring IAA in both segmentation and labelling tasks. Three expert annotators segmented a spontaneous conversation into subacts, the minimal discursive unit of the Val.Es.Co. model, and labelled the resulting units according to a set of 10 subact categories. Krippendorff's uα coefficients were applied in several rounds to determine whether including a larger number of categories, and distinguishing between them, had an impact on the agreement results. The conclusions show high levels of IAA, especially in the annotation of procedural subact categories, where results reach coefficients over 0.8. This study validates the Val.Es.Co. model as an optimal method to fully analyze a conversation into pragmatically-based discourse units.
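As a rough illustration of how a chance-corrected agreement coefficient works, the sketch below computes the standard Krippendorff's alpha for nominal labels on units that are already segmented. It is not the uα-family coefficient used in the study, which additionally accounts for where annotators place unit boundaries. The function name `krippendorff_alpha_nominal` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that the
    annotators assigned to one pre-segmented unit (hypothetical input format).
    """
    # Coincidence matrix: every ordered pair of labels given to the same unit
    # by two different annotators contributes 1 / (m - 1), m = labels in unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit labelled only once carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Toy example: two annotators, four units, one disagreement.
toy_units = [["A", "A"], ["A", "B"], ["B", "B"], ["B", "B"]]
alpha = krippendorff_alpha_nominal(toy_units)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. Whether a figure such as the 0.8 reported above counts as high depends on the conventions of the field.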

Background: Drug repurposing seeks new indications for approved drugs and is essential for investigating new uses of approved or investigational drugs. The Active Gene Annotation Corpus (AGAC), annotated by human experts, was developed to support knowledge discovery for drug repurposing. The AGAC track of the BioNLP Open Shared Tasks, which uses this corpus, is organized by EMNLP-BioNLP 2019; its “selective annotation” attribute makes the AGAC track more challenging than traditional sequence labeling tasks. In this work, we present our methods for trigger word detection (Task 1) and thematic role identification (Task 2) in the AGAC track. As a step forward in drug repurposing research, our work can also be applied to large-scale automatic extraction of knowledge from medical text.

Methods: To meet the challenges of the two tasks, we treat Task 1 as medical named entity recognition (NER), which identifies molecular phenomena related to gene mutation, and Task 2 as relation extraction, which captures the thematic roles between entities. We exploit pre-trained biomedical language representation models (e.g., BioBERT) in an information extraction pipeline for collecting mutation-disease knowledge from PubMed. Moreover, we design the fine-tuning framework using a multi-task learning technique and extra features. We further investigate different approaches to consolidating and transferring knowledge from varying sources and illustrate the performance of our model on the AGAC corpus. Our approach is based on fine-tuned BERT, BioBERT, NCBI BERT, and ClinicalBERT using multi-task learning. Further experiments show the effectiveness of knowledge transformation and of ensembling the models of the two tasks. We compare the performance of various algorithms and conduct an ablation study on the development set of Task 1 to examine the effectiveness of each component of our method.

Results: Compared with competitor methods, our model obtained the highest Precision (0.63), Recall (0.56), and F-score (0.60) in Task 1, ranking first. It outperformed the baseline method provided by the organizers by 0.10 in F-score. The model shared the same encoding layers for the named entity recognition and relation extraction parts. We also obtained the second-highest F-score (0.25) in Task 2 with a simple but effective framework.

Conclusions: Experimental results on the benchmark corpus, the annotation of genes with active mutation-centric function changes, show that integrating pre-trained biomedical language representation models (i.e., BERT, NCBI BERT, ClinicalBERT, BioBERT) into an information extraction pipeline with multi-task learning can improve the ability to collect mutation-disease knowledge from PubMed.

Articles published on Annotated Corpus

Detecting Offensive Language in Romanian Social Media

A Unified Framework of Medical Information Annotation and Extraction for Chinese Clinical Text

Research on the Construction of Chinese Argument Corpus

Using textual volunteered geographic information to model nature-based activities: A case study from Aotearoa New Zealand

Inter-annotator agreement in spoken language annotation: Applying uα-family coefficients to discourse segmentation

Developing Core Technologies for Resource-Scarce Nguni Languages

A Domain Based Approach to Semantic Lexicon Expansion

Toward Data-Driven Collaborative Dialogue Systems: The JILDA Dataset

PheneBank: a literature-based database of phenotypes.

The Homeric Dependency Lexicon

Fine-grained legal entity annotation: A case study on the Brazilian Supreme Court

A Transformer-Based Approach to Multilingual Fake News Detection in Low-Resource Languages

Drug knowledge discovery via multi-task learning and pre-trained models

PTPARL-D: an annotated corpus of forty-four years of Portuguese parliamentary debates

NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

The first annotated corpus of historical Basque

DISCO PAL: Diachronic Spanish sonnet corpus with psychological and affective labels

Hierarchical self attention based sequential labelling model for Bhojpuri, Maithili and Magahi languages

Measuring Orthogonal Mechanics in Linguistic Annotation Games

COVID-19 recommender system based on an annotated multilingual corpus.
