Abstract

Named entity recognition (NER) in natural language processing covers three main entity types: flat, nested, and discontinuous. Flat NER has received the most attention from researchers, whereas nested NER remains a significant challenge. Current approaches to nested NER include sequence labeling with merged label layers, cascaded models, and methods based on reading comprehension. Among these, sequence labeling with merged label layers stands out for its simplicity and ease of implementation, yet it has known shortcomings that we aim to address. In this study, we improve the sequence labeling approach with a pipeline model split into a sequence labeling task and a text classification task. Instead of annotating specific entity categories, we merge the types into main and sub-categories so that they can be treated uniformly. These categories are then embedded as identifiers in the text to be recognized, which serves as input to the text classification task. We use BERT+BiLSTM+CRF for sequence labeling and BERT for text classification. Experiments were conducted on three nested NER datasets, GENIA, CMeEE, and GermEval 2014, with annotations ranging from four to two levels. Before model training, we separately analyzed the statistics of nested entities in the medical dataset CMeEE and the everyday-life dataset GermEval 2014. In both datasets, one particular entity category consistently dominates among nested entities. This observation suggests that labeling main and sub-entities may be useful for effective category recognition. Model performance was evaluated by F1 score, with a prediction counted as correct only when both the complete entity name and its category were identified. The results show a substantial performance improvement over the original method after our modifications. Our improved model is also strongly competitive with existing models: its F1 scores on the GENIA, CMeEE, and GermEval 2014 datasets reached 79.21, 66.71, and 87.81, respectively. While preserving the simplicity and ease of implementation of the original method, our enhanced model achieves higher performance and remains competitive with other methods.
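
The evaluation criterion described above (a prediction counts only when both the full entity span and its category match) can be illustrated with a minimal sketch. This is not the authors' evaluation script: the `Entity` tuple layout, the function name `strict_f1`, and the example labels are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

# Hypothetical entity representation (an assumption, not from the paper):
# (sentence_id, start_token, end_token, category)
Entity = Tuple[int, int, int, str]


def strict_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> float:
    """Entity-level F1 where a prediction is correct only if the
    sentence, span boundaries, and category all match a gold entity."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Nested example: the span (0, 0, 2) lies inside the span (0, 0, 5).
gold = [(0, 0, 2, "protein"), (0, 0, 5, "DNA"), (0, 7, 9, "cell_type")]
# The outer entity has the right span but the wrong category,
# so it does not count as a match.
pred = [(0, 0, 2, "protein"), (0, 0, 5, "RNA"), (0, 7, 9, "cell_type")]

score = strict_f1(gold, pred)
print(f"strict F1 = {score:.4f}")
```

Because entities are compared as whole tuples, nested entities that share a start position are still scored independently, and a correct span with a wrong category is penalized in both precision and recall.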

