Extracting named entities from clinical free text presents unique challenges, particularly for discontinuous entities, that is, mentions whose parts are separated by unrelated words. Traditional named entity recognition (NER) methods often struggle to identify these entities accurately, which has prompted the development of specialised computational solutions. This paper systematically reviews the methodologies developed for Discontinuous Named Entity Recognition in clinical texts, highlighting their effectiveness and the challenges they face. We conducted a systematic literature review focused on discontinuous named entities, using structured searches across four computer science-related electronic databases. A combination of search terms, grouped into three synonym categories (problem, entity/approach, and task), yielded 2,442 articles. Guided by our research objectives, we identified five key dimensions along which to annotate and normalise the data for analysis. The review included 44 studies, which were coded across these five dimensions: the chronological development of approaches, the corpora used, the downstream tasks affected by discontinuous named entities, the methodological approaches proposed to address the issue, and the reported performance outcomes. The discussion section examines the challenges encountered in this area and suggests potential directions for future research. Significant progress has been made in discontinuous named entity recognition; however, there remains a need for more adaptable, generalisable solutions that do not depend on custom annotation schemes. Exploring various configurations of generative language models is a promising avenue for advancing this area. Additionally, future research should investigate how precise versus imprecise recognition of discontinuous entities affects clinical downstream tasks, to better understand its practical implications for healthcare applications.