Abstract
We investigated the limitations of conventional named entity recognition (NER) and entity linking (EL) methods in extracting patient condition information from medical texts, focusing on the challenges posed by non-contiguous spans and the resulting information loss. Using a corpus with entity-relation annotations, we analyzed the frequency and nature of non-contiguous spans whose gaps contain irrelevant entities. We further analyzed the corpus to identify the entity types most often associated with peripheral spans (those that do not include the central symptom-describing terms), with a focus on items, body parts, and clinical tests. Our analysis showed that 18.6% of patient condition expressions were non-contiguous spans containing irrelevant entities, which implies a worst-case accuracy ceiling of 81.4% for conventional NER and EL approaches. Entity types such as items, body parts, and clinical tests play an important role in these expressions, so conventional extraction methods that miss them incur considerable information loss. These findings underscore the need for information extraction techniques that can handle the complexities of medical texts, including non-contiguous spans. Adapting methods that allow gaps within entities, or employing graph-based term assignments, can improve the accuracy and comprehensiveness of medical text annotation.
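As an illustration of the problem described above, the following minimal Python sketch represents a patient condition expression as a set of token-index fragments and checks whether an unrelated entity falls in the gap between them. The `Span` class, the helper names, and the example sentence are hypothetical and are not taken from the study's corpus or annotation scheme. Only the 18.6% figure comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive token index
    end: int    # exclusive token index
    label: str  # hypothetical entity type name


def gap_ranges(fragments):
    """Return the (start, end) token ranges between consecutive fragments."""
    ordered = sorted(fragments, key=lambda s: s.start)
    return [(a.end, b.start) for a, b in zip(ordered, ordered[1:]) if b.start > a.end]


def has_irrelevant_entity_in_gap(fragments, other_entities):
    """True if any entity outside the expression overlaps one of its gaps."""
    return any(
        e.start < gap_end and e.end > gap_start
        for gap_start, gap_end in gap_ranges(fragments)
        for e in other_entities
    )


def accuracy_ceiling(noncontiguous_share_percent):
    """Worst-case ceiling if every such expression is extracted incorrectly."""
    return round(100.0 - noncontiguous_share_percent, 1)


# Invented example sentence, not from the corpus.
tokens = ["swelling", "of", "the", "left", "knee", "and", "right", "ankle"]

# The expression "swelling ... right ankle" is split into two fragments.
expression = [Span(0, 1, "symptom"), Span(6, 8, "body_part")]

# "left knee" belongs to a different expression but sits inside the gap.
others = [Span(3, 5, "body_part")]

gaps = gap_ranges(expression)
contains_irrelevant = has_irrelevant_entity_in_gap(expression, others)
ceiling = accuracy_ceiling(18.6)  # 18.6 is the share reported in the abstract
```

A contiguous-span tagger can only label "swelling" and "right ankle" as one entity by also covering "left knee", which is why expressions of this kind bound the achievable accuracy.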