A supervised framework for resolving coreference in clinical records

Sanda M Harabagiu,Bryan Rink,Kirk Roberts

doi:10.1136/amiajnl-2012-000810

Abstract

A method for the automatic resolution of coreference between medical concepts in clinical records. A multiple pass sieve approach utilizing support vector machines (SVMs) at each pass was used to resolve coreference. Information such as lexical similarity, recency of a concept mention, synonymy based on Wikipedia redirects, and local lexical context were used to inform the method. Results were evaluated using an unweighted average of MUC, CEAF, and B(3) coreference evaluation metrics. The datasets used in these research experiments were made available through the 2011 i2b2/VA Shared Task on Coreference. The method achieved an average F score of 0.821 on the ODIE dataset, with a precision of 0.802 and a recall of 0.845. These results compare favorably to the best-performing system with a reported F score of 0.827 on the dataset and the median system F score of 0.800 among the eight teams that participated in the 2011 i2b2/VA Shared Task on Coreference. On the i2b2 dataset, the method achieved an average F score of 0.906, with a precision of 0.895 and a recall of 0.918 compared to the best F score of 0.915 and the median of 0.859 among the 16 participating teams. Post hoc analysis revealed significant performance degradation on pathology reports. The pathology reports were characterized by complex synonymy and very few patient mentions. The use of several simple lexical matching methods had the most impact on achieving competitive performance on the task of coreference resolution. Moreover, the ability to detect patients in electronic medical records helped to improve coreference resolution more than other linguistic analysis.

Full Text