Background: Electronic Clinical Narratives (ECNs) store valuable individual health information. However, little open-source data is available. Moreover, ECNs can be structurally heterogeneous, ranging from documents with explicit section headings or titles to fully unstructured notes. This lack of structure complicates both the building of automatic systems and their evaluation.

Objective: The aim of the present work is to provide the scientific community with an open-source Spanish dataset for building and evaluating automatic section identification systems. Alongside this dataset, the goal is to design and implement a suitable evaluation measure and a fine-tuned language model adapted to the task.

Materials and Methods: A corpus of unstructured clinical records, namely progress notes written in Spanish, was annotated with seven major section types. Existing metrics for the task were thoroughly assessed and, building on the most suitable one, we defined a new B2 metric better tailored to the task.

Results: The annotated corpus, the new evaluation script, and a baseline model are freely available to the community. The model reaches an average B2 score of 71.3 on our open-source dataset and an average B2 score of 67.0 in data-scarcity scenarios, where the target corpus and its structure differ from the dataset used to train the language model.

Conclusion: Although section identification in unstructured clinical narratives is challenging, this work shows that competitive automatic systems can be built when both data and the right evaluation metrics are available. The annotated data, the evaluation scripts, and the section identification language model are open-sourced in the hope that this contribution will foster the development of more and better systems.