The growing adoption of electronic health records (EHRs) in healthcare has produced large volumes of unstructured clinical text. This paper proposes a model for classifying the sentences of free-text Portuguese clinical notes according to the SOAP (Subjective, Objective, Assessment, and Plan) note standard using domain-specific pre-trained language models. Among the five pre-trained BERT models tested, BioBERTptRT achieved the best results, with a precision of 0.9461, accuracy of 0.9434, recall of 0.9437, and F1-score of 0.9435. BioBERTptRT, specialized in the Portuguese language, clinical terminology, and the medical group's domain, outperformed the other models, with a 0.28% F1-score gain over the second most specialized model. The proposed model performs high-level sentence classification rather than entity-level classification, aiming to structure clinical notes at a broader level. The study uses a private database of 10,000 anonymized health records containing 234,673 clinical notes, which were split into 1,183,345 unique sentences used to retrain BioBERTptRT; an additional 100,021 sentences were manually labeled for fine-tuning the models. This work contributes to the structuring of clinical notes by quantifying the performance gains obtained through domain specialization of BERT networks. It also analyzes the gains of BioBERTpt over mBERT, both in our study and in related work, and compares the distribution of sentences and the results of this research with those of similar studies conducted in English.
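The abstract reports precision, recall, and F1-score for the four-class SOAP sentence-classification task. As a minimal illustration of how such multi-class metrics can be computed, the sketch below macro-averages per-class precision, recall, and F1 over toy predictions; the labels and data are hypothetical, and the abstract does not specify which averaging scheme (macro, micro, or weighted) the paper actually uses.

```python
SOAP_LABELS = ["Subjective", "Objective", "Assessment", "Plan"]

def macro_prf1(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F1 over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical example: 8 sentences, two per SOAP class, with two misclassifications.
y_true = ["Subjective", "Subjective", "Objective", "Objective",
          "Assessment", "Assessment", "Plan", "Plan"]
y_pred = ["Subjective", "Objective", "Objective", "Objective",
          "Assessment", "Plan", "Plan", "Plan"]

p, r, f1 = macro_prf1(y_true, y_pred, SOAP_LABELS)
```

In practice, a library such as scikit-learn's `precision_recall_fscore_support` would typically be used instead of a hand-rolled computation; the explicit loop here only makes the per-class counting visible.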