Abstract

This study aims to achieve unsupervised term discovery in Chinese electronic health records (EHRs) by means of word segmentation. Existing supervised algorithms perform poorly on EHRs because annotated medical data are scarce. We propose an unsupervised segmentation method (GTS) based on graph partitioning, whose multi-granular segmentation capability supports efficient term discovery. A sentence is converted to an undirected graph with edge weights derived from n-gram statistics, and ratio cut is used to split the sentence into words. The graph partition is solved efficiently via dynamic programming, and multi-granularity is obtained by varying the number of partitions. A BERT-based discriminator is trained on generated samples to verify the correctness of the word boundaries. Words that are not recorded in existing dictionaries are retained as potential medical terms. We compared GTS with mature segmentation systems on both word segmentation and term discovery. MD students manually segmented Chinese EHRs at fine and coarse granularity levels and reviewed the term discovery results. The proposed unsupervised method outperformed all competing algorithms in the word segmentation task. In term discovery, GTS outperformed the best baseline by 17 percentage points on F1-score (a 47% relative improvement). In the absence of annotated training data, the graph partition technique can effectively use corpus statistics and even expert knowledge to achieve unsupervised word segmentation of EHRs. Multi-granular segmentation can provide potential medical terms of various lengths with high accuracy.
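
To make the partitioning step concrete, the following is a minimal sketch of ratio-cut segmentation solved by dynamic programming. It is not the GTS implementation. It assumes that the characters of a sentence are the graph nodes, that words are contiguous character spans, and that a symmetric weight matrix is already available. The function names `ratio_cut_boundaries` and `segment` are hypothetical, and the toy weights are hand-picked for illustration rather than derived from n-gram statistics. Because the ratio-cut objective is a sum of one term per segment (the weight of edges leaving the segment divided by the segment length), the best split into k segments can be built from the best splits into k - 1 segments.

```python
from typing import List, Sequence


def ratio_cut_boundaries(weights: Sequence[Sequence[float]], k: int) -> List[int]:
    """Split nodes 0..n-1 into k contiguous segments minimising the ratio cut.

    Returns the exclusive end index of each segment, in order.
    `weights` is assumed to be a symmetric n x n matrix of edge weights.
    """
    n = len(weights)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of nodes")

    def cut(i: int, j: int) -> float:
        # Total weight of edges between segment [i, j) and all other nodes.
        return sum(
            weights[a][b]
            for a in range(i, j)
            for b in range(n)
            if b < i or b >= j
        )

    # cost[i][j]: ratio-cut term of segment [i, j), i.e. cut / segment length.
    cost = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            cost[i][j] = cut(i, j) / (j - i)

    inf = float("inf")
    # dp[m][j]: minimal cost of splitting the first j nodes into m segments.
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = dp[m - 1][i] + cost[i][j]
                if candidate < dp[m][j]:
                    dp[m][j] = candidate
                    back[m][j] = i

    # Walk the back-pointers from the end of the sentence to recover the ends.
    ends = []
    j = n
    for m in range(k, 0, -1):
        ends.append(j)
        j = back[m][j]
    return ends[::-1]


def segment(sentence: str, weights: Sequence[Sequence[float]], k: int) -> List[str]:
    """Cut `sentence` into k words using the boundaries found above."""
    ends = ratio_cut_boundaries(weights, k)
    starts = [0] + ends[:-1]
    return [sentence[s:e] for s, e in zip(starts, ends)]


# Toy example with hand-picked weights (an assumption, not corpus statistics):
# strong links inside "头痛" and "发热", a weak link between them.
toy_sentence = "头痛发热"
toy_weights = [
    [0, 5, 0, 0],
    [5, 0, 1, 0],
    [0, 1, 0, 5],
    [0, 0, 5, 0],
]
coarse = segment(toy_sentence, toy_weights, 2)
fine = segment(toy_sentence, toy_weights, 4)
```

In this sketch, changing `k` is what yields different granularities: `k = 2` splits the toy sentence at the weak link, while `k = 4` isolates every character. The cost table is computed naively for readability; a practical version would use prefix sums over the weight matrix.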
