Abstract
Medical text classification assigns medical-related text to categories such as topics or disease types. Machine learning techniques have been widely used for such tasks despite an obvious drawback of the “black box” approach: it leaves no easy way to fine-tune the resulting model for better performance. We propose a novel constructive heuristic approach that generates a set of regular expressions that can be used as effective text classifiers. The main innovation of our approach is a regular expression based text classifier that offers both satisfactory classification performance and excellent interpretability. We evaluate our framework on real-world medical data provided by our collaborator, one of the largest online healthcare providers in the market, and observe high and consistent performance. Experimental results show that the machine-generated regular expressions can be used effectively in conjunction with machine learning techniques to perform medical text classification tasks. The proposed methodology improves the performance of baseline methods (Naive Bayes and Support Vector Machines) by 9% in precision and 4.5% in recall. We also evaluate regular expressions modified by human experts and demonstrate the method's potential for practical applications.
Highlights
Despite the popularity of Electronic Medical Record systems, a large amount of unstructured text data remains in the medical domain
The regex-based classifier narrows the gap between the macro and micro F0.5 scores given by the Naive Bayes (NB) and Support Vector Machines (SVM) models, indicating that the regular expressions improve performance on the classes with fewer samples, on which machine learning models generally perform poorly
Regular expressions have long been used for text processing because of their expressiveness and flexibility
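The gap between macro and micro F0.5 mentioned in the highlights can be illustrated with a small calculation. The following is a minimal sketch, not the evaluation code used in the paper: the function names and the toy labels ("flu", "rare") are assumptions made for illustration. Macro averaging gives every class equal weight, while micro averaging pools the counts over all samples, so a poorly served minority class pulls the macro score down much more than the micro score.

```python
from collections import Counter


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from raw counts; returns 0.0 when the score is undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def macro_micro_f_beta(y_true, y_pred, beta=0.5):
    """Return (macro, micro) F-beta for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Macro: average the per-class scores, each class weighted equally.
    macro = sum(f_beta(tp[c], fp[c], fn[c], beta) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one score.
    micro = f_beta(sum(tp.values()), sum(fp.values()), sum(fn.values()), beta)
    return macro, micro


# Toy data (hypothetical): a classifier that ignores the minority class.
y_true = ["flu"] * 8 + ["rare"] * 2
y_pred = ["flu"] * 10
macro, micro = macro_micro_f_beta(y_true, y_pred)
print(f"macro F0.5 = {macro:.3f}, micro F0.5 = {micro:.3f}")
```

In this toy case the minority class scores zero, so the macro score falls well below the micro score; improving the small class is what narrows the gap.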
Summary
Despite the popularity of Electronic Medical Record systems, a large amount of unstructured text data remains in the medical domain. Colloquial (spoken-style) expressions of medical terms are difficult to process with natural language processing (NLP) tools developed for ordinary text [7]. To address these issues, we investigate an automated regular expression generation method to classify medical texts in order to provide informative and comprehensive human-like medical guidance. Medical text classification approaches should aim to achieve better performance (in terms of precision and recall, for example) and at the same time allow human experts to modify the solutions for even better results. Our regular expression based system is transparent and interpretable, so domain experts can make further modifications, whereas a system built on sophisticated, hard-to-understand machine learning techniques may require additional effort to achieve this goal.
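To make the interpretability argument concrete, the following is a minimal sketch of a regular expression based classifier with an optional fallback to another model. It is not the generation method proposed in the paper: the patterns are hand-written, the class names are hypothetical, and the rule-first, fallback-second combination is only one assumed way of using regular expressions in conjunction with a machine learning model.

```python
import re

# Hypothetical, hand-written patterns for illustration only; in the paper the
# expressions are machine-generated and may then be refined by experts.
RULES = {
    "dermatology": [re.compile(r"\b(rash|itch(y|ing)?|eczema)\b", re.I)],
    "cardiology": [
        re.compile(r"\bchest (pain|tightness)\b", re.I),
        re.compile(r"\bpalpitations?\b", re.I),
    ],
}


def classify(text, rules=RULES, fallback=None):
    """Return the first class whose patterns match, else defer to fallback.

    `fallback` is any callable mapping text to a label, for example a
    trained statistical model (an assumption of this sketch).
    """
    for label, patterns in rules.items():
        if any(p.search(text) for p in patterns):
            return label
    return fallback(text) if fallback else None


print(classify("My skin is itchy and red"))
print(classify("I have chest tightness after running"))
print(classify("I have a headache", fallback=lambda t: "other"))
```

Because each rule is a readable pattern, a domain expert can inspect why a text received a label and edit or add an expression directly, which is the kind of modification a black-box model does not readily allow.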