Abstract
Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features, such as part-of-speech, word shape, and dictionary features. Current machine learning-based methods usually rely on singleton features, possibly because combining singleton features into conjunction features causes an explosion in the number of features and introduces many noisy ones. However, a singleton feature captures only one linguistic characteristic of a word, which is not sufficient for DNR when multiple characteristics should be considered together. In this study, we explore feature conjunction and feature selection for DNR, which have not previously been reported for this task. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, Chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features, and that our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.
Highlights
Drug name recognition (DNR), which recognizes pharmacological substances from biomedical texts and classifies them into predefined categories, is an essential prerequisite step for drug information extraction such as drug-drug interactions [1]
We investigate the effectiveness of conjunction features, built from multiple kinds of singleton features, for machine learning-based DNR systems
To investigate the effectiveness of feature conjunction and feature selection on DNR, we start with a system that uses only singleton features, then apply feature conjunction and feature selection in succession
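The pipeline described above (singleton features, then conjunction, then selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (conjoin, chi_square, top_k), the two example feature types, and the four-token toy data are all hypothetical, and pairwise conjunction is assumed only for illustration, since the text does not say which two combination schemes were used. Chi-square is one of the three selection measures named in the abstract.

```python
from collections import Counter
from itertools import combinations


def conjoin(singletons):
    """Return one token's singleton features plus all pairwise conjunctions.

    Pairwise conjunction is an assumption made for this sketch.
    """
    single = [f"{name}={value}" for name, value in sorted(singletons.items())]
    pairs = [f"{a}&{b}" for a, b in combinations(single, 2)]
    return single + pairs


def chi_square(samples, labels, target):
    """Score each feature against one class using the 2x2 chi-square statistic."""
    n = len(samples)
    n_pos = sum(1 for y in labels if y == target)
    with_feat = Counter()      # tokens containing the feature
    with_feat_pos = Counter()  # tokens containing the feature and labelled `target`
    for feats, y in zip(samples, labels):
        for f in set(feats):
            with_feat[f] += 1
            if y == target:
                with_feat_pos[f] += 1
    scores = {}
    for f, total in with_feat.items():
        a = with_feat_pos[f]   # feature present, class is target
        b = total - a          # feature present, other class
        c = n_pos - a          # feature absent, class is target
        d = n - n_pos - b      # feature absent, other class
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[f] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_k(scores, k):
    """Keep the k highest-scoring features (ties broken by name)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Invented toy data: only the token that is both capitalised and in the
# dictionary is a drug name, so neither singleton alone identifies it.
tokens = [
    {"shape": "Xx", "in_dict": "yes"},
    {"shape": "Xx", "in_dict": "no"},
    {"shape": "x", "in_dict": "yes"},
    {"shape": "x", "in_dict": "no"},
]
labels = ["DRUG", "O", "O", "O"]

samples = [conjoin(t) for t in tokens]
scores = chi_square(samples, labels, target="DRUG")
selected = top_k(scores, k=1)
```

On this toy data the conjunction of the two characteristics scores higher than either singleton, which is the motivation given in the abstract for combining features; the selection step then keeps the feature set at a moderate size.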
Summary
Drug name recognition (DNR), which recognizes pharmacological substances from biomedical texts and classifies them into predefined categories, is an essential prerequisite step for drug information extraction such as drug-drug interactions [1]. Drug names may contain a number of symbols mixed with common words, for example, “N-[N-(3,5-difluorophenacetyl)-L-alanyl]-S-phenylglycine t-butyl ester”. The ways of naming drugs also vary greatly. The drug “valdecoxib” has the brand name “Bextra,” while its systematic International Union of Pure and Applied Chemistry (IUPAC) name is “4-(5-methyl-3-phenylisoxazol-4-yl)benzenesulfonamide”. Because some pharmacological terms are ambiguous, it is not trivial to determine whether a substance should be considered a drug. For example, “insulin” is a hormone produced by the pancreas, but it can also be synthesized artificially and used as a drug to treat diabetes.
More From: Computational and Mathematical Methods in Medicine