Abstract
The rapid growth of the chemical science literature poses challenges for researchers in search and data analysis. Much current research focuses on extracting information from this text and putting the resulting knowledge to use. However, existing Chinese word segmentation methods have a low recognition rate for chemical terms, because the literature contains many newly coined words and specialized vocabulary that mixes Chinese and English. In this paper, we propose a word segmentation method for Chinese chemical literature based on a hybrid feature fusion learning model (HFFLM). HFFLM first establishes a chemical science corpus (Chem-pku) for training the Chinese word segmentation (CWS) task. It then uses BiLSTM and CNN to extract document features and fuses them. Next, HFFLM combines boundary features to construct a conditional random field and trains an end-to-end CWS model. Finally, HFFLM provides a visual analysis of the word segmentation results. The experimental results indicate that HFFLM achieves high accuracy and recall, and is suitable for extracting chemical industry vocabulary that mixes Chinese and English.
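To make the task concrete, CWS models that end in a conditional random field usually treat segmentation as per-character sequence labeling. The sketch below assumes the common B/M/E/S tag scheme (begin, middle, end, single) and assumes that English letters are tagged character by character like Chinese characters; neither choice is stated in the abstract, and the helper names `words_to_tags` and `tags_to_words` are hypothetical.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one B/M/E/S tag per character.

    Assumption: the B/M/E/S scheme; the abstract does not name the tag set.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if current:  # flush an unfinished word left by a bad tag sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        else:  # "E" closes the current word
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words


# A mixed Chinese-English phrase: "polyethylene", "of", "pH", "value".
gold_words = ["聚乙烯", "的", "pH", "值"]
tags = words_to_tags(gold_words)
restored = tags_to_words(list("".join(gold_words)), tags)
```

A model of the kind described would predict the tag sequence from the fused features; the conversion back to words is independent of which model produced the tags.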
Highlights
In recent years, scientific research in the chemical industry has focused on monitoring data generated in the chemical production process, such as raw material data parameters, manufacturing process parameters, equipment electromechanical parameters, and abnormal diagnostic parameters [1].
This article first introduces the research significance of, and related work on, word segmentation of chemical science literature. It then describes the construction of the hybrid feature fusion model and the word segmentation process for Chinese chemical industry literature. Microsoft Research Asia's MSR corpus and the custom chemical science literature corpus Chempku serve as experimental data, on which Hidden Markov Model (HMM), Conditional Random Fields (CRF), IDCNN_CRF, BiLSTM_CRF, BiLSTM-BiLSTM, BiLSTM-CNN and HFFLM are applied to segment chemical industry literature. Finally, the advantages of the proposed model and future work are analyzed based on the experimental results.
In view of the drawbacks of the above research, we propose a chemical literature knowledge extraction method based on hybrid feature fusion (HFFLM) to extract information from chemical science literature.
Summary
Scientific research in the chemical industry has focused on monitoring data generated in the chemical production process, such as raw material data parameters, manufacturing process parameters, equipment electromechanical parameters, and abnormal diagnostic parameters [1]. To meet the differing needs of chemical experts, relevant information can be extracted from the chemical literature to obtain more meaningful data and to build a specialized search engine for the chemical industry, which is of great help to experts' academic research. This article first introduces the research significance of, and related work on, word segmentation of chemical science literature. It then describes the construction of the hybrid feature fusion model and the word segmentation process for Chinese chemical industry literature. Microsoft Research Asia's MSR corpus and the custom chemical science literature corpus Chempku serve as experimental data, on which HMM, CRF, IDCNN_CRF, BiLSTM_CRF, BiLSTM-BiLSTM, BiLSTM-CNN and HFFLM are applied to segment chemical industry literature. Finally, the advantages of the proposed model and future work are analyzed based on the experimental results.
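Comparing segmenters on a corpus requires a word-level score. The text here does not define its metrics, so the following is only a minimal sketch of the usual span-based precision, recall and F1 for word segmentation, in which a predicted word counts as correct only when both of its boundaries match the gold segmentation. The function names `to_spans` and `segmentation_prf` are hypothetical.

```python
def to_spans(words):
    """Map a segmented sentence to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) for one sentence.

    Assumption: the standard span-matching definition; the source does not
    state how its accuracy and recall figures are computed.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The prediction wrongly splits the term "聚乙烯" (polyethylene) in two.
gold = ["聚乙烯", "的", "pH", "值"]
pred = ["聚", "乙烯", "的", "pH", "值"]
precision, recall, f1 = segmentation_prf(gold, pred)
```

In this example three of the five predicted words match the four gold words, so a single split chemical term lowers both precision and recall, which is why term recognition weighs heavily in such comparisons.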