Abstract
E-government platforms let government departments publish policy texts online and make those policies easier for users to access; keywords in particular help readers grasp a policy quickly. In any given policy text, keywords make up only a small proportion of the text, so keyword extraction can be treated as a classification problem on an imbalanced data set. In this paper, we design an automatic keyword extraction method for policy texts under this imbalance. To that end, we first propose a new ensemble oversampling method for synthesizing new data. We draw samples from the training set using Bagging, and in each sampling round we train a logistic regression model to classify the training set. Using the predicted probabilities as classification confidence, we divide the training set into three regions by three-way decisions (3WD) and apply a different strategy in each region to synthesize new data. For keyword extraction from policy texts, we then conduct a series of experiments with classical supervised and unsupervised methods. The results show that, on both the public data sets and the manually constructed data sets, our sampling method achieves better F-measure and G-mean scores regardless of which supervised machine learning method is used. This also illustrates the advantage of 3WD: different regions receive different strategies for synthesizing new data.
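As a rough illustration of two ideas mentioned above, the following sketch shows how predicted probabilities might be split into three regions and how the two evaluation indexes are commonly computed. It is not the paper's implementation: the thresholds `alpha` and `beta`, the function names, and the choice of the balanced F1 form of the F-measure are all assumptions made for this example, and the per-region synthesis strategies are not shown because the abstract does not specify them.

```python
from math import sqrt


def three_way_regions(probs, alpha=0.7, beta=0.3):
    """Split sample indices into positive, boundary, and negative regions.

    probs: predicted probability of the minority (keyword) class per sample.
    alpha, beta: illustrative thresholds (assumed, not taken from the paper).
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("require 0 <= beta < alpha <= 1")
    regions = {"positive": [], "boundary": [], "negative": []}
    for i, p in enumerate(probs):
        if p >= alpha:
            regions["positive"].append(i)   # confidently minority class
        elif p <= beta:
            regions["negative"].append(i)   # confidently majority class
        else:
            regions["boundary"].append(i)   # uncertain, decision deferred
    return regions


def f_measure_and_g_mean(y_true, y_pred):
    """Return (F1, G-mean) for binary labels, with 1 as the minority class.

    F1 is the harmonic mean of precision and recall (assumed form of the
    F-measure); G-mean is sqrt(recall * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, sqrt(recall * specificity)
```

In a bagging setting, `probs` could for instance be the probabilities produced by the logistic regression model trained in one sampling round; how the rounds are combined is left open here.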