Abstract

With the development of e-government, multiple local governments in China are building Internet-based open policy platforms, and these platforms need to classify policies automatically. Current policy classification methods are usually supervised models that require large numbers of annotated policies, which are expensive and difficult to obtain in practice. To reduce the burden on human experts of annotating many policies, we propose Weak-PMLC, a large-scale framework for multi-label policy classification based on extremely weak supervision, which does not rely on any labeled documents and uses only the label name of each category. Specifically, we first pre-train a language model (LM) on the given dataset to adapt the LM from a general-purpose model to a domain-specific one. We then use the domain-specific LM to generate seed words semantically related to the label names. Finally, guided by these category-related seed words, we generate a large number of pseudo-labeled policies as training data for high-performance supervised models. To verify the effectiveness of the proposed method, we created two new human-labeled datasets containing about 56k and 37k policies, respectively. We also define 59 label names, which are key feature words used to summarize all collected policies. We show that Weak-PMLC achieves F1 scores of around 90% on these two datasets and improves performance by 4% over state-of-the-art weakly supervised methods. Further experiments show that Weak-PMLC is even comparable to some supervised models.
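The final step, turning category-related seed words into pseudo-labeled policies, can be illustrated with a minimal sketch. This is not the paper's implementation: the seed-word lists, the `pseudo_label` function, the English example text, and the `min_hits` threshold are all hypothetical, and simple keyword counting stands in for whatever matching rule the framework actually uses. In the framework itself, the seed words would come from the domain-specific LM rather than being written by hand.

```python
import re
from collections import Counter

# Hypothetical seed words per label name. In Weak-PMLC these would be
# generated by the domain-specific LM; here they are hand-written.
SEED_WORDS = {
    "education": {"school", "teacher", "student", "tuition"},
    "healthcare": {"hospital", "medical", "insurance", "clinic"},
    "employment": {"job", "employment", "wage", "training"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def pseudo_label(text, seed_words=SEED_WORDS, min_hits=2):
    """Assign every label whose seed words occur at least min_hits times.

    A document can receive several labels (multi-label) or none.
    The min_hits threshold is an assumption made for this sketch.
    """
    counts = Counter(tokenize(text))
    labels = []
    for label, seeds in seed_words.items():
        hits = sum(counts[word] for word in seeds)
        if hits >= min_hits:
            labels.append(label)
    return sorted(labels)


doc = (
    "The city will fund teacher training and reduce tuition for every "
    "student, and expand medical insurance at each public hospital."
)
print(pseudo_label(doc))
```

Documents labeled this way could then serve as training data for a supervised multi-label classifier, and documents that match no seed words would simply be left out of the pseudo-labeled set.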
