Weak-PMLC: A large-scale framework for multi-label policy classification based on extremely weak supervision

Jiufeng Zhao, Rui Song, Chitao Yue, Zhenxin Wang, Hao Xu

doi:10.1016/j.ipm.2023.103442

Abstract

With the development of e-government, multiple local governments in China are developing Internet-based open policy platforms, and these online platforms need to classify policies automatically. Current policy classification methods are usually supervised models that require massive amounts of annotated policies, which can be expensive and difficult to obtain in practice. To alleviate the burden on human experts of annotating a large number of policies, we propose a large-scale framework (Weak-PMLC) for multi-label policy classification based on extremely weak supervision, which does not rely on any labeled documents and uses only the label names of each category. Specifically, we first pre-train a language model (LM) on a given dataset to adapt the LM from the general domain to the target domain. We then utilize the domain-specific LM to generate seed words semantically related to the label names. Finally, guided by the category-related seed words, we generate massive pseudo-labeled policies as training data for high-performance supervised models. To verify the effectiveness of our proposed method, we created two new human-labeled datasets containing about 56k and 37k policies, respectively. We also define 59 label names, which are several key feature words used to summarize all collected policies. We show that Weak-PMLC achieves F1-scores of around 90% on these two datasets and improves performance by 4% over state-of-the-art weakly supervised methods. Further experiments show that the proposed Weak-PMLC is even comparable to some supervised models.
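The final step described above, turning category-related seed words into pseudo-labeled policies, can be illustrated with a minimal sketch. It is not the authors' implementation: the seed words, the label names, the whitespace tokenizer, and the `min_hits` threshold are all hypothetical, and simple keyword counting stands in for whatever matching the framework actually uses. In the paper the seed words are generated by the domain-specific LM; here they are hard-coded.

```python
from collections import Counter

# Hypothetical seed words per label (assumption: hand-written for illustration;
# in the paper they are generated from label names by a domain-specific LM).
SEED_WORDS = {
    "tax": {"tax", "deduction", "levy"},
    "education": {"school", "student", "tuition"},
    "housing": {"housing", "rent", "mortgage"},
}


def pseudo_label(text, seed_words=SEED_WORDS, min_hits=1):
    """Return the sorted labels whose seed words occur at least min_hits times."""
    # Naive lowercase whitespace tokenization (assumption; real policies
    # would need a proper tokenizer or word segmenter).
    tokens = Counter(text.lower().split())
    labels = []
    for label, seeds in seed_words.items():
        hits = sum(tokens[word] for word in seeds)
        if hits >= min_hits:
            labels.append(label)
    return sorted(labels)


policies = [
    "Tuition subsidy for student housing and rent support",
    "New tax deduction rules for small enterprises",
    "Notice on road maintenance schedule",
]

# Multi-label pseudo-annotations; a policy may receive zero, one, or several labels.
pseudo_labeled = [(policy, pseudo_label(policy)) for policy in policies]
```

Policies that match no seed word receive an empty label list and would typically be left out of the pseudo-labeled training set; raising `min_hits` trades coverage for cleaner labels.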

Full Text