Abstract

Good governance depends on openness and transparency, which in turn require unimpeded and verifiable access to legal and regulatory information. With such access, we can monitor government actions and check that public financial resources are not used improperly or inconsistently. This helps, for example, to detect unlawful behavior in public procedures such as bidding processes and auctions. However, each public agency adopts its own criteria for the models and formats in which it publishes information, as the varying styles of municipal, state, and union (federal) documents show. In this context, we aim to minimize the effort required to process public documents, notably official gazettes. To this end, we propose a structure-oriented heuristic for extracting relevant excerpts from their texts. We then characterize these excerpts through morphosyntactic analysis and entity recognition. Subsequently, we semantically classify the extracted fragments into "sections of interest" (e.g., bids, laws, personnel, budget) using an active learning strategy to reduce the manual labeling effort. We also improve the classification process by incorporating transformers and stacking, and by combining different types of representations (e.g., frequentist, static, and contextual semantic embeddings). Furthermore, we exploit oversampling based on semi-supervised learning to deal with the scarcity and skewness of labeled data. Finally, we combine all these contributions in a real-time annotation tool with active learning support that achieves 100% accuracy in extraction and an overall accuracy of 85% in classification with very little labeling effort.
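
The abstract does not state which query strategy drives the active learning loop. As a minimal sketch only, assuming least-confidence uncertainty sampling (a common choice, not necessarily the one used here), the step that decides which fragments to send to a human annotator could look like the following. The fragment ids, the probability values, and the function names `least_confidence` and `select_queries` are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_queries(predictions: Dict[str, Sequence[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` fragments the classifier is least sure about.

    Ties are broken by fragment id so the selection is deterministic.
    """
    ranked = sorted(
        predictions,
        key=lambda fid: (-least_confidence(predictions[fid]), fid),
    )
    return ranked[:budget]


# Hypothetical class probabilities over (bids, laws, personnel, budget)
# produced by some classifier for four unlabeled gazette fragments.
predictions = {
    "frag-1": [0.97, 0.01, 0.01, 0.01],
    "frag-2": [0.40, 0.35, 0.15, 0.10],
    "frag-3": [0.55, 0.25, 0.10, 0.10],
    "frag-4": [0.26, 0.25, 0.25, 0.24],
}

# Ask the annotator to label only the two most uncertain fragments.
to_label = select_queries(predictions, budget=2)
print(to_label)
```

In a full loop, the newly labeled fragments would be added to the training set, the classifier retrained, and the selection repeated until the labeling budget is exhausted. Confident predictions such as `frag-1` are never sent for annotation, which is where the saving in labeling effort comes from.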

