DODFMiner: An automated tool for Named Entity Recognition from Official Gazettes

Gabriel M.C Guimarães, Felipe X.B Da Silva, Andrei L Queiroz, Ricardo M Marcacini, Thiago P Faleiros, Vinicius R.P Borges, Luís P.F Garcia

doi:10.1016/j.neucom.2023.127064


https://doi.org/10.1016/j.neucom.2023.127064


Abstract

Official gazettes are documents that governments publish to publicize their actions. They span long periods of time and serve as an important transparency mechanism. These documents contain information on laws, contracts, and bidding processes, as well as on civil servants and their careers in public service. Automatic information extraction from these documents may contribute to public transparency, and two tasks are especially useful: classifying the different segments of these documents, the so-called acts; and Named Entity Recognition (NER) within the acts. Because official gazettes vary in their patterns, different tools must be built for specific gazettes. In this paper, we propose DODFMiner, a command-line interface tool that classifies acts and extracts named entities from the Official Gazette of the Federal District. The tool follows a 3-step approach: pre-processing of the input data; text classification using rule-based systems with regular expressions; and NER with Machine Learning algorithms. It takes JSON files as input and produces CSV as output, providing information that lets users track, among other things, government procurements over the years, contract durations, and total amounts. We also propose a set of experiments to support the choice of models included in the tool, covering the classification and NER steps. Text classification achieved a mean F1-score of 0.778. For NER, we compared 3 architectures: CRF, with a mean F1-score of 0.851; CNN-biLSTM-CRF, with 0.787; and CNN-CNN-LSTM, with 0.841.
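The 3-step flow described in the abstract (pre-processing, regex-based classification, entity extraction, JSON in and CSV out) can be illustrated with a minimal Python sketch. This is not DODFMiner's code or API. The function names, the act labels, the regular expressions, and the JSON layout (a list of objects with a `text` field) are all assumptions made for illustration, and a simple regex stands in for the trained NER model.

```python
import csv
import io
import json
import re

# Hypothetical act types and patterns, invented for this example only;
# they are not the rules shipped with DODFMiner.
ACT_PATTERNS = {
    "contract": re.compile(r"\bEXTRATO\s+D[EO]\s+CONTRATO\b", re.IGNORECASE),
    "bidding": re.compile(r"\bAVISO\s+DE\s+LICITA[CÇ][AÃ]O\b", re.IGNORECASE),
}


def preprocess(text):
    """Step 1: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def classify(text):
    """Step 2: rule-based classification, first matching pattern wins."""
    for label, pattern in ACT_PATTERNS.items():
        if pattern.search(text):
            return label
    return "unknown"


def extract_entities(text):
    """Step 3: placeholder for NER. A regex for a monetary amount stands in
    for a trained sequence model such as a CRF."""
    match = re.search(r"R\$\s*([\d.]+,\d{2})", text)
    return {"amount": match.group(1) if match else ""}


def run_pipeline(json_text):
    """Read a JSON list of {"text": ...} segments and return CSV text."""
    segments = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["act_type", "amount", "text"], lineterminator="\n"
    )
    writer.writeheader()
    for segment in segments:
        text = preprocess(segment["text"])
        row = {"act_type": classify(text), "text": text}
        row.update(extract_entities(text))
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = json.dumps(
        [{"text": "EXTRATO  DO CONTRATO\nValor: R$ 1.234,56"}, {"text": "Outro texto"}]
    )
    print(run_pipeline(sample))
```

Keeping each step as a separate function mirrors the staged design the abstract describes: the regex placeholder in `extract_entities` could be swapped for a trained model without touching the classification or I/O code.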

Full Text