A Graph Database Representation of Portuguese Criminal-Related Documents

Gonçalo Carnaz, Vitor Beires Nogueira, Mário Antunes

doi:10.3390/informatics8020037

Abstract

Organizations are challenged by the need to process an increasing amount of data, both structured and unstructured, retrieved from heterogeneous sources. Criminal investigation police are among these organizations, as they have to manually process a vast number of criminal reports, news articles related to crimes, occurrence and evidence reports, and other unstructured documents. Automatic extraction and representation of the data and knowledge in such documents is essential to reduce the burden of manual analysis and to automate the discovery of names and entity relationships that may exist in a case. This paper presents SEMCrime, a framework used to extract and classify named entities and relations in Portuguese criminal reports and documents, and to represent the retrieved data in a graph database. A 5W1H (Who, What, Why, Where, When, and How) information extraction method was applied, and a graph database representation was used to store and visualize the relations extracted from the documents. Promising results were obtained with a prototype developed to evaluate the framework, namely named-entity recognition with an F-Measure of 0.73 and 5W1H information extraction with an F-Measure of 0.65.
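
To make the idea of storing 5W1H output as a graph more concrete, the following is a minimal sketch in Python, not the SEMCrime implementation. It assumes a hypothetical `FiveW1H` record and a hypothetical `to_edges` helper that turn one extracted event into (source, relation, target) edges of the kind a graph database could hold. The field names, relation labels, and sample values are all invented for illustration.

```python
from dataclasses import dataclass


# Hypothetical 5W1H record; the field names are illustrative only and do
# not reflect the schema used by the paper's prototype.
@dataclass(frozen=True)
class FiveW1H:
    who: str
    what: str
    where: str
    when: str
    why: str
    how: str


def to_edges(doc_id, record):
    """Turn one 5W1H record into (source, relation, target) edges.

    The event node is keyed by the document id and the 'what' slot.
    Every other non-empty slot becomes one labelled edge leaving it.
    """
    event = f"{doc_id}:{record.what}"
    slots = {
        "WHO": record.who,
        "WHERE": record.where,
        "WHEN": record.when,
        "WHY": record.why,
        "HOW": record.how,
    }
    # Empty slots are skipped so the graph holds no placeholder nodes.
    return [(event, rel, value) for rel, value in slots.items() if value]


# Invented sample values, not taken from the paper's dataset.
record = FiveW1H(
    who="suspect A",
    what="robbery",
    where="Lisboa",
    when="2020-01-15",
    why="",
    how="armed",
)
edges = to_edges("doc1", record)
for edge in edges:
    print(edge)
```

Each tuple could then be written to a graph store as two nodes joined by a typed relationship. Keeping the event as the hub node is one possible design choice; the paper's actual graph model may differ.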

Highlights

Input: a set of documents that are retrieved from police departments and open sources, in Portable Document Format (.pdf), Microsoft Word (.doc) and HTML format; Document preprocessing: enables a set of tasks for document processing and Natural Language Processing; Graph database representation: enables the semantic understanding of data retrieved using Named Entity Recognition (NER), Criminal-Term Extraction, Semantic Role Labelling (SRL), and 5W1H information extraction methods
For NER evaluation, manual annotation was performed on a set of criminal-related documents, identifying and classifying the named entities and entity types in each sentence
The focus is on the Portuguese language, without discarding what has been done in other languages; The approaches applied to the criminal domain and related works were studied and analyzed; A survey of existing ETL, NLP, and graph database approaches was made and, for each one, a list was presented with the features that can be proposed, used, or adapted; The SEMCrime framework addresses an emerging and ambitious challenge, the processing of unstructured Portuguese criminal report files, mainly because it is applied to a domain without a solid background or relevant work related to the Portuguese language, despite the works already published for other languages such as English
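
The NER evaluation mentioned in the highlights compares system output with manual annotation and reports an F-Measure. As a rough sketch of how such a score can be computed, the Python snippet below uses exact matching on (sentence id, entity text, entity type) triples. The matching rule, the `f_measure` helper, the entity type labels, and the sample entities are assumptions made for illustration; the paper may use a different matching scheme.

```python
def f_measure(gold, predicted):
    """Return (precision, recall, F1) for two collections of entities.

    Each entity is a (sentence_id, text, entity_type) triple, and a
    prediction counts as correct only if all three fields match.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented sample annotations, not taken from the paper's dataset.
gold = {
    (0, "Lisboa", "LOCATION"),
    (0, "João Silva", "PERSON"),
    (1, "PSP", "ORGANIZATION"),
}
predicted = {
    (0, "Lisboa", "LOCATION"),
    (0, "João Silva", "ORGANIZATION"),  # right span, wrong type
    (1, "PSP", "ORGANIZATION"),
}

precision, recall, f1 = f_measure(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case two of the three predictions match the manual annotation, so precision, recall, and F1 all equal 2/3.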

Summary

Introduction

Informatics 2021, 8, 37.

A systematic approach that ties together the criminal investigation and computer science domains, focused on the analysis of criminal-related documents in the Portuguese language; An end-to-end framework that covers several phases, ranging from data extraction to knowledge representation in a graph database. These phases can be summarized as follows: Input: a set of documents retrieved from police departments and open sources (online news about crimes), in Portable Document Format (.pdf), Microsoft Word (.doc), and HTML format; Document preprocessing: enables a set of tasks for document processing and Natural Language Processing; Graph database representation: enables the semantic understanding of the retrieved data using Named Entity Recognition (NER), Criminal-Term Extraction, Semantic Role Labelling (SRL), and 5W1H information extraction methods. A dataset built from a set of documents, such as police reports, criminal news, and PGdLisboa (Procuradoria-Geral Distrital de Lisboa, in English: District Attorney of Lisbon) news

Literature Review

Summary

SEMCrime Framework

Criminal-Related Documents

Preprocessing Criminal-Related Documents

Neo4j Criminal-Related Documents Representation

NER Module

Criminal Term Extraction Module

Semantic Role Labeling Module

Graph Database Population and Enrichment

Implementation and Results

Conclusions and Future Work

Journal: Informatics	Publication Date: Jun 4, 2021
Citations: 6	License type: CC BY 4.0
