Abstract

The automation and digitalization of business processes have increased the need for efficient information extraction from business documents. However, financial and legal documents are often not used effectively by text processing or machine learning systems, partly because they contain sensitive information, which restricts their use beyond authorized parties and purposes. To overcome this limitation, we develop an anonymization method for German financial and legal documents using state-of-the-art natural language processing methods based on recurrent neural networks and transformer architectures. We present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.

Highlights

  • The automatic processing of text documents has become of vital importance in several industrial applications

  • The evaluation on financial documents suggests that the RNN+conditional random field (CRF) model achieves the best performance, with over 97% recall without post-processing and around 99% with post-processing, while keeping precision above 90%

  • The results suggest that RNN classifiers using a general language model perform better than those using one trained only on financial documents, which is expected since the sentences in GermEval come from a variety of sources

Introduction

The automatic processing of text documents has become of vital importance in several industrial applications. The availability of digital financial and legal documents is increasing, and companies rely on automated methods for handling and analysis, often based on or assisted by machine learning tools. Developing such tools usually requires researchers and developers to have access to documents as part of data exploration or the model training pipeline. Such financial data typically cannot be processed or shared beyond authorized parties because it contains sensitive information about specific individuals and organizations, which significantly restricts development even within the organization. Once locations, dates and other entities that make the inference of personal information possible are removed, one is left with a document that is safe to distribute but still retains the original structure and language, leaving it suitable for analysis, training and prediction.
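The idea of removing sensitive entities while keeping the surrounding structure and language can be illustrated with a minimal sketch. It assumes that entity spans (character offsets plus a label) have already been produced by some named-entity tagger; the function name `anonymize`, the label set (`PER`, `DATE`, `ORG`, `LOC`), the placeholder format and the example sentence are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# (start, end, label) character offsets. In a real system these would come
# from an NER model; here they are supplied by hand (assumption).
Span = Tuple[int, int, str]


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each entity span with a placeholder such as <PER>.

    Spans are processed left to right; a span that overlaps an already
    replaced one is skipped, so the output never duplicates text.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # overlaps a span that was already replaced
        pieces.append(text[cursor:start])  # keep non-sensitive text unchanged
        pieces.append(f"<{label}>")        # hypothetical placeholder format
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example sentence with hand-labelled entities (not from the paper).
example = "Anna Schmidt überwies am 3. Mai 2019 Geld an die Muster GmbH in Bonn."
entities = [
    ("Anna Schmidt", "PER"),
    ("3. Mai 2019", "DATE"),
    ("Muster GmbH", "ORG"),
    ("Bonn", "LOC"),
]
# Derive character offsets by string search, standing in for tagger output.
spans = [(example.index(p), example.index(p) + len(p), lab) for p, lab in entities]

print(anonymize(example, spans))
```

The sentence keeps its grammar and word order, so the output remains usable for analysis or training, while the entities that could identify a person or organization are replaced by their type.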
