Anonymization of German financial documents using neural network-based language models with contextual word representations

David Biesner, Christian Bauckhage, Max Lübbering, Rajkumar Ramamurthy, Maren Pielka, Lars Hillebrand, Rafet Sifa, Anna Ladi, Rüdiger Loitz, Robin Stenzel

doi:10.1007/s41060-021-00285-x

Abstract

The automatization and digitalization of business processes have increased the need for efficient information extraction from business documents. However, financial and legal documents are often not utilized effectively by text processing or machine learning systems, partly because they contain sensitive information, which restricts their usage beyond authorized parties and purposes. To overcome this limitation, we develop an anonymization method for German financial and legal documents using state-of-the-art natural language processing methods based on recurrent neural nets and transformer architectures. We present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.

Highlights

The automatic processing of text documents has become vitally important in several industrial applications.
The evaluation on financial documents suggests that the RNN + conditional random field (CRF) model achieves the best performance, with over 97% recall without post-processing and around 99% after post-processing, while maintaining a precision of over 90%.
The results suggest that RNN classifiers using a general language model perform better than those using a language model trained only on financial documents, which is expected since the sentences in GermEval come from a variety of sources.
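The recall and precision figures in the highlights can be read with the following minimal sketch in mind. It assumes entities are compared as exact (start, end, label) spans; the evaluation protocol actually used in the paper is not described on this page, and the function name and example spans are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span format: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def precision_recall(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only: three annotated entities, four predictions,
# one of which is a false alarm.
gold = {(5, 19, "PER"), (24, 28, "LOC"), (47, 57, "DATE")}
predicted = gold | {(0, 4, "PER")}

precision, recall = precision_recall(gold, predicted)
print(precision, recall)  # 0.75 1.0
```

For anonymization, recall is the more critical of the two numbers: a missed entity leaks sensitive information, while a false alarm only removes a harmless token.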

Summary

Introduction

The automatic processing of text documents has become vitally important in several industrial applications. The availability of digital financial and legal documents is increasing, and companies rely on automated methods for handling and analysis, often based on or assisted by machine learning tools. Developing such tools usually requires researchers and developers to have access to documents as part of data exploration or the model training pipeline. Such financial data typically cannot be processed or shared beyond authorized parties because it contains sensitive information about specific individuals and organizations, which significantly restricts development even within the organization. Once locations, dates and other entities that make the inference of personal information possible are removed, one is left with a document that is safe to distribute but still contains the original structure and language, leaving it suitable for analysis, training and prediction.
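The idea of removing sensitive entities while keeping the surrounding text intact can be illustrated with a minimal sketch. It assumes the entity spans have already been detected by some sequence tagger; the helper names, labels, and example sentence are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def find_span(text: str, surface: str, label: str) -> Span:
    """Stand-in for a tagger: locate a known surface form in the text."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a placeholder such as [PER]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans overlapping one already replaced
        pieces.append(text[cursor:start])  # keep non-sensitive text as-is
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sentence = "Herr Max Mustermann aus Bonn unterzeichnete am 02.10.2021."
spans = [
    find_span(sentence, "Max Mustermann", "PER"),
    find_span(sentence, "Bonn", "LOC"),
    find_span(sentence, "02.10.2021", "DATE"),
]
print(anonymize(sentence, spans))
# Herr [PER] aus [LOC] unterzeichnete am [DATE].
```

Replacing spans with typed placeholders rather than deleting them keeps the sentence structure readable, which is what makes the anonymized text usable for later analysis and training.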

Journal: International Journal of Data Science and Analytics
Publication Date: Oct 2, 2021
Citations: 8
License type: open-access
