Abstract

The structure of a document carries rich information, such as logical relations in context, hierarchy, affiliation, dependence, and applicability. This structure strongly affects the accuracy of document information processing, particularly for legal documents and business contracts. Intelligent document structural analysis is therefore important for information extraction and data mining. However, unlike the well-studied field of text semantic analysis, work on document structural analysis remains scarce. In this paper, we propose an intelligent document structural analysis framework consisting of data pre-processing, feature engineering, and structural classification with a dynamic sample weighting algorithm. As a typical application, we collect more than 11,000 insurance document content samples and carry out machine learning experiments to evaluate our framework. To address the sample imbalance problem in the hierarchy classification task, a dynamic sample weighting algorithm is incorporated into our Dynamic Weighting Structural Analysis (DWSA) framework, in which the weights of the category tags, assigned according to structural level, are updated iteratively during training. Our results show that DWSA significantly improves both the comprehensive accuracy and the classification F1-score of each category. The comprehensive accuracy reaches 94.68% (a 3.36% absolute improvement) and the Macro F1-score reaches 88.29% (a 5.1% absolute improvement).
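To make the idea of iterating category weights during training concrete, the following is a minimal sketch and not the DWSA algorithm itself. The abstract does not state the update rule, so the rule used here (raise the weight of categories with a low F1-score after each training round, then renormalise) is an assumption. The tag names `heading` and `body`, the `rate` parameter, and all function names are likewise hypothetical.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1-score} from parallel lists of true and predicted tags."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(f1_scores):
    """Macro F1 is the unweighted mean of the per-class F1-scores."""
    return sum(f1_scores.values()) / len(f1_scores)


def update_class_weights(weights, f1_scores, rate=0.5):
    """Assumed update rule: boost classes with low F1, keep the mean weight at 1."""
    raw = {c: weights[c] * (1.0 + rate * (1.0 - f1_scores[c])) for c in weights}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


def sample_weights(y_true, class_weights):
    """Expand class weights to one weight per training sample."""
    return [class_weights[c] for c in y_true]


# One illustrative round with hypothetical tags and predictions.
labels = ["heading", "body"]
y_true = ["body"] * 6 + ["heading"] * 2
y_pred = ["body"] * 7 + ["heading"]        # one heading misclassified as body

weights = {c: 1.0 for c in labels}         # start from uniform weights
f1 = per_class_f1(y_true, y_pred, labels)
weights = update_class_weights(weights, f1)
print(f1, macro_f1(f1), weights)           # the rarer "heading" tag gets the larger weight
```

In a full training loop, the list returned by `sample_weights` would be passed to the classifier on the next round (many libraries accept a per-sample weight argument when fitting), and the update would repeat until the weights or the Macro F1-score stabilise.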

Highlights

  • With the rapid growth of insurance policies, consumers and insurance industry practitioners need to effectively extract useful information from a vast number of insurance documents

  • We process the documents using data pre-processing, feature engineering, and structural classification

  • The comprehensive accuracy reaches 94.68% (3.36% absolute improvement)

Introduction

With the rapid growth of insurance policies, consumers and insurance industry practitioners need to extract useful information effectively from a vast number of insurance documents. For time-consuming and labor-intensive activities, such as identifying key terms and the differences between similar policies, intelligent document processing methods are highly desirable. In the insurance industry specifically, Chinese insurance documents pose extra challenges: most exist in unstructured form, and their text features are often mixed with noise, which causes feature recognition to fail during processing. Further work is needed to process unstructured insurance documents automatically, extracting key information and transforming it into structured data without losing important features. This paper proposes an intelligent information extraction framework for Chinese insurance documents. The proposed model can provide a convenient technology platform for insurance practitioners and related researchers.

Related Studies and the Current Contribution
DWSA Framework
Approach
Data Pre-Processing
Feature Engineering
Dynamic Weighting Algorithm
Results and Discussion
Results of Some Existing Algorithms
The DWSA Experimental Result
Experimental Summary
Conclusions
