Unstructured Documents Research Articles

Customs is a crucial line of national security defense, yet its existing risk prevention and screening system falls short in the intelligence and diversity of its risk identification factors. The urgent issues for the risk identification system are therefore the intelligent extraction of key information from Customs Unstructured Accompanying Documents (CUADs) and the reliability of the extraction results. In the customs scenario, OCR is employed for M2M interactions, but current models have difficulty adapting to diverse image qualities and complex customs document content. We propose a hybrid mutual learning knowledge distillation (HMLKD) method to optimize a pre-trained OCR model’s performance against these challenges. Current models also fail to incorporate domain-specific knowledge effectively, so their text recognition accuracy is insufficient for practical customs risk identification. We therefore build a customs domain knowledge graph (CDKG) from CUAD knowledge and propose an integrated post-OCR correction method based on it (iCDKG-PostOCR). Results on real data show that accuracy improves to 97.70% for code text fields, 96.55% for character-type fields, and 96.00% for numerical-type fields, with a confidence rate exceeding 99% for each. Furthermore, the Customs Health Certificate Extraction System (CHCES), developed using the proposed method, has been implemented and verified at Tianjin Customs in China, where it has shown outstanding operational performance.
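
The idea of correcting OCR output with domain knowledge can be illustrated with a much simpler stand-in than the CDKG described above. The sketch below is not the iCDKG-PostOCR method: it assumes a flat, hypothetical vocabulary (`KNOWN_PORTS`) in place of a knowledge graph, and uses fuzzy string matching from Python's standard `difflib` module to snap a misread field to its closest known value. The function name `correct_field` and the cutoff of 0.8 are illustrative choices only.

```python
from difflib import get_close_matches

# Hypothetical domain vocabulary. The abstract's CDKG is a knowledge graph;
# a flat list is used here only to keep the illustration small.
KNOWN_PORTS = ["TIANJIN", "SHANGHAI", "QINGDAO", "NINGBO"]


def correct_field(ocr_text, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the raw OCR text if none is close enough."""
    candidate = ocr_text.strip().upper()
    # get_close_matches ranks entries by difflib's similarity ratio and
    # keeps only those scoring at least `cutoff`.
    matches = get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text


# A digit "1" misread in place of the letter "I" is snapped to the known value.
print(correct_field("T1ANJIN", KNOWN_PORTS))
# A value with no close vocabulary entry is returned unchanged.
print(correct_field("ROTTERDAM", KNOWN_PORTS))
```

Returning the raw text when no entry clears the cutoff keeps the correction conservative: a value outside the vocabulary is passed through rather than forced onto a wrong match.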

Unstructured data affects 95% of organizations and costs them millions of dollars annually. Managed well, it can significantly improve business productivity. Traditional information extraction techniques are limited in functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing from the literature. The purpose of this Systematic Literature Review (SLR) is to identify and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. The SLR guidelines proposed by Kitchenham and Charters were followed to conduct a literature search across various databases, covering 2010 to 2020. We found that: (1) the existing information extraction techniques are template-based or rule-based; (2) the existing methods cannot handle complex document layouts, such as invoices and purchase orders, in real-time situations; and (3) the publicly available datasets are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR found that AI-based approaches have strong potential to extract useful information from unstructured documents automatically, although they face challenges in processing documents with multiple layouts. Our SLR also conceptualizes a framework for constructing a high-quality dataset of unstructured documents, with strong data validation techniques, for automated information extraction. Finally, it reveals a need for close collaboration between businesses and researchers to handle the various challenges of unstructured data analysis.

Articles published on Unstructured Documents

TableExtractNet: A Model of Automatic Detection and Recognition of Table Structures from Unstructured Documents

Unstructured Document Information Extraction Method with Multi-Faceted Domain Knowledge Graph Assistance for M2M Customs Risk Prevention and Screening Application

Analysis of Unstructured Document Data Extraction Technology

Supervised Learning Algorithm on Unstructured Documents for the Classification of Job Offers: Case of Cameroun

Tokengrid: Toward More Efficient Data Extraction From Unstructured Documents

Efficient Automated Processing of the Unstructured Documents Using Artificial Intelligence: A Systematic Literature Review and Future Directions

Extraction of Sequence of Actions from Unstructured Requirements Specification Document

Innovative Model for Student Project Evaluation Based on Text Mining

Knowledge Extraction System from Unstructured Documents

MalayIK: An Ontological Approach to Knowledge Transformation in Malay Unstructured Documents

A Novel Method for Classification of Unstructured Documents by using Wordnet based Semantic Similarity

Implementation of a Climbing Plan Sharing System Based on Making Climbing Plans Written in Natural Language Machine-Readable

Application of Predictive and Descriptive Text Mining Techniques for Analysis and Organization of Unstructured Documents

Automatic Classification and Categorization: Application for Identifying and Thematic Analysing of Textual Unstructured Documents

Information Extraction from the Un-Structured Document using Grammatical Inference and Alignment Similarity

A Metadata-Based Approach for Unstructured Document Management in Organizations
