Abstract
Unstructured data impacts 95% of organizations and costs them millions of dollars annually. If managed well, it can significantly improve business productivity. Traditional information extraction techniques are limited in their functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing in the literature. The purpose of this Systematic Literature Review (SLR) is to identify and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. We followed the SLR guidelines proposed by Kitchenham and Charters to conduct a literature search on various databases between 2010 and 2020. We found that: (1) the existing information extraction techniques are template-based or rule-based; (2) the existing methods lack the capability to tackle complex document layouts, such as those of invoices and purchase orders, in real-time situations; and (3) the publicly available datasets are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR found that AI-based approaches have strong potential to extract useful information from unstructured documents automatically. However, they face challenges in processing the multiple layouts of unstructured documents. Our SLR presents the conceptualization of a framework for constructing a high-quality dataset of unstructured documents, with strong data validation techniques, for automated information extraction. Our SLR also reveals a need for close collaboration between businesses and researchers to handle the various challenges of unstructured data analysis.
Highlights
With the advent of new communication media and applications such as social media, mobile applications, and digital marketing, the data produced does not have a typical format or predefined schema like standard data and cannot be managed with relational database models
We found few studies that mentioned the advantage of Bi-directional Long Short-Term Memory (Bi-LSTM) and Conditional Random Fields (CRF) for the information extraction task
In this Systematic Literature Review (SLR), we reviewed a large number of research papers on automatic information extraction from unstructured documents
Summary
With the advent of new communication media and applications such as social media, mobile applications, and digital marketing, the data produced does not have a typical format or predefined schema like standard data and cannot be managed with relational database models. Data is generated in various forms such as text, audio, videos, emails, and images. Organizations can use data analysis tools to better understand customer needs and forecast market variations.
D. Baviskar et al.: Efficient Automated Processing of Unstructured Documents Using AI
A. OPTICAL CHARACTER RECOGNITION (OCR)
The manual extraction of text from unstructured documents such as scanned PDFs is not scalable and is error-prone, as humans tend to get tired and make mistakes. Organizations have recently tried to use template-based approaches such as OCR to automate document processing. OCR is used to recognize the text within an image, usually a scanned printed or handwritten document. OCR can automatically sort various document types and organize them according to particular rules, for example classifying and managing invoices based on the type of product or vendor.
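The rule-based sorting described above can be illustrated with a minimal sketch. It assumes the OCR step has already produced plain text for each document; the rule table, labels, vendor names, and function names below are hypothetical and are not taken from the reviewed paper.

```python
import re
from collections import defaultdict

# Hypothetical routing rules (label -> regex patterns); not from the source paper.
RULES = {
    "vendor:acme": [r"\bacme\s+corp\b"],
    "vendor:globex": [r"\bglobex\b"],
    "product:hardware": [r"\blaptop\b", r"\bmonitor\b"],
}


def classify(ocr_text, rules=RULES):
    """Return the sorted labels whose patterns match the OCR text."""
    text = ocr_text.lower()
    return sorted(
        label
        for label, patterns in rules.items()
        if any(re.search(pattern, text) for pattern in patterns)
    )


def sort_documents(docs, rules=RULES):
    """Group document ids by matching label; unmatched ids go to 'unclassified'."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        labels = classify(text, rules) or ["unclassified"]
        for label in labels:
            buckets[label].append(doc_id)
    return dict(buckets)


if __name__ == "__main__":
    # Made-up OCR output, for illustration only.
    sample = {
        "inv-001": "INVOICE\nAcme Corp\n2 x Laptop",
        "inv-002": "Globex Ltd. Purchase order: office chairs",
    }
    print(sort_documents(sample))
```

A fixed rule table like this only recognizes the vendors and products it lists, which is one way to see the limitation the abstract attributes to template-based and rule-based techniques when document layouts vary.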