Automatic Conversion of Table Contents from PDF Technical Specification Documents into Database Using AI Optical Character Recognition (OCR)

Minji Park, Eul-Bum Lee, Jong-Hwi Hwang, Chae-Yeon Kim, Sowon Choi

doi:10.1007/978-981-19-3951-8_22

Abstract

This study aimed to develop a technology to recognize and warehouse text data from tables in technical documents relating to construction and plant projects. To this end, a table optical character recognition (OCR) technology was proposed that tags text data by recognizing both the structure of the table and the context of its content. For analysis, table formats were first classified into two patterns, T1 and T2: T1 is a table with a single-level header, and T2 is a table with a two-level header. The table OCR model first extracts data from the headers and then extracts text cell by cell using the OpenCV engine. A trained model built on the long short-term memory (LSTM)-based Tesseract OCR engine improves the text recognition rate. Data extracted from the tables were stored in a database (DB) and output in CSV format. A confusion matrix was applied to verify the recognition accuracy of the extracted data; the resulting F-measure was 96% for T1 and 87% for T2. Based on this outcome, automating the management of tasks that have so far relied solely on the engineer's experience is expected to reduce the workload and improve the productivity of the engineer in charge.

Keywords: Automatic conversion of table contents; Table OCR; OpenCV; Tesseract OCR
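The abstract reports F-measure values derived from a confusion matrix but does not spell out the calculation. The following is a minimal sketch of the standard precision, recall, and F-measure formulas, assuming cell-level counting. The `ConfusionCounts` structure, the meaning given to each count, and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cell-level outcome counts (hypothetical structure, not from the paper)."""
    true_positive: int   # extracted cells that match the ground truth
    false_positive: int  # extracted cells that are wrong or spurious
    false_negative: int  # ground-truth cells that were missed


def precision(c: ConfusionCounts) -> float:
    """Share of extracted cells that are correct."""
    extracted = c.true_positive + c.false_positive
    return c.true_positive / extracted if extracted else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of ground-truth cells that were recovered."""
    expected = c.true_positive + c.false_negative
    return c.true_positive / expected if expected else 0.0


def f_measure(c: ConfusionCounts, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    p, r = precision(c), recall(c)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only; not the study's data.
    sample = ConfusionCounts(true_positive=90, false_positive=10, false_negative=10)
    print(f"precision={precision(sample):.2f}")
    print(f"recall={recall(sample):.2f}")
    print(f"F-measure={f_measure(sample):.2f}")
```

Computing the score separately for each table pattern (one set of counts for T1, one for T2) would give per-pattern figures comparable in form to those reported above.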
