Abstract

This study aimed to develop a technology to recognize and warehouse text data from tables in technical documents relating to construction and plant projects. To this end, a table optical character recognition (OCR) technology was proposed that tags text data by recognizing the structure of the table and the context of its content. For analysis, the table format was first classified into two patterns, T1 and T2: T1 refers to a table with a single-level header, and T2 refers to a table with a two-level header. The table OCR model first extracts data from the headers and then extracts text cell by cell using OpenCV. A trained model improves the text recognition rate through the long short-term memory (LSTM)-based Tesseract OCR engine. Data extracted from the table were stored in a database and output in CSV format. A confusion matrix was applied to verify the recognition accuracy of the extracted data; the resulting F-measure was 96% for T1 and 87% for T2. The outcome of this study is therefore expected to automate the management of tasks that have hitherto relied solely on the engineer's experience, thereby reducing the workload and improving the productivity of the engineer in charge.

Keywords: Automatic conversion of table contents; Table OCR; OpenCV; Tesseract OCR
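
As a rough illustration of the verification step, the F-measure can be derived from confusion-matrix counts as the harmonic mean of precision and recall. The sketch below is not the authors' evaluation code: the `Counts` class, the function names, and the example counts are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for extracted cells (hypothetical container)."""
    tp: int  # cells recognized correctly
    fp: int  # cells output by the OCR step but wrong or spurious
    fn: int  # cells present in the table but missed


def precision(c: Counts) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f_measure(c: Counts) -> float:
    """Harmonic mean of precision and recall (F1)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only; these are NOT the study's data.
example = Counts(tp=90, fp=5, fn=10)
print(f"precision={precision(example):.3f} "
      f"recall={recall(example):.3f} "
      f"F={f_measure(example):.3f}")
```

Computing the score separately for each table pattern (T1 and T2) would give per-pattern figures comparable in form to those reported above.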
