OCTess: AN OPTICAL CHARACTER RECOGNITION ALGORITHM FOR AUTOMATED DATA EXTRACTION OF SPECTRAL DOMAIN OPTICAL COHERENCE TOMOGRAPHY REPORTS.

Michael Balas, Rajeev H Muni, Marko M Popovic, Isabela M Melo, Jack Longwell, Josh Herman, Nishaant Shaan Bhambra

doi:10.1097/iae.0000000000003990

Abstract

Manual extraction of data from spectral domain optical coherence tomography (SD-OCT) reports is time- and resource-intensive. This study aimed to develop an optical character recognition (OCR) algorithm for automated data extraction from Cirrus SD-OCT macular cube reports. SD-OCT monocular macular cube reports (n = 675) were randomly selected from a single-center database of patients from 2020 to 2023. Image processing and bounding box operations were performed, and Tesseract (an OCR library) was used to develop the algorithm, OCTess. The algorithm was validated using a separate test data set. The long short-term memory deep learning version of Tesseract achieved the best performance. After reverifying all discrepancies between human and algorithmic data extractions, OCTess achieved accuracies of 100.00% and 99.98% in the training (n = 125) and testing (n = 550) data sets, respectively, while the human error rates were 1.11% (98.89% accuracy) and 0.49% (99.51% accuracy). OCTess extracted data in 3.1 seconds, compared with 94.3 seconds per report for human evaluators. We developed an OCR and machine learning algorithm that extracted SD-OCT data with near-perfect accuracy, outperforming humans in both accuracy and efficiency. This algorithm can be used for efficient construction of large-scale SD-OCT data sets for researchers and clinicians.
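The validation logic described in the abstract (flag every disagreement between the human and algorithmic extractions, reverify those fields, then score each extractor against the reverified values) can be illustrated with a minimal Python sketch. This is not the authors' code: the field names, the toy values, and the choice to score accuracy per extracted field are assumptions made for illustration, and the speed-up figure assumes that both reported timings are per report.

```python
from typing import Dict, List, Tuple

# A report is modelled as a mapping from field name to extracted text.
# The field names used below are hypothetical, not taken from the paper.
Report = Dict[str, str]


def find_discrepancies(human: List[Report], algo: List[Report]) -> List[Tuple[int, str]]:
    """Return (report index, field) pairs where the two extractions differ."""
    flagged = []
    for i, (h, a) in enumerate(zip(human, algo)):
        for field in sorted(set(h) | set(a)):
            if h.get(field) != a.get(field):
                flagged.append((i, field))
    return flagged


def field_accuracy(extracted: List[Report], truth: List[Report]) -> float:
    """Fraction of reference fields that the extraction reproduced exactly."""
    total = sum(len(t) for t in truth)
    correct = sum(
        e.get(field) == value
        for e, t in zip(extracted, truth)
        for field, value in t.items()
    )
    return correct / total


# Toy data: `truth` stands in for the values confirmed after reverification.
truth = [
    {"central_subfield_thickness": "262", "cube_volume": "10.1"},
    {"central_subfield_thickness": "301", "cube_volume": "11.4"},
]
human = [
    {"central_subfield_thickness": "262", "cube_volume": "10.1"},
    {"central_subfield_thickness": "310", "cube_volume": "11.4"},  # transposed digits
]
algo = [dict(report) for report in truth]

flagged = find_discrepancies(human, algo)  # fields a reviewer would recheck
human_acc = field_accuracy(human, truth)
algo_acc = field_accuracy(algo, truth)

# Timings reported in the abstract, assumed here to both be per report.
HUMAN_SECONDS_PER_REPORT = 94.3
OCTESS_SECONDS_PER_REPORT = 3.1
speedup = HUMAN_SECONDS_PER_REPORT / OCTESS_SECONDS_PER_REPORT

print(flagged, human_acc, algo_acc, round(speedup, 1))
```

Only the flagged fields need a second look, which is why reverification stays tractable even when both extractions are otherwise in close agreement. Under the per-report timing assumption, the reported figures correspond to a speed-up of roughly 30-fold.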

Full Text