An automated data verification approach for improving data quality in a clinical registry

Qi Tian, Mengzhou Liu, Lingtong Min, Jiye An, Xudong Lu, Huilong Duan

doi:10.1016/j.cmpb.2019.01.012

Abstract

Background and Objective: The quality of data is crucial for clinical registry studies because it determines their credibility. In most such studies, a vulnerability arises because researchers record data on paper-based case report forms (CRFs) and then transcribe them into registry databases. Ensuring data quality therefore requires verifying the data in the registry. However, traditional manual data verification methods are time-consuming, labor-intensive, and of limited effectiveness. Because paper-based CRFs and electronic medical records (EMRs) are two sources available for verification, we propose an automated data verification approach, based on optical character recognition (OCR) and information retrieval, to identify data errors in a registry more efficiently.
Methods: The automated verification approach consists of three steps. First, we analyze the scanned images of paper-based CRFs with machine-learning-enhanced OCR to recognize checkbox marks and handwriting. Second, we retrieve the related patient information from the EMRs using natural language processing (NLP) techniques. Finally, we compare the information obtained in the previous two steps with the data in the registry and synthesize the results. The proposed method was applied in a Chinese registry study, and the difference between the automated and manual approaches was evaluated.
Results: The automated approach was implemented in the Chinese Coronary Artery Disease Registry. For CRF data recognition, the accuracies for checkbox marks and handwriting are 0.93 and 0.74, respectively. For EMR data extraction, the accuracy of information retrieval from textual electronic medical records is 0.97. The accuracy, recall, and time consumption of the automated approach are 0.93, 0.96, and 0.5 h, compared with 0.92, 0.71, and 7.5 h for the manual approach.
Conclusions: Compared with the manual data verification approach, the automated approach improves recall in identifying data errors, achieves higher accuracy, and takes far less time. The results show that the automated approach is more effective and efficient at identifying incomplete and incorrect data in a registry. The proposed approach has the potential to improve the quality of registry data.
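
The third step of the method, comparing values recovered from the CRFs and the EMRs against the registry, could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function and field names (`verify_record`, `Finding`, `age`, `smoker`, and so on), the status labels, and the rule for handling disagreement between the two sources are all assumptions, since the abstract does not specify how results are synthesized. The OCR and NLP steps are assumed to have already produced plain field-to-value dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Finding:
    """Verification outcome for one registry field (hypothetical structure)."""
    field: str
    status: str  # "ok", "incomplete", "incorrect", "conflict", or "unverified"
    registry_value: Optional[str]
    source_values: Dict[str, str]


def normalize(value) -> Optional[str]:
    """Trim and lowercase a value; treat empty or missing values as None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return text or None


def verify_record(registry: dict, crf: dict, emr: dict) -> List[Finding]:
    """Compare one patient's registry entry with CRF- and EMR-derived values."""
    findings = []
    for field in sorted(set(registry) | set(crf) | set(emr)):
        reg = normalize(registry.get(field))
        candidates = (("crf", normalize(crf.get(field))),
                      ("emr", normalize(emr.get(field))))
        sources = {name: v for name, v in candidates if v is not None}

        if not sources:
            status = "unverified"   # nothing to check the registry against
        elif reg is None:
            status = "incomplete"   # a source has a value, the registry does not
        elif all(v == reg for v in sources.values()):
            status = "ok"
        elif any(v == reg for v in sources.values()):
            status = "conflict"     # assumed rule: sources disagree, needs review
        else:
            status = "incorrect"    # every available source contradicts the registry
        findings.append(Finding(field, status, reg, sources))
    return findings


# Illustrative, made-up values for a single patient.
registry = {"sex": "Male", "age": "63", "smoker": "", "diabetes": "yes"}
crf = {"sex": "male", "age": "68", "smoker": "yes", "diabetes": "yes"}
emr = {"sex": "male", "age": "68", "diabetes": "no"}

for finding in verify_record(registry, crf, emr):
    print(finding.field, finding.status)
```

In this sketch, an empty registry field with a value available in either source is reported as incomplete, and a registry value contradicted by every available source is reported as incorrect, mirroring the two error types named in the conclusions. A real pipeline would also need to account for recognition confidence, given the reported handwriting accuracy of 0.74.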

Full Text