Abstract

ABSTRACTOBJECTIVE To analyze the accuracy of deterministic and probabilistic record linkage to identify TB duplicate records, as well as the characteristics of discordant pairs.METHODS The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, based on the combination of fragments of the key variables with or without modification (Soundex or substring). Each rule was formed by three or more fragments. The probabilistic approach required a cutoff point for the score, above which the links would be automatically classified as belonging to the same individual. The cutoff point was obtained by linkage of the Notifiable Diseases Information System – Tuberculosis database with itself, subsequent manual review and ROC curves and precision-recall. Sensitivity and specificity for accurate analysis were calculated.RESULTS Accuracy ranged from 87.2% to 95.2% for sensitivity and 99.8% to 99.9% for specificity for probabilistic and deterministic record linkage, respectively. The occurrence of missing values for the key variables and the low percentage of similarity measure for name and date of birth were mainly responsible for the failure to identify records of the same individual with the techniques used.CONCLUSIONS The two techniques showed a high level of correlation for pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User need and experience should be considered when choosing the best technique to be used.

Highlights

  • Like other countries, Brazil has a large volume of health data collected via national information systems and available in different databases

  • Accuracy ranged from 87.2% to 95.2% for sensitivity and 99.8% to 99.9% for specificity for probabilistic and deterministic record linkage, respectively

  • The occurrence of missing values for the key variables and the low percentage of similarity measure for name and date of birth were mainly responsible for the failure to identify records of the same individual with the techniques used

Read more

Summary

Introduction

Brazil has a large volume of health data collected via national information systems and available in different databases. Linkage between databases aims to identify if two or more records relate to the same entity, generally the same individual. This technique is used to identify duplications in a same file or between two or more files whose information must be gathered in a single database[12]. Deterministic record linkage uses exact comparison functions and classification based on rules developed from the knowledge of specialists[6]. Specific computational routines must be developed for each problem Low data quality, such the occurrence of missing data and typos, can contribute to the mismatch of variables, the importance of evaluating the accuracy of database linkage techniques

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call