The Use of Density-Based Spatial Clustering of Application With Noise (DBSCAN) for Record Linkage in An Observational HIV Cohort

Victor Olago, Julia Bohlius, Lina Bartels, Tafadzwa Dhokotera, Mazvita Sengayi, Elvira Singh, Matthias Egger

doi:10.23889/ijpds.v5i5.1422

Abstract

Introduction

The South African HIV Cancer Match (SAM) study is a probabilistic record linkage study that created an HIV cohort from laboratory records of the National Health Laboratory Service (NHLS). This cohort was linked to the pathology-based South African National Cancer Registry to establish cancer incidence in the HIV-positive population of South Africa. As the number of HIV records increases, more efficient ways of deduplicating these large datasets are needed. In this work, we used clustering to deduplicate big data.

Objectives and Approach

Our objective was to use DBSCAN as the clustering algorithm, together with a bi-gram word analyser, to deduplicate big data in resource-limited settings. We used HIV-related laboratory records from across South Africa, collated in the NHLS Corporate Data Warehouse for the period 2004-2014. The workflow involved data pre-processing, deterministic deduplication, n-gram generation, feature generation using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, clustering with DBSCAN, and assigning cluster labels to records that potentially belonged to the same person. We used records with national identification numbers to assess the quality of deduplication by calculating precision, recall and F-measure.

Results

We had 51,563,127 HIV-related laboratory records. Deterministic deduplication resulted in 20,387,819 deduplicated patient records. With DBSCAN clustering we further reduced this to 14,849,524 patient record clusters. In this final dataset, 3,355,544 (22.60%) patients had a negative HIV test, 11,316,937 (76.21%) had evidence of HIV infection, and for 177,043 (1.19%) the HIV status could not be determined. The precision, recall and F-measure based on 1,865,445 records with national identification numbers were 0.96, 0.94 and 0.95, respectively.

Conclusion / Implications

Our study demonstrated that DBSCAN clustering is an effective way of deduplicating big datasets in resource-limited settings. It enabled refinement of an HIV observational database by accurately linking test records that potentially belonged to the same person. The methodology creates opportunities for easy data profiling to inform public health decision making.
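The workflow named in the abstract (bi-gram generation, TF-IDF features, DBSCAN clustering, then precision, recall and F-measure against national identification numbers) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The records and identifiers are made up, and the following are all assumptions chosen for illustration: character bi-grams within words, a smoothed IDF, cosine distance, `eps=0.3`, `min_samples=2`, and pairwise scoring of linked records.

```python
import math
from collections import Counter
from itertools import combinations


def char_bigrams(text):
    """Character bi-grams taken inside each whitespace-separated word."""
    return [w[i:i + 2] for w in text.lower().split() for i in range(len(w) - 1)]


def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors (smoothed IDF) stored as sparse dicts."""
    counts = [Counter(char_bigrams(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(docs)
    vectors = []
    for c in counts:
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; both inputs are already unit length."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Plain O(n^2) DBSCAN; returns one label per record, -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, so keep expanding
    return labels


def linked_pairs(labels, noise=None):
    """All record pairs sharing a label; records labelled `noise` stay unlinked."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j] and labels[i] != noise
    }


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(predicted, truth):
    """Pairwise precision, recall and F-measure of predicted links."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical records (name, surname, date of birth) and made-up identifiers.
records = [
    "thabo mokoena 19800214",
    "thabo mokwena 19800214",    # spelling variant of the record above
    "lerato dlamini 19751103",
    "lerato dlammini 19751103",  # spelling variant of the record above
    "sipho nkosi 19900721",
]
national_ids = ["ID-A", "ID-A", "ID-B", "ID-B", "ID-C"]

labels = dbscan(tfidf_vectors(records), eps=0.3, min_samples=2)
scores = precision_recall_f(linked_pairs(labels, noise=-1), linked_pairs(national_ids))
print(labels, scores)
```

In this sketch a record labelled `-1` is treated as a patient with no duplicate. The all-pairs neighbour search is only workable for toy inputs; at the scale reported here an indexed or sparse neighbour search would be needed, and libraries such as scikit-learn provide `TfidfVectorizer` and `DBSCAN` as ready-made building blocks. The helper `f_measure` also reproduces the reported figure: a precision of 0.96 and a recall of 0.94 give an F-measure that rounds to 0.95.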

Journal: International Journal of Population Data Science
Publication Date: Dec 7, 2020
License type: CC BY 4.0