Improving prevalence estimation through data fusion: methods and validation.

Tomàs Aluja-Banet, Núria Brunsó, Anna Mompart-Penina, Josep Daunis-I-Estadella

doi:10.1186/s12911-015-0169-z

Abstract

Background: Estimation of health prevalences is usually performed with a single survey. Some attempts have been made to integrate more than one source of data. We propose here to validate this approach through data fusion. Data fusion is the process of integrating two sources of data into one combined file. It allows us to take even greater advantage of existing information collected in databases. Here, we use data fusion to improve the estimation of health prevalences for two primary health factors: cardiovascular diseases and diabetes.

Methods: We use a real data fusion operation on population health, where the imputation of basic health risk factors is used to enrich a large-scale survey on self-reported health status. We propose choosing the imputation methodology for this problem through a suite of validation statistics that assess the quality of the fused data. The compared imputation techniques have been chosen from among the main imputation methodologies: k-nearest neighbor, probabilistic modeling and regression. We use the 2006 Health Survey of Catalonia, which provides a complete report of the perceived health status. In order to deal with the uncertainty problem, we compare these methodologies under the single and multiple imputation frames.

Results: A suite of validation statistics allows us to discern the strengths and weaknesses of the studied imputation methods. Multiple imputation outperforms single imputation by providing better and much more stable estimates, according to the computed validation statistics. The summarized results indicate that the probabilistic methods preserve the multivariate structure better; sequential regression methods deliver greater accuracy of imputed data; and nearest neighbor methods end up with a more realistic distribution of imputed data.

Conclusions: Data fusion allows us to integrate two sources of information in order to take greater advantage of the available data. Multiply imputed sequential regression models have the advantage of greater interpretability and can be used for health policy. Under certain conditions, more accurate estimates of the prevalences can be obtained using fused data (the original data plus the imputed data) than by using only the observed data.
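The abstract compares single and multiple imputation but does not spell out how estimates from several imputed files are combined. A minimal sketch follows, assuming the standard combining rules for multiple imputation (Rubin's rules) and a simple-random-sampling variance for each prevalence. The function name `pool_prevalence` and the 0/1 data layout are illustrative, not taken from the paper, and survey weights are ignored.

```python
import math
from statistics import mean, variance


def pool_prevalence(imputed_datasets):
    """Pool a prevalence over m imputed datasets (Rubin's rules, assumed here).

    Each dataset is a list of 0/1 indicators (1 = condition present).
    Returns (pooled prevalence, pooled standard error).
    """
    m = len(imputed_datasets)
    estimates = []  # prevalence estimated in each completed dataset
    within = []     # sampling variance of each estimate
    for data in imputed_datasets:
        n = len(data)
        p = sum(data) / n
        estimates.append(p)
        within.append(p * (1 - p) / n)  # assumes simple random sampling

    q_bar = mean(estimates)                     # pooled point estimate
    w_bar = mean(within)                        # average within-imputation variance
    b = variance(estimates) if m > 1 else 0.0   # between-imputation variance
    total = w_bar + (1 + 1 / m) * b             # total variance
    return q_bar, math.sqrt(total)


# Three hypothetical completed datasets for the same four respondents.
prevalence, std_error = pool_prevalence([[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
```

With a single imputation (m = 1) the between-imputation term vanishes, so the uncertainty introduced by imputing is not reflected in the standard error; this is one way to read the abstract's remark about dealing with the uncertainty problem.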

Highlights

Estimation of health prevalences is usually performed with a single survey
Application to health survey data: for our imputation models, we selected a parametric imputation method based on the Data Augmentation (DA) algorithm, a sequential regression of the fusing variables (SQ-reg), and a stochastic hot deck imputation, classically obtained through the nearest neighbor algorithm (1nn)
In this work we have shown that Data Fusion allows us to integrate two sources of information in order to better take advantage of the available data
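The nearest neighbor hot deck named in the highlights can be illustrated with a small sketch. It assumes a donor file holding both the common variables and the variable to be imputed, and a recipient file holding only the common variables. The function `hot_deck_fuse`, the parameters `k` and `seed`, and the toy records are hypothetical and are not the authors' implementation.

```python
import math
import random


def hot_deck_fuse(donors, recipients, k=1, seed=0):
    """Impute a missing variable for each recipient from a nearby donor.

    donors:     list of (x, y) pairs, x a tuple of common variables,
                y the value observed only in the donor file.
    recipients: list of x tuples (common variables only).
    k:          size of the donor neighbourhood; k=1 is plain 1nn,
                k>1 draws one of the k nearest donors at random.
    """
    rng = random.Random(seed)
    fused = []
    for x in recipients:
        # Rank donors by Euclidean distance on the common variables.
        # In practice the variables would be put on a comparable scale first.
        ranked = sorted(donors, key=lambda d: math.dist(d[0], x))
        _, y = rng.choice(ranked[:k])
        fused.append((x, y))
    return fused


# Toy records: (age, body mass index) -> diabetes indicator.
donors = [((30, 22.0), 0), ((65, 31.0), 1), ((50, 27.0), 0)]
recipients = [(63, 30.0), (28, 23.0)]
fused = hot_deck_fuse(donors, recipients)
# fused == [((63, 30.0), 1), ((28, 23.0), 0)]
```

Because the imputed values are copied from real donors, every imputed value is one that was actually observed, which is consistent with the abstract's finding that nearest neighbor methods yield a more realistic distribution of imputed data.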

Summary

Introduction

Estimation of health prevalences is usually performed with a single survey. Some attempts have been made to integrate more than one source of data. Data fusion is the process of integrating two sources of data into one combined file.

Overview of the problem

Large-scale surveys based on interviews are used as a tool to assess the health of the population. These surveys provide large representative samples of the population of interest. The data obtained are based on questions and self-reported answers, which can lead to inaccurate and biased estimates of health conditions. Data fusion techniques are used to integrate information from different sources in order to improve the estimation of prevalences. Data fusion is a technological operation undertaken for specific operational purposes, with the aim of gaining more information.



Journal: BMC Medical Informatics and Decision Making
Publication Date: Jun 24, 2015
Citations: 22
License type: CC BY 4.0


