Migration Status Gradients in Immigrant Poverty: A Comparison of Imputation Methods
Research on the stratifying effects of migration status has increased sharply in the last two decades, although efforts have been hampered by the near absence of representative data that include detailed migration status measures. Researchers have developed various statistical and logical imputation methods that have produced widely varying estimates. In this article, we introduce a new indicator of migration status constructed from two federal surveys matched to the Social Security Administration's Numident file, a database that includes all citizens and legal residents of the United States. In models predicting poverty, our measure produces estimates comparable to those based on respondents’ own self-reports, in one federal survey, of their migration status. Both the administrative and survey-based measures produce poverty gradients that diverge from those produced by logic-based measures. Our findings contribute to mounting evidence of bias in the use of certain kinds of logic-based algorithms to impute migration status and demonstrate the promise of administrative record linkages in migration status research.
- Research Article
5
- 10.15415/mjis.2013.12015
- Mar 2, 2013
- Mathematical Journal of Interdisciplinary Sciences
Many existing industrial and research data sets contain missing values (MVs). There are various reasons for their existence, such as manual data entry procedures, equipment errors, and incorrect measurements. The presence of such imperfections usually requires a preprocessing stage in which the data are prepared and cleaned so that they are useful to and sufficiently clear for the knowledge extraction process. MVs make data analysis difficult and can pose serious problems for researchers. In fact, inappropriate handling of MVs in the analysis may introduce bias, lead to misleading conclusions being drawn from a research study, and limit the generalizability of the research findings. The problems usually associated with MVs in data mining are (1) loss of efficiency; (2) complications in handling and analyzing the data; and (3) bias resulting from differences between missing and complete data. We focus our attention on the use of imputation methods. A fundamental advantage of this approach is that the MV treatment is independent of the learning algorithm used. For this reason, the user can select the most appropriate method for each situation. In this paper, different methods of estimating missing values are discussed. The imputation methods are compared using nonparametric methods.
- Research Article
7
- 10.1080/00949650903437842
- May 1, 2011
- Journal of Statistical Computation and Simulation
It is well known that if a multivariate outlier has one or more missing component values, then multiple imputation (MI) methods tend to impute nonextreme values and make the outlier become less extreme and less likely to be detected. In this paper, nonparametric depth-based multivariate outlier identifiers are used as criteria in a numerical study comparing several established methods of MI as well as a new proposed one, nine in all, in a setting of several actual clinical laboratory data sets of different dimensions. Two criteria, an ‘outlier recovery probability’ and a ‘relative accuracy measure’, are developed, based on depth functions. Three outlier identifiers, based on Mahalanobis distance, robust Mahalanobis distance, and generalized principal component analysis, are also included in the study. Consequently, this study compares not only imputation methods but also outlier detection methods. Our findings show that the performance of an MI method depends on the choice of depth-based outlier detection criterion, as well as the size and dimension of the data and the fraction of missing components. By taking these features into account, an MI method for a given data set can be selected more optimally.
- Research Article
- 10.51541/nicel.1307183
- Dec 31, 2023
- Nicel Bilimler Dergisi
In this study, the imputation methods that IWGPS (*) organizations use for missing data in CPI (Consumer Price Index) calculations are discussed. In line with the development of technological devices, methods are proposed that meet the demand for collecting field data and producing statistics immediately, and that can be adapted to the data collection tools of statistics offices. The immediate-imputation advantages of the proposed methods are described, and their imputation results are compared with the results of the methods used in current practice and with cellwise outlier and missing data imputation results from the statistical computer programming language. The method3(i_m üd19), proposed to assist the imputation tools used in CPI calculation all over the world and produced from the statistics, is intended to provide convenience to all users. It can also be considered a common weighted imputation method for both the cellwise outlier and the missing data case.
- Research Article
169
- 10.1049/iet-its.2013.0052
- Feb 1, 2014
- IET Intelligent Transport Systems
Many traffic management and control applications require highly complete and accurate data of traffic flow. However, because of various reasons such as sensor failure or transmission error, it is common that some traffic flow data are lost. As a result, various methods were proposed by using a wide spectrum of techniques to estimate missing traffic data in the last two decades. Generally, these missing data imputation methods can be categorised into three kinds: prediction methods, interpolation methods and statistical learning methods. To assess their performance, these methods are compared from different aspects in this paper, including reconstruction errors, statistical behaviours and running speeds. Results show that statistical learning methods are more effective than the other two kinds of imputation methods when data of a single detector is utilised. Among various methods, the probabilistic principal component analysis (PPCA) yields best performance in all aspects. Numerical tests demonstrate that PPCA can be used to impute data online before making further analysis (e.g. make traffic prediction) and is robust to weather changes.
- Research Article
565
- 10.1186/1471-2288-6-57
- Dec 1, 2006
- BMC Medical Research Methodology
Background: Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS). Methods: 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20-question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Missing at random and missing not at random simulations were also completed. Six imputation methods were then considered: 1) multiple imputation, 2) single regression, 3) individual mean, 4) overall mean, 5) participant's preceding response, and 6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated. Results: When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80, indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression methods produced Kappas in the 'substantial agreement' range (0.76 and 0.74 respectively). Conclusion: Multiple imputation is the most accurate method for dealing with missing data in most of the missing data scenarios we assessed for the SDS. Imputing the individual's mean is also an appropriate and simple method for dealing with missing data that may be more interpretable to the majority of medical readers. Researchers should consider conducting methodological assessments such as this one when confronted with missing data. The optimal method should balance validity, ease of interpretability for readers, and analysis expertise of the research team.
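As a rough illustration of the abstract above, the following Python sketch shows individual-mean imputation on one respondent's SDS answers, the "over 40" classification rule, and a two-category Cohen's kappa for comparing classifications before and after imputation. The function names and the sample responses are hypothetical and are not taken from the study; the study's own computations may differ.

```python
from statistics import mean


def impute_individual_mean(responses):
    """Replace None entries with the mean of the respondent's answered items."""
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no answered items to average")
    fill = mean(answered)
    return [fill if r is None else r for r in responses]


def is_symptomatic(total, cutoff=40):
    """Classify as exhibiting depressive symptoms when the total is over the cutoff."""
    return total > cutoff


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of boolean labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share labelled True in the first list
    p_b = sum(b) / n  # share labelled True in the second list
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0  # both lists are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical respondent: 20 items scored 1 to 4, two of them left blank.
answers = [3, 2, None, 4, 2, 3, 3, None, 2, 2, 3, 1, 2, 3, 4, 2, 2, 3, 2, 3]
completed = impute_individual_mean(answers)
label = is_symptomatic(sum(completed))
```

Kappa is then computed over all respondents by passing the list of labels from the complete data and the list of labels from the imputed data to `cohen_kappa`.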
- Research Article
5
- 10.1016/j.ssmph.2020.100675
- Sep 30, 2020
- SSM - Population Health
Rationale: A range of family and neighbourhood indicators of socioeconomic status and migrant status have been shown to be associated with risk of mental health problems (MHP) in children. In this study we determined the independent contributions of these indicators. Objectives: The main objective is to examine independent associations of family and neighbourhood socioeconomic status indicators and migrant status with risk of MHP in children. Methods: We analyzed data from an anonymous public health survey among 5010 parents/caretakers of children aged 4–12 years living in Rotterdam, The Netherlands, gathered in 2018. Outcome of interest was risk of MHP measured using the total difficulties score of the Strengths and Difficulties Questionnaire. Associations of parent-reported perceived financial difficulties, material deprivation (not being able to provide certain goods, or leisure, educational or cultural activities or care use for children due to financial restrictions), parental educational level, child's migrant status and neighbourhood socioeconomic status with risk of MHP and with the total difficulties score were assessed using multilevel multivariable logistic and linear regression models. Results: In total, 473 (9.5%) children had a high risk of MHP. We observed independent associations of perceived financial difficulties, material deprivation and parental educational level with risk of MHP and with an increase in total difficulties score (P < 0.05). Migrant status and neighbourhood socioeconomic status were not independently associated with risk of MHP or a change in total difficulties score. Conclusions: Already in early life, perceived financial difficulties by parents, material deprivation reported by parents and lower parental education appeared to be independently associated with the risk of MHP in 4–12 year olds. Health professionals should be aware of the relatively higher risks in these subgroups and consider policies to address this.
- Research Article
14
- 10.1016/j.ssmph.2022.101039
- Feb 4, 2022
- SSM - Population Health
Background: It is important to provide insight into potential target groups for interventions to reduce socioeconomic inequalities in children's vegetable/fruit consumption. Earlier studies have often used single indicators of socioeconomic status (SES) or migrant status. However, SES is a multidimensional concept and different indicators may measure different SES dimensions. Our objective is to explore multiple associations of SES indicators and migrant status with risk of a low vegetable/fruit consumption in a large multi-ethnic and socioeconomically diverse sample of children. Methods: We included 5,010 parents of 4- to 12-year-olds from a Dutch public health survey administered in 2018. Cross-sectional associations of parental education, material deprivation, perceived financial difficulties, neighbourhood socioeconomic status (NSES) and migrant status with low (≤4 days a week) vegetable and fruit consumption in children were assessed using multilevel multivariable logistic regression models. Results are displayed as odds ratios (OR) with 95% confidence intervals (CI). Results: Of the 4- to 12-year-olds, 22.1% had a low vegetable consumption and 11.9% a low fruit consumption. Low (OR 2.51; 95%CI: 2.05, 3.07) and intermediate (OR 1.83; 95%CI: 1.54, 2.17) parental education, material deprivation (OR 1.45; 95%CI: 1.19, 1.76), low NSES (OR 1.28; 95%CI: 1.04, 1.58) and a non-Western migrant status (OR 1.94; 95%CI: 1.66, 2.26) were associated with a higher risk of a low vegetable consumption. Low (OR 1.68; 95%CI: 1.31, 2.17) and intermediate (OR 1.39; 95%CI: 1.12, 1.72) parental education and material deprivation (OR 1.63; 95%CI: 1.27, 2.08) were also associated with a higher risk of a low fruit consumption. Conclusion: Our findings indicate associations of multiple SES indicators and migrant status with a higher risk of a low vegetable/fruit consumption in children and thus help to identify potential target groups.
- Research Article
5
- 10.1007/s10389-011-0488-1
- Jan 20, 2012
- Journal of Public Health
Data regarding infectious diseases in migrant populations in Europe are scarce. We aimed to assess whether information on migration status is collected in countries of the European Union (EU) as part of their national surveillance systems for major infectious diseases (HIV/AIDS, tuberculosis (TB) and hepatitis B infection). Using different electronic sources we collected information about whether indicators of migration status were collected in national infectious diseases surveillance systems in European countries. Of 27 EU countries, migration status was recorded in all 27 countries for TB surveillance (100%), in 22 countries for HIV (~82%) and in 23 countries for AIDS (~85%). Eight of 20 countries (40%) recorded migration status in hepatitis B surveillance systems. The most commonly recorded indicator of migration status was country of birth. Among countries which conducted migrant-specific surveillance, country of birth was collected in ~82% of TB, ~86% of HIV, and ~83% of AIDS surveillance systems. Other indicators of migration status were ethnicity (used in HIV and AIDS surveillance) and citizenship (TB surveillance). We showed differences in how migration status is recorded in surveillance systems from European countries. This was especially true for tuberculosis and hepatitis B, whereas data collection as part of HIV/AIDS surveillance was nearly uniform. These results suggest the need for more uniform reporting of migration status as part of infectious disease surveillance in EU countries.
- Research Article
2
- 10.1504/ijbidm.2007.015486
- Jan 1, 2007
- International Journal of Business Intelligence and Data Mining
This research addresses the effects of the neural network s-Sigmoid function on Knowledge Discovery of Databases (KDD) in the presence of imprecise data. ANOVA testing and Tukey's Honestly Significant Difference statistics are conducted to investigate the impact of two factors: level of data missingness and imputation method. Data mining is based upon searching the concatenation of multiple databases that usually contain some amount of missing data along with a percentage of inaccurate data and noise. Therefore, analysis depends heavily on the accuracy of the database and on the chosen sample data to be used for model training and testing.
- Preprint Article
- 10.22004/ag.econ.109894
- Jul 26, 2011
Several imputation approaches using a large sample and different levels of censoring are compared and contrasted following a multiple imputation methodology. The study not only discusses these imputation approaches, but also quantifies differences in price variability before and after price imputation, evaluates the performance of each method, and estimates and compares parameters and elasticities from a complete demand system. The study’s findings reveal that small variability among the mean prices from the various imputation approaches may result in relatively larger variability among the underlying parameter estimates of interest and the ultimately desired measures. This suggests that selection bias may be avoided by validating the imputation approaches and choosing the imputation method based on an analysis of the ultimately desired measures.
- Research Article
3
- 10.35631/jistm.729001
- Dec 1, 2022
- Journal of Information System and Technology Management
Missing data are a recurring issue in psychology questionnaires when a respondent does not answer questions for personal reasons. In general, two types of imputation techniques are used to replace missing data: single imputation and multiple imputation (MI). The single imputation technique generates a single value to impute each missing value. The simplest methods of single imputation are mean, mode and median. In contrast, the multiple imputation technique imputes each missing value several times, resulting in multiple complete datasets. The most popular MI method that can deal with both numerical and categorical data types is predictive mean matching (PMM). The aim of this article is to compare and visualize how the mode imputation method in the single imputation technique leads to a biased data distribution and how the PMM method in the MI technique reduces this issue. Both methods, mode imputation and PMM, are often considered when dealing with categorical data types. Mode imputation replaces a missing value with the most frequent value of an item in a survey. Predictive mean matching, meanwhile, is an extension of the regression model that applies a donor selection strategy to replace a missing value. Bar charts of the results show that multiple imputation yields less discrepancy between the original distribution and the imputed distribution. Thus, in this research, it can be concluded that the PMM method in the MI technique shows a less biased distribution than the mode imputation method. A comparison of imputation methods with different missing rates on a survey dataset should be considered for future work.
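The distortion that mode imputation can introduce, as described in the abstract above, can be seen in a small Python sketch. The item responses below are made up for illustration and the helper names are hypothetical; PMM itself is not implemented here.

```python
from collections import Counter


def mode_impute(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def proportions(values):
    """Share of each category among the given values."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


# Hypothetical survey item: 5 observed answers and 5 missing ones.
item = ["agree"] * 3 + ["disagree"] * 2 + [None] * 5

before = proportions([v for v in item if v is not None])
after = proportions(mode_impute(item))
```

In this made-up item, the observed split between "agree" and "disagree" is 60% to 40%, but after mode imputation every missing answer becomes "agree", so the split shifts to 80% to 20%. A donor-based method such as PMM draws replacements from observed cases, so it does not pile all missing answers onto a single category.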
- Research Article
9
- 10.1371/journal.pone.0138923
- Sep 28, 2015
- PLOS ONE
Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.
- Research Article
3
- 10.1177/01979183221084333
- Apr 18, 2022
- International Migration Review
The self-reported number of workdays missed due to injury or illness, or sick days, is a reliable measure of health among working-aged adults. Although sick days is a relatively underexplored health-related outcome in migration studies, it can provide a multidimensional understanding of immigrant wellbeing and integration. Current understandings of the association between migration status and sick days are limited for two reasons. First, in the United States, few nationally representative surveys collect migration status information. Second, researchers lack consensus on the most reliable approach for assigning migration status. We use the 2008 Survey of Income and Program Participation (SIPP) to examine sick days and draw comparisons between two methods for assigning migration status—a logical approach and a survey approach. The logical method assigns migration status to foreign-born respondents based on characteristics such as government employment or welfare receipt, while the survey approach relies on self-reported survey responses. Sick days among immigrants was correlated with and predicted by other health conditions available in the SIPP. Comparisons of sick days by migration status vary based on migration assignment approach. Lawful Permanent Residents (LPRs) reported more sick days than non-LPRs and appear less healthy when migration status is assigned using the logical approach. The logical approach also produced a gap in sick days between LPRs and non-LPRs that is not replicated in the survey approach. The results demonstrate that if migration status is not measured directly in the data, interpretation of migration status effects should proceed cautiously.
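To make the contrast in the abstract above concrete, the following Python sketch shows what a logic-based assignment rule might look like next to a self-report. The specific rule, field names, and labels are illustrative assumptions only; the abstract names government employment and welfare receipt as example characteristics but does not give the actual algorithm.

```python
def assign_status_logical(person):
    """Assign a status label from observed characteristics.

    Illustrative rule (an assumption, not the published algorithm):
    foreign-born respondents with government employment or welfare
    receipt are labelled "LPR"; other foreign-born respondents are
    labelled "non-LPR". US-born respondents are not assigned a status.
    """
    if not person["foreign_born"]:
        return None
    if person.get("government_employee") or person.get("receives_welfare"):
        return "LPR"
    return "non-LPR"


def assign_status_survey(person):
    """Use the respondent's own report when the survey collects one."""
    return person.get("self_reported_status")


# Hypothetical respondent whose two assignments disagree.
respondent = {
    "foreign_born": True,
    "government_employee": False,
    "receives_welfare": False,
    "self_reported_status": "LPR",
}
logical = assign_status_logical(respondent)
reported = assign_status_survey(respondent)
```

Cases like this hypothetical respondent, where the rule-based label and the self-report differ, are what allow group comparisons to change with the assignment approach.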
- Research Article
47
- 10.1017/s2045796017000142
- Apr 10, 2017
- Epidemiology and Psychiatric Sciences
Inequalities in mental health are well documented using individual social statuses such as socioeconomic status (SES), ethnicity and migration status. However, few studies have taken an intersectional approach to investigate inequalities in mental health using latent class analysis (LCA). This study will examine the association between multiple indicator classes of social identity with common mental disorder (CMD). Data on CMD symptoms were assessed in a diverse inner London sample of 1052 participants in the second wave of the South East London Community Health study. LCA was used to define classes of social identity using multiple indicators of SES, ethnicity and migration status. Adjusted associations between CMD and both individual indicators and multiple indicators of social identity are presented. LCA identified six groups that were differentiated by varying levels of privilege and disadvantage based on multiple SES indicators. This intersectional approach highlighted nuanced differences in odds of CMD, with the economically inactive group with multiple levels of disadvantage most likely to have a CMD. Adding ethnicity and migration status further differentiated between groups. The migrant, economically inactive and White British, economically inactive classes both had increased odds of CMD. This is the first study to examine the intersections of SES, ethnicity and migration status with CMD using LCA. Results showed that both the migrant, economically inactive and the White British, economically inactive classes had a similarly high prevalence of CMD. Findings suggest that LCA is a useful methodology for investigating health inequalities by intersectional identities.
- Research Article
4
- 10.1186/s12874-024-02305-3
- Aug 30, 2024
- BMC Medical Research Methodology
Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we applied data from a real-world breast cancer survival study. This comprehensive approach aimed to provide a thorough assessment of the effectiveness and reliability of ML imputation methods across various analytical perspectives. A simulated dataset with 30% Missing At Random (MAR) values was used. A number of single imputation (SI) methods - specifically KNN, missMDA, CART, missForest, missRanger, missCforest - and multiple imputation (MI) methods - specifically miceCART and miceRF - were evaluated. The performance metrics used were Gower’s distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. The analysis revealed that in terms of Gower’s distance, CART and missForest were the most accurate, while missMDA and CART excelled for binary covariates; missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores. 
Despite offering better predictive accuracy, the study found that SI methods introduced more bias into the regression coefficients compared to MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.