Fair conformal prediction for incomplete covariate data
- Conference Article
- 10.1109/icqr2mse.2011.5976602
- Jun 1, 2011
This paper presents a new procedure to perform survival analysis when some covariate data are not available. A neural network hazard model is used to model the relationship between the covariates and the hazard. To account for incomplete covariates, the hidden layer target data are represented as binary random variables. This allows the training of the two-layer neural network hazard model to be decomposed into the training of two single-layer structures. Training the input-hidden structure then becomes a logistic estimation problem in which part of the input and all of the output (the hidden layer targets) are missing. However, this logistic estimation has two major problems. It requires an assumption about the distribution of the partially observed covariates. In addition, estimation of the logistic function becomes complicated when the input data have missing values. Therefore, instead of the logistic function, the general location model is adopted to represent the mixed data set that involves missing values. Training the input-hidden structure thus becomes maximisation of the likelihood of mixed continuous data (covariates) and categorical data (hidden layer targets) within the general location model. The hidden layer targets link the two single-layer structures and are updated iteratively. After each update, the expected values of the hidden layer targets are used to train the hidden-output structure of the neural network hazard model. This structure is now the same as a generalised linear model (GLM) and is trained by the iteratively reweighted least squares (IRLS) approach. Training of the input-hidden and hidden-output structures iterates until the estimates converge. The new approach is applied to a group of bearing data. Parts of the data are deleted deliberately to create different realisations of an incomplete covariate set. The numerical study demonstrates that the new approach is capable of handling incomplete covariate data in survival analysis and that its results outperform those of conventional approaches to handling incomplete covariates.
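The abstract above mentions training a GLM by iteratively reweighted least squares. As a rough illustration of that general technique only, the sketch below fits a plain logistic GLM with IRLS on synthetic data. It is not the paper's neural network hazard model; the function name `irls_logistic`, the simulated data, and the true coefficients are all assumptions made for the example.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic GLM by iteratively reweighted least squares (illustrative)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                              # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))             # fitted probabilities
        w = np.clip(mu * (1.0 - mu), 1e-10, None)   # IRLS weights (variance function)
        z = eta + (y - mu) / w                      # working response
        # Weighted least squares step: solve (X' W X) beta = X' W z
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data (assumed for illustration): intercept plus one covariate
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = irls_logistic(X, y)
# At convergence the score X'(y - mu) should be close to zero
score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_hat)))
```

Each iteration is an ordinary weighted least squares solve, which is why the hidden-output step described in the abstract can reuse standard GLM machinery once the hidden layer targets are replaced by their expected values.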
- Research Article
36
- 10.1002/(sici)1097-0258(19970115)16:1<57::aid-sim471>3.0.co;2-s
- Jan 15, 1997
- Statistics in Medicine
In evaluating prognostic factors by means of regression models, missing values in the covariate data are a frequent complication. There exist statistical tools to analyse such incomplete data in an efficient manner, and in this paper we make use of the traditional maximum likelihood principle. As well as an analysis including the incompletely measured covariates, such tools also allow further strategies of data analysis. For example, we can use surrogate variables to improve the prediction of missing values or we can try to investigate a questionable "missing at random" assumption. We discuss these techniques using the example of a clinical study where one important covariate is missing for about half the subjects. Additionally we consider two further issues: evaluation of differences between estimates from a complete case analysis and analyses using all subjects and assessment of the predictive value of missing values.
- Research Article
12
- 10.1016/j.jmva.2010.06.010
- Jun 17, 2010
- Journal of Multivariate Analysis
Multivariate logistic regression with incomplete covariate and auxiliary information
- Research Article
5
- 10.1002/sim.5581
- Aug 28, 2012
- Statistics in Medicine
Studies of chronic diseases routinely sample individuals subject to conditions on an event time of interest. In epidemiology, for example, prevalent cohort studies aiming to evaluate risk factors for survival following onset of dementia require subjects to have survived to the point of screening. In clinical trials designed to assess the effect of experimental cancer treatments on survival, patients are required to survive from the time of cancer diagnosis to recruitment. Such conditions yield samples featuring left-truncated event time distributions. Incomplete covariate data often arise in such settings, but standard methods do not deal with the fact that individuals' covariate distributions are also affected by left truncation. We describe an expectation-maximization algorithm for dealing with incomplete covariate data in such settings, which uses the covariate distribution conditional on the selection criterion. We describe an extension to deal with subgroup analyses in clinical trials for the case in which the stratification variable is incompletely observed.
- Research Article
60
- 10.1198/jasa.2010.tm08551
- Mar 1, 2010
- Journal of the American Statistical Association
Longitudinal studies often feature incomplete response and covariate data. It is well known that biases can arise from naive analyses of available data, but the precise impact of incomplete data depends on the frequency of missing data and the strength of the association between the response variables and covariates and the missing-data indicators. Various factors may influence the availability of response and covariate data at scheduled assessment times, and at any given assessment time the response may be missing, covariate data may be missing, or both response and covariate data may be missing. Here we show that it is important to take the association between the missing data indicators for these two processes into account through joint models. Inverse probability-weighted generalized estimating equations offer an appealing approach for doing this. Here we develop these equations for a particular model generating intermittently missing-at-random data. Empirical studies demonstrate that the consistent estimators arising from the proposed methods have very small empirical biases in moderate samples. Supplemental materials are available online.
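The weighting idea behind the abstract above can be shown in a much simpler setting than generalized estimating equations. The sketch below compares a complete-case mean with an inverse probability-weighted mean when the response is missing at random given a covariate. The data-generating model, the variable names, and the use of the true observation probabilities (rather than estimated ones) are all assumptions for illustration; this is not the authors' estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Assumed toy model: the response depends on a fully observed covariate x
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)          # true mean of y is 2.0

# Missing at random: the chance of observing y depends only on x
p_obs = 1.0 / (1.0 + np.exp(-x))
observed = rng.random(n) < p_obs

# Complete-case mean ignores the missingness mechanism
cc_mean = y[observed].mean()

# Inverse probability weighting: each observed y stands in for 1/p_obs subjects
w = 1.0 / p_obs[observed]
ipw_mean = np.sum(w * y[observed]) / np.sum(w)
```

Because subjects with large `x` are both more likely to be observed and tend to have larger `y`, the complete-case mean is pulled upward, while the weighted mean compensates by up-weighting the under-represented subjects.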
- Research Article
37
- 10.1093/biomet/asn020
- Feb 4, 2008
- Biometrika
The R package spatstat provides a very flexible and useful framework for analysing spatial point patterns. A fundamental feature is a procedure for fitting spatial point process models depending on covariates. However, in practice one often faces incomplete observation of the covariates and this leads to parameter estimation error which is difficult to quantify. In this paper, we introduce a Monte Carlo version of the estimating function used in spatstat for fitting inhomogeneous Poisson processes and certain inhomogeneous cluster processes. For this modified estimating function, it is feasible to obtain the asymptotic distribution of the parameter estimators in the case of incomplete covariate information. This allows a study of the loss of efficiency due to the missing covariate data.
- Research Article
18
- 10.1590/s0102-311x2011001200003
- Dec 1, 2011
- Cadernos de Saúde Pública
Researchers in the health field often deal with the problem of incomplete databases. Complete Case Analysis (CCA), which restricts the analysis to subjects with complete data, reduces the sample size and may result in biased estimates. Based on statistical grounds, Multiple Imputation (MI) uses all collected data and is recommended as an alternative to CCA. Data from the study Saúde em Beagá, attended by 4,048 adults from two of nine health districts in the city of Belo Horizonte, Minas Gerais State, Brazil, in 2008-2009, were used to evaluate CCA and different MI approaches in the context of logistic models with incomplete covariate data. Peculiarities in some variables in this study allowed analyzing a situation in which the missing covariate data are recovered and thus the results before and after recovery are compared. Based on the analysis, even the more simplistic MI approach performed better than CCA, since it was closer to the post-recovery results.
- Research Article
218
- 10.1093/biomet/82.2.299
- Jan 1, 1995
- Biometrika
We consider regression analysis when incomplete or auxiliary covariate data are available for all study subjects and, in addition, for a subset called the validation sample, true covariate data of interest have been ascertained. The term auxiliary data refers to data not in the regression model, but thought to be informative about the true missing covariate data of interest. We discuss a method which is nonparametric with respect to the association between available and missing data, allows missingness to depend on available response and covariate values, and is applicable to both cohort and case-control study designs. The method previously proposed by Flanders & Greenland (1991) and by Zhao & Lipsitz (1992) is generalised and asymptotic theory is derived. Our expression for the asymptotic variance of the estimator provides intuition regarding performance of the method. Optimal sampling strategies for the validation set are also suggested by the asymptotic results.
- Research Article
37
- 10.1111/j.0006-341x.2001.00034.x
- Mar 1, 2001
- Biometrics
This article presents a new method for maximum likelihood estimation of logistic regression models with incomplete covariate data where auxiliary information is available. This auxiliary information is extraneous to the regression model of interest but predictive of the covariate with missing data. Ibrahim (1990, Journal of the American Statistical Association 85, 765-769) provides a general method for estimating generalized linear regression models with missing covariates using the EM algorithm that is easily implemented when there is no auxiliary data. Vach (1997, Statistics in Medicine 16, 57-72) describes how the method can be extended when the outcome and auxiliary data are conditionally independent given the covariates in the model. The method allows the incorporation of auxiliary data without making the conditional independence assumption. We suggest tests of conditional independence and compare the performance of several estimators in an example concerning mental health service utilization in children. Using an artificial dataset, we compare the performance of several estimators when auxiliary data are available.
- Research Article
- 10.5282/ubm/epub.1499
- Jan 1, 1998
Maximum likelihood estimation of regression parameters with incomplete covariate information usually requires a distributional assumption about the concerned covariates which implies a source of misspecification. Semiparametric procedures avoid such assumptions at the expense of efficiency. A simulation study is carried out to get an idea of the performance of the maximum likelihood estimator under misspecification and to compare the semiparametric procedures with the maximum likelihood estimator when the latter is based on a correct assumption.
- Research Article
3
- 10.1111/1467-9574.t01-1-00059
- Aug 1, 2002
- Statistica Neerlandica
ML estimation of regression parameters with incomplete covariate information usually requires a distributional assumption regarding the concerned covariates, which introduces a source of misspecification. Semiparametric procedures avoid such assumptions at the expense of efficiency. In this paper a simulation study with small sample size is carried out to get an idea of the performance of the ML estimator under misspecification and to compare it with the semiparametric procedures when the former is based on a correct assumption. The results show that correct parametric assumptions yield only a small gain, which does not justify the possibly large bias when the assumptions are not met. Additionally, a simple modification of the complete case estimator appears to be nearly semiparametric efficient.
- Research Article
60
- 10.2307/2533852
- Sep 1, 1998
- Biometrics
Incomplete covariate data is a common occurrence in many studies in which the outcome is survival time. When a full likelihood is specified, a useful technique for obtaining parameter estimates is the EM algorithm. We propose a set of estimating equations to estimate the parameters of Cox's proportional hazards model when some covariate values are missing. These estimating equations can be solved by an algorithm similar to the EM algorithm. Because of the computational burden of finding a solution to these estimating equations, we propose obtaining parameter estimates via Monte Carlo methods. Asymptotic variances of the parameter estimates are also derived. We present a clinical trials example with three covariates, two of which have some missing values.
- Research Article
45
- 10.1007/bf00128467
- Jan 1, 1996
- Lifetime Data Analysis
Incomplete covariate data is a common occurrence in many studies in which the outcome is survival time. With generalized linear models, when the missing covariates are categorical, a useful technique for obtaining parameter estimates is the EM by the method of weights proposed in Ibrahim (1990). In this article, we extend the EM by the method of weights to survival outcomes whose distributions may not fall in the class of generalized linear models. This method requires the estimation of the parameters of the distribution of the covariates. We present a clinical trials example with five covariates, four of which have some missing values.
- Research Article
61
- 10.1001/jamaneurol.2022.4397
- Dec 5, 2022
- JAMA Neurology
Importance: Although consumption of ultraprocessed food has been linked to higher risk of cardiovascular disease, metabolic syndrome, and obesity, little is known about the association of consumption of ultraprocessed foods with cognitive decline.
Objective: To investigate the association between ultraprocessed food consumption and cognitive decline in the Brazilian Longitudinal Study of Adult Health.
Design, Setting, and Participants: This was a multicenter, prospective cohort study with 3 waves, approximately 4 years apart, from 2008 to 2017. Data were analyzed from December 2021 to May 2022. Participants were public servants aged 35 to 74 years old recruited in 6 Brazilian cities. Participants who, at baseline, had incomplete food frequency questionnaire, cognitive, or covariate data were excluded. Participants who reported extreme calorie intake (<600 kcal/day or >6000 kcal/day) and those taking medication that could negatively interfere with cognitive performance were also excluded.
Exposures: Daily ultraprocessed food consumption as a percentage of total energy divided into quartiles.
Main Outcomes and Measures: Changes in cognitive performance over time evaluated by the immediate and delayed word recall, word recognition, phonemic and semantic verbal fluency tests, and Trail-Making Test B version.
Results: A total of 15 105 individuals were recruited and 4330 were excluded, leaving 10 775 participants whose data were analyzed. The mean (SD) age at the baseline was 51.6 (8.9) years, 5880 participants (54.6%) were women, 5723 (53.1%) were White, and 6106 (56.6%) had at least a college degree. During a median (range) follow-up of 8 (6-10) years, individuals with ultraprocessed food consumption above the first quartile showed a 28% faster rate of global cognitive decline (β = -0.004; 95% CI, -0.006 to -0.001; P = .003) and a 25% faster rate of executive function decline (β = -0.003, 95% CI, -0.005 to 0.000; P = .01) compared with those in the first quartile.
Conclusions and Relevance: A higher percentage of daily energy consumption of ultraprocessed foods was associated with cognitive decline among adults from an ethnically diverse sample. These findings support current public health recommendations on limiting ultraprocessed food consumption because of their potential harm to cognitive function.
- Research Article
216
- 10.1002/(sici)1097-0258(19970215)16:3<259::aid-sim484>3.0.co;2-s
- Feb 15, 1997
- Statistics in Medicine
Since Wu and Carroll (Biometrics 44, 175-188) proposed a model for longitudinal progression in the presence of informative dropout, several researchers have developed and studied models for situations where both a vector of repeated outcomes and an event time are available for each subject. These models have been developed either for longitudinal studies with dropout or for survival studies in which a random, time-varying covariate is measured repeatedly across time. When inference about the longitudinal variable is of interest, event times are treated as covariates and are often incomplete due to censoring. If survival or event time is the primary endpoint, repeated outcomes observed prior to the event are viewed as covariates; this covariate process is often incomplete, measured with error, or observed at unscheduled times during the study. We review several models which are used to handle incomplete response and covariate data in both survival and longitudinal studies.