Abstract
For a binary regression model with observed responses (Ys), specified predictor vectors (Xs), an assumed model parameter vector (β) and a case probability function (Pr(Y = 1|X, β)), we propose a simple screening method to test goodness-of-fit when the number of observations (n) is large and the Xs are continuous variables. Given any threshold τ ∈ [0, 1], we classify each subject with predictor X into Y* = 1 or 0 (a deterministic binary variable, as distinct from the observed random binary variable Y) according to whether the case probability Pr(Y = 1|X, β) calculated under the hypothesized true model is ≥ τ or < τ. For each τ, we compare the expected marginal classification error rate (false positives [Y* = 1, Y = 0] or false negatives [Y* = 0, Y = 1]) under the hypothesized true model with the observed marginal error rate, which is directly observed under this classification rule. The screening profile is created by plotting the τ-specific marginal error rates (expected and observed) versus τ ∈ [0, 1]. Inconsistency indicates lack of fit and consistency indicates good model fit. We note that the variation of the difference between the expected marginal classification error rate and the observed one is constant (O(n^{-1/2})) and free of τ. This minimal, homogeneous variation at each τ potentially allows flexible model discrepancies to be detected with high power. A simulation study shows that this profile approach, named CERC (classification-error-rate-calibration), is useful for detecting a wrong parameter value, an incorrect subset of predictor vector components and link function misspecification. We also provide some theoretical results as well as numerical examples to show that the ROC (receiver operating characteristic) curve is not suitable for binary model goodness-of-fit testing.
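The claim that the variation is free of τ can be made concrete with a short derivation. This is a sketch under stated assumptions, not the authors' own formulation: the error rate is taken to count false positives and false negatives together, responses are assumed independent across subjects given the Xs, and the symbols p_i and Y_i^*(τ), as well as the abbreviations EMCER and OMCER for the expected and observed rates, are notation chosen here for illustration.

```latex
\begin{align*}
p_i &= \Pr(Y_i = 1 \mid X_i, \beta), \qquad Y_i^{*}(\tau) = \mathbf{1}\{p_i \ge \tau\}, \\
\mathrm{OMCER}(\tau) &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{Y_i \ne Y_i^{*}(\tau)\}, \\
\mathrm{EMCER}(\tau) &= \frac{1}{n}\sum_{i=1}^{n} \left[ Y_i^{*}(\tau)\,(1 - p_i) + \bigl(1 - Y_i^{*}(\tau)\bigr)\, p_i \right], \\
\operatorname{Var}\bigl(\mathrm{OMCER}(\tau) - \mathrm{EMCER}(\tau)\bigr) &= \frac{1}{n^{2}}\sum_{i=1}^{n} p_i\,(1 - p_i) \;\le\; \frac{1}{4n}.
\end{align*}
```

Under the hypothesized model, each misclassification indicator is a Bernoulli variable with success probability either p_i or 1 - p_i, and both cases have variance p_i(1 - p_i). The threshold therefore drops out of the variance, and the standard deviation is at most 1/(2√n), in line with the stated O(n^{-1/2}) rate.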
Highlights
Introduction
Given observed responses (Ys), predictors (Xs), an assumed true model parameter (β) and a case probability function (Pr), we obtain two important values: 1) the expected marginal classification error rate (EMCER), i.e., the probability that a randomly selected subject out of the n individuals would be “expected” to be misclassified under the assumed true model; and 2) the observed marginal classification error rate (OMCER), i.e., the probability that a randomly selected subject out of the n individuals is “observed” to be misclassified based on the classification rule.
Weichung Joe Shih and Junfeng Liu
For linear regression models with parameter vector (β), k-dimensional predictor vectors (Xi, i = 1, . . . , n) and responses (Yi, i = 1, . . . , n), hypothesis testing often involves a null (H0) and an alternative (Ha) hypothesis.
The motivation comes from subject classification by case (Y = 1) probability Pr(Y = 1|X, β), where Y is the random binary response for any subject with predictor vector X and parameter vector β, and the probability function Pr for the Bernoulli trial can be of any form with range [0, 1].
Summary
Given observed responses (Ys), predictors (Xs), an assumed true model parameter (β) and a case probability function (Pr), we obtain two important values: 1) the expected marginal classification error rate (EMCER), i.e., the probability that a randomly selected subject out of the n individuals would be “expected” to be misclassified under the assumed true model; and 2) the observed marginal classification error rate (OMCER), i.e., the probability that a randomly selected subject out of the n individuals is “observed” to be misclassified based on the classification rule. Any pairwise τ-specific error rate difference beyond the 95% error bound (a significant difference) pinpoints a discrepancy between the assumed true model and the observed binary responses. Among these τs, a large proportion of significant differences indicates bad model fit; otherwise, a good model fit is very likely. CERC enjoys minimal-variation homogeneity across thresholds (τs) and applies well to binary regression model goodness-of-fit testing under a large sample size (n) and continuous predictor variables (Xs). The rest of this article is organized as follows: Section 2 introduces subject classification by case probability for the binary regression model; Section 3 develops a classification-based criterion for the binary model goodness-of-fit test; Section 4 demonstrates the usefulness of this simple approach by simulations; Section 5 develops some theoretical results along with numerical examples to show that the ROC curve cannot be used for binary regression model goodness-of-fit testing; and Section 6 concludes with a discussion.
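The EMCER/OMCER comparison described above can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' implementation: the function name `cerc_profile`, the logistic link, the parameter values, the sample size and the 1.96 standard-error bound are all assumptions made for the example, and the error rate is assumed to count false positives and false negatives together.

```python
import numpy as np

def cerc_profile(p, y, taus):
    """Expected and observed marginal classification error rates per threshold.

    p    : case probabilities Pr(Y=1 | X, beta) under the hypothesized model
    y    : observed binary responses (0/1)
    taus : thresholds in [0, 1]
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=int)
    taus = np.asarray(taus, dtype=float)
    # Deterministic labels Y* = 1 when p >= tau; shape (len(taus), n).
    y_star = (p[None, :] >= taus[:, None]).astype(int)
    # Under the hypothesized model, P(Y != Y*) is 1 - p if Y* = 1, else p.
    expected = np.where(y_star == 1, 1.0 - p, p).mean(axis=1)
    observed = (y_star != y[None, :]).mean(axis=1)
    # Standard error of (observed - expected); it does not involve tau.
    se = np.sqrt(np.sum(p * (1.0 - p))) / p.size
    return expected, observed, se

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated data from a logistic model (hypothetical parameter values).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
p_true = logistic(0.5 + 1.0 * x)
y = rng.binomial(1, p_true)
taus = np.linspace(0.0, 1.0, 21)

# Profile under the data-generating parameters.
exp_ok, obs_ok, se_ok = cerc_profile(p_true, y, taus)
# Profile under a deliberately wrong parameter vector.
p_wrong = logistic(-0.5 + 2.0 * x)
exp_bad, obs_bad, se_bad = cerc_profile(p_wrong, y, taus)

# Share of thresholds whose difference exceeds the approximate 95% bound.
frac_ok = np.mean(np.abs(obs_ok - exp_ok) > 1.96 * se_ok)
frac_bad = np.mean(np.abs(obs_bad - exp_bad) > 1.96 * se_bad)
print("fraction flagged, assumed model correct:", frac_ok)
print("fraction flagged, assumed model wrong:  ", frac_bad)
```

Plotting `expected` and `observed` against `taus` would give the screening profile; the single standard error shared by all thresholds reflects the minimal-variation homogeneity noted above. The differences at neighbouring thresholds are strongly correlated, so the flagged fraction is a screening summary rather than a formal multiple-testing procedure.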