Abstract

For a binary regression model with observed responses (Ys), specified predictor vectors (Xs), an assumed model parameter vector (β) and a case probability function (Pr(Y = 1|X, β)), we propose a simple screening method to test goodness-of-fit when the number of observations (n) is large and the Xs are continuous variables. Given any threshold τ ∈ [0, 1], we classify each subject with predictor X into Y* = 1 or 0 (a deterministic binary variable, distinct from the observed random binary variable Y) according to whether the case probability Pr(Y = 1|X, β) calculated under the hypothesized true model is ≥ τ or < τ. For each τ, we compare the expected marginal classification error rate (false positives [Y* = 1, Y = 0] or false negatives [Y* = 0, Y = 1]) under the hypothesized true model with the observed marginal error rate, which is observed directly under this classification rule. The screening profile is created by plotting the τ-specific marginal error rates (expected and observed) against τ ∈ [0, 1]. Inconsistency indicates lack of fit, and consistency indicates good model fit. We note that the variation of the difference between the expected and the observed marginal classification error rates is constant across τ, of order O(n^(-1/2)). This minimal, homogeneous variation at each τ potentially allows flexible model discrepancies to be detected with high power. A simulation study shows that this profile approach, named CERC (classification-error-rate calibration), is useful for detecting a wrong parameter value, an incorrect subset of predictor vector components, and link function misspecification. We also provide some theoretical results, as well as numerical examples, to show that the ROC (receiver operating characteristic) curve is not suitable for testing the goodness-of-fit of a binary model.

Highlights

  • Given observed responses (Ys), predictors (Xs), an assumed true model parameter (β) and a case probability function (Pr), we obtain two important values: 1) the expected marginal classification error rate (EMCER), i.e., the probability that a randomly selected subject out of the n individuals would be "expected" to be misclassified under the assumed true model; and 2) the observed marginal classification error rate (OMCER), i.e., the probability that a randomly selected subject out of the n individuals is "observed" to be misclassified based on the classification rule

  • Weichung Joe Shih and Junfeng Liu: For linear regression models with parameter vector (β), k-dimensional predictor vectors (Xi, i = 1, . . . , n) and responses (Yi, i = 1, . . . , n), a hypothesis test often involves a null (H0) and an alternative (Ha) assumption

  • The motivation comes from subject classification by case (Y = 1) probability Pr(Y = 1|X, β), where Y is the random binary response for any subject with predictor vector X and parameter vector β, and the probability function Pr for the Bernoulli trial may take any form with range [0, 1]


Summary

Introduction

Given observed responses (Ys), predictors (Xs), an assumed true model parameter (β) and a case probability function (Pr), we obtain two important values: 1) the expected marginal classification error rate (EMCER), i.e., the probability that a randomly selected subject out of the n individuals would be "expected" to be misclassified under the assumed true model; and 2) the observed marginal classification error rate (OMCER), i.e., the probability that a randomly selected subject out of the n individuals is "observed" to be misclassified based on the classification rule. Any pairwise τ-specific error rate difference beyond the 95% error bound (a significant difference) pinpoints a discrepancy between the assumed true model and the observed binary responses. Among these τs, a large proportion of significant differences indicates poor model fit; otherwise, a good model fit is very likely. CERC enjoys minimal-variation homogeneity across thresholds (τs) and applies well to goodness-of-fit testing of binary regression models with a large sample size (n) and continuous predictor variables (Xs). The rest of this article is organized as follows: Section 2 introduces subject classification by case probability for binary regression models; Section 3 develops a classification-based criterion for the binary model goodness-of-fit test; Section 4 demonstrates the usefulness of this simple approach through simulations; Section 5 develops some theoretical results, along with numerical examples, to show that the ROC curve cannot be used for the binary regression model goodness-of-fit test; and Section 6 concludes with a discussion.
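
To make the two quantities concrete, the following is a minimal Python sketch of a CERC-style profile. It is an illustration under stated assumptions and not the authors' implementation. The function name cerc_profile, the logistic model, and the parameter values are hypothetical. The sketch reads the marginal error rate as the total of false positives and false negatives, and it uses a normal-approximation 95% bound, 1.96 * sqrt(sum of p(1 - p)) / n, which follows from treating the responses as independent Bernoulli trials under the hypothesized model; the exact bound used in the paper may differ.

```python
import numpy as np

def cerc_profile(p, y, taus):
    """Return (EMCER, OMCER, bound) for hypothesized case probabilities p,
    observed binary responses y, and a grid of thresholds taus."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=int)
    n = p.size
    emcer = np.empty(len(taus))
    omcer = np.empty(len(taus))
    for k, tau in enumerate(taus):
        # Deterministic classification Y* = 1 if p >= tau, else 0.
        y_star = (p >= tau).astype(int)
        # Expected error under the hypothesized model:
        # a subject classified as 1 is wrong with probability 1 - p,
        # a subject classified as 0 is wrong with probability p.
        emcer[k] = np.mean(np.where(y_star == 1, 1.0 - p, p))
        # Observed error: fraction of subjects with Y != Y*.
        omcer[k] = np.mean(y_star != y)
    # Assumed normal-approximation 95% bound on OMCER - EMCER.
    # It does not depend on tau and shrinks like n^(-1/2).
    bound = 1.96 * np.sqrt(np.sum(p * (1.0 - p))) / n
    return emcer, omcer, bound

# Hypothetical example: data simulated from the hypothesized logistic model.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
beta0, beta1 = -0.5, 1.0  # illustrative parameter values
p_true = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
y = rng.binomial(1, p_true)

taus = np.linspace(0.0, 1.0, 21)
emcer, omcer, bound = cerc_profile(p_true, y, taus)
# Fraction of thresholds where the difference exceeds the bound.
frac_flagged = np.mean(np.abs(omcer - emcer) > bound)
```

Plotting emcer and omcer against taus gives the screening profile. When the hypothesized probabilities are replaced by those of a misspecified model (for example, a wrong beta1), the fraction of flagged thresholds would be expected to rise.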

Subject Classification by Case Probability
Result
Application to Screening Binary Models
Simulation Study
Distinction under same link functions
Distinction among different link functions
Discussion