A Simple Method for Screening Binary Models with Large Sample Size and Continuous Predictor Variables

Weichung Joe Shih,Junfeng Liu

doi:10.6339/jds.2009.07(4).500

Abstract

For a binary regression model with observed responses (Ys), specified predictor vectors (Xs), an assumed model parameter vector (β), and a case probability function (Pr(Y = 1|X, β)), we propose a simple screening method to test goodness-of-fit when the number of observations (n) is large and the Xs are continuous variables. Given any threshold τ ∈ [0, 1], we classify each subject with predictor X as Y∗ = 1 or 0 (a deterministic binary variable, distinct from the observed random binary variable Y) according to whether the case probability Pr(Y = 1|X, β) calculated under the hypothesized true model is ≥ τ or < τ. For each τ, we compare the expected marginal classification error rate (false positives [Y∗ = 1, Y = 0] or false negatives [Y∗ = 0, Y = 1]) under the hypothesized true model with the observed marginal error rate, which is obtained directly from this classification rule. The screening profile is created by plotting the τ-specific marginal error rates (expected and observed) against τ ∈ [0, 1]. Inconsistency indicates lack of fit, and consistency indicates good model fit. We note that the variation of the difference between the expected and the observed marginal classification error rates is constant (O(n^{−1/2})) and free of τ. This smallest homogeneous variation at each τ can potentially detect flexible model discrepancies with high power. A simulation study shows that this profile approach, named CERC (classification-error-rate-calibration), is useful for detecting a wrong parameter value, an incorrect subset of predictor vector components, and link function misspecification. We also provide some theoretical results and numerical examples showing that the ROC (receiver operating characteristic) curve is not suitable for binary model goodness-of-fit testing.
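The claim that the variation is free of τ can be checked with a short derivation. The block below is a sketch and not the authors' own notation: it assumes that the marginal error rate pools false positives and false negatives, that the Y_i are independent Bernoulli variables with success probabilities p_i = Pr(Y_i = 1|X_i, β) under the hypothesized model, and it introduces the symbols p_i, OMCER(τ), and EMCER(τ) for illustration.

```latex
\begin{align*}
Y_i^*(\tau) &= \mathbf{1}\{p_i \ge \tau\}, \qquad p_i = \Pr(Y_i = 1 \mid X_i, \beta),\\
\mathrm{OMCER}(\tau) &= \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i^*(1 - Y_i) + (1 - Y_i^*)\,Y_i\bigr],\\
\mathrm{EMCER}(\tau) &= \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i^*(1 - p_i) + (1 - Y_i^*)\,p_i\bigr],\\
\mathrm{OMCER}(\tau) - \mathrm{EMCER}(\tau) &= \frac{1}{n}\sum_{i=1}^{n}(1 - 2Y_i^*)(Y_i - p_i),\\
\operatorname{Var}\bigl[\mathrm{OMCER}(\tau) - \mathrm{EMCER}(\tau)\bigr]
  &= \frac{1}{n^2}\sum_{i=1}^{n}(1 - 2Y_i^*)^2\,p_i(1 - p_i)
   = \frac{1}{n^2}\sum_{i=1}^{n}p_i(1 - p_i) \le \frac{1}{4n}.
\end{align*}
```

Because (1 − 2Y_i∗)² = 1 for either classification, the threshold drops out of the variance, and the standard deviation is at most 1/(2√n), which is consistent with the O(n^{−1/2}) rate stated above.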

Highlights

Given observed responses (Ys), predictors (Xs), assumed true model parameter (β) and case probability function (Pr), we will obtain two important values: 1) The expected marginal classification error rate (EMCER), i.e., the probability that a randomly selected subject out of the n individuals would be “expected” to be misclassified under the assumed true model; and 2) The observed marginal classification error rate (OMCER), i.e., the probability that a randomly selected subject out of n individuals is “observed” to be misclassified based on the classification rule.
For linear regression models with parameter vector (β), k-dimensional predictor vectors (Xi, i = 1, . . . , n) and responses (Yi, i = 1, . . . , n), hypothesis test often involves a null (H0) and an alternative (Ha) assumption.
The motivation comes from subject classification by case (Y = 1) probability Pr(Y = 1|X, β), where Y is the random binary response for any subject with predictor vector X and parameter vector β, and the probability function Pr for Bernoulli trial could be of any form with range [0, 1].

Summary

Introduction

Given observed responses (Ys), predictors (Xs), an assumed true model parameter (β), and a case probability function (Pr), we obtain two important values: 1) the expected marginal classification error rate (EMCER), i.e., the probability that a randomly selected subject out of the n individuals would be “expected” to be misclassified under the assumed true model; and 2) the observed marginal classification error rate (OMCER), i.e., the probability that a randomly selected subject out of the n individuals is “observed” to be misclassified based on the classification rule. Any pairwise τ-specific error rate difference beyond the 95% error bound (a significant difference) pinpoints a discrepancy between the assumed true model and the observed binary responses. Among these τs, a large proportion of significant differences indicates poor model fit; otherwise, a good model fit is very likely. CERC enjoys minimal-variation homogeneity across thresholds (τs) and applies well to goodness-of-fit testing of binary regression models with a large sample size (n) and continuous predictor variables (Xs).

The rest of this article is organized as follows: Section 2 introduces subject classification by case probability for binary regression models; Section 3 develops a classification-based criterion for binary model goodness-of-fit testing; Section 4 demonstrates the usefulness of this simple approach by simulations; Section 5 develops some theoretical results, along with numerical examples, to show that the ROC curve cannot be used for binary regression model goodness-of-fit testing; and Section 6 concludes with a discussion.
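The comparison of EMCER and OMCER across thresholds can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `cerc_profile`, the pooling of false positives and false negatives into one error rate, the normal-approximation bound of 1.96 standard deviations, and every simulation setting (logistic link, slopes, sample size, threshold grid) are choices made here for the example.

```python
import numpy as np

def cerc_profile(p, y, taus):
    """Expected vs. observed misclassification rate at each threshold.

    p    : case probabilities Pr(Y=1 | X, beta) under the hypothesized model
    y    : observed 0/1 responses
    taus : thresholds in [0, 1]
    Returns (emcer, omcer, bound), where bound is an approximate 95% error bound.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    taus = np.asarray(taus, dtype=float)
    n = p.size
    # y_star[j, i] = 1 when subject i is classified as a case at threshold taus[j]
    y_star = (p[None, :] >= taus[:, None]).astype(float)
    # Expected error: a predicted case errs with prob. 1 - p, a predicted non-case with prob. p
    emcer = (y_star * (1 - p) + (1 - y_star) * p).mean(axis=1)
    # Observed error: share of subjects whose classification disagrees with y
    omcer = (y_star != y[None, :]).mean(axis=1)
    # Assumption: independent Bernoulli responses, so the SD of (omcer - emcer)
    # is sqrt(sum p(1-p)) / n at every threshold
    sd = np.sqrt(np.sum(p * (1 - p))) / n
    return emcer, omcer, 1.96 * sd

# Simulated illustration (all settings below are hypothetical).
def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = rng.binomial(1, expit(1.0 * x))      # data generated with slope 1
taus = np.linspace(0.05, 0.95, 19)

flagged = {}
for slope in (1.0, 3.0):                 # hypothesized slope: correct, then wrong
    emcer, omcer, bound = cerc_profile(expit(slope * x), y, taus)
    # Share of thresholds whose difference exceeds the error bound
    flagged[slope] = float(np.mean(np.abs(omcer - emcer) > bound))
    print(f"slope={slope}: share of thresholds beyond the bound = {flagged[slope]:.2f}")
```

Plotting `emcer` and `omcer` against `taus` gives the screening profile described above; a large share of flagged thresholds points to a discrepancy between the hypothesized model and the data.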

Subject Classification by Case Probability

Result

Application to Screening Binary Models

Simulation Study

Distinction under same link functions

Distinction among different link functions

Discussion


Journal: Journal of Data Science	Publication Date: Jul 10, 2021
License type: cc-by
