Extremely unbalanced data here refers to datasets in which the values of independent or dependent variables are severely imbalanced in their proportions, for example an extremely unbalanced case-control ratio, a very low disease incidence rate, heavily censored time-to-event data, or low-frequency and rare variants. In such scenarios, test statistics derived from classical statistical methods, e.g., the logistic regression model and the Cox proportional hazards regression model, may deviate from their theoretical asymptotic distributions, resulting in inflated or deflated type I error rates. With the increasing availability and exploration of resources from large-scale population cohorts in genome-wide association studies (GWAS), there is a growing demand for effective and accurate statistical approaches to handle extremely unbalanced data in both independent and non-independent samples. Our study first introduces the classical statistical methods used in statistical genetics and then summarizes, through simulation experiments, how these methods fail on extremely unbalanced data, in order to draw researchers' attention to extremely unbalanced data in GWAS.