Deriving the Distribution and Exploring the Utility of Partial $$R^2$$ in the Era of Big Data

Gregory S Hawk, Katherine L Thompson

doi:10.1007/s44199-024-00074-y

Abstract

A central goal in the world of statistics and data science is the construction of linear regression models for continuous variables of interest. Often, our objective is to examine the impact of one or more explanatory variables, after adjusting for demographic covariates or other known/relevant factor(s). While the traditional approach is to use hypothesis testing to determine statistical significance, the p-values obtained are heavily dependent on sample size. This is particularly problematic for large datasets or “overpowered” studies, where even the tiniest of effects will appear to be highly significant. Computing capabilities and cloud-enhanced data sharing have revolutionized the way we use data worldwide, from healthcare and investments to manufacturing and retail. While machine learning and artificial intelligence are improving predictive analytics, we need better statistical inference to help understand and translate our models into meaningful and actionable insights. The coefficient of partial determination (or partial $$R^{2}$$) is widely used in applied science to supplement hypothesis testing, but little work has been done to understand its statistical properties. In this work, we derive the complete distribution of partial $$R^{2}$$ and perform simulated and real-world data analyses to show the advantages of adding it to your next analysis of Big Data.
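To make the contrast between statistical significance and effect size concrete, the following is a minimal sketch rather than the authors' own code. It assumes the standard textbook definition of partial $$R^{2}$$ as the proportional reduction in residual sum of squares when the variables of interest are added to a reduced (covariates-only) model. The function names (`sse`, `partial_r2`, `f_statistic`, `simulate`), the simulated data, and the effect size of 0.02 are all hypothetical choices made for illustration.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def partial_r2(X_reduced, X_full, y):
    """Proportion of the reduced model's residual variation explained by the
    extra columns in the full model: (SSE_reduced - SSE_full) / SSE_reduced."""
    sse_reduced = sse(X_reduced, y)
    sse_full = sse(X_full, y)
    return (sse_reduced - sse_full) / sse_reduced


def f_statistic(pr2, n, p_full, q):
    """Partial F statistic written in terms of partial R^2.
    n: sample size, p_full: columns in the full design matrix (with intercept),
    q: number of added explanatory variables being tested."""
    return (pr2 / q) / ((1.0 - pr2) / (n - p_full))


def simulate(n, effect=0.02, seed=0):
    """Hypothetical data: one adjustment covariate plus one exposure
    with a deliberately tiny true effect."""
    rng = np.random.default_rng(seed)
    covariate = rng.normal(size=n)
    exposure = rng.normal(size=n)
    y = 1.0 + 0.5 * covariate + effect * exposure + rng.normal(size=n)
    ones = np.ones(n)
    X_reduced = np.column_stack([ones, covariate])
    X_full = np.column_stack([ones, covariate, exposure])
    return X_reduced, X_full, y


if __name__ == "__main__":
    for n in (200, 20_000, 200_000):
        X_r, X_f, y = simulate(n)
        pr2 = partial_r2(X_r, X_f, y)
        f = f_statistic(pr2, n=n, p_full=3, q=1)
        print(f"n={n:>7}  partial R^2={pr2:.5f}  F={f:.2f}")
```

Under these assumptions, the F statistic tends to grow with the sample size for a fixed tiny effect, while partial $$R^{2}$$ stays small at every sample size, which is the behavior that motivates reporting it alongside the p-value.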

Full Text