A solution to minimum sample size for regressions.

David G Jenkins, Pedro F Quintana-Ascencio

doi:10.1371/journal.pone.0229345

Open Access

Journal: PLOS ONE
Publication Date: Feb 21, 2020
Citations: 371
License type: CC BY 4.0

Affiliation: University of Central Florida

Abstract

Regressions and meta-regressions are widely used to estimate patterns and effect sizes in various disciplines. However, many biological and medical analyses use relatively low sample size (N), contributing to concerns about reproducibility. What is the minimum N to identify the most plausible data pattern using regressions? Statistical power analysis is often used to answer that question, but it has its own problems and logically should follow model selection to first identify the most plausible model. Here we make null, simple linear and quadratic data with different variances and effect sizes. We then sample and use information theoretic model selection to evaluate minimum N for regression models. We also evaluate the use of the coefficient of determination (R²) for this purpose; it is widely used but not recommended. With very low variance, both false positives and false negatives occurred at N < 8, but data shape was always clearly identified at N ≥ 8. With high variance, accurate inference was stable at N ≥ 25. Those outcomes were consistent at different effect sizes. Akaike Information Criterion weights (AICc wi) were essential to clearly identify patterns (e.g., simple linear vs. null); R² or adjusted R² values were not useful. We conclude that a minimum N = 8 is informative given very little variance, but minimum N ≥ 25 is required for more variance. Alternative models are better compared using information theory indices such as AIC but not R² or adjusted R². Insufficient N and R²-based model selection apparently contribute to confusion and low reproducibility in various disciplines. To avoid those problems, we recommend that research based on regressions or meta-regressions use N ≥ 25.
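
The procedure described in the abstract (simulate data of a known shape, sample N points, then compare null, linear and quadratic fits by AICc weights) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' code: the function names, the x range of 0 to 10, the slope, the noise levels and the number of trials are all assumptions chosen for the example, and AICc is computed with the common least-squares form n·ln(RSS/n) + 2k plus the small-sample correction.

```python
import numpy as np

def aicc(y, X):
    """AICc of an ordinary least-squares fit, assuming Gaussian errors."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus the error variance
    aic = n * np.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)  # needs n > k + 1

def akaike_weights(scores):
    """Convert AICc scores into weights that sum to 1."""
    delta = np.asarray(scores, dtype=float) - np.min(scores)
    w = np.exp(-0.5 * delta)
    return w / w.sum()

def model_weights(x, y):
    """AICc weights for null, simple linear and quadratic models."""
    ones = np.ones_like(x)
    designs = {
        "null": np.column_stack([ones]),
        "linear": np.column_stack([ones, x]),
        "quadratic": np.column_stack([ones, x, x ** 2]),
    }
    scores = [aicc(y, X) for X in designs.values()]
    return dict(zip(designs, akaike_weights(scores)))

def selection_rate(n, sigma, slope=1.0, trials=500, seed=0):
    """Fraction of simulated samples in which 'linear' gets the top weight.

    Data are generated as y = slope * x + noise (hypothetical settings);
    slope=0 gives null data, so the rate is then a false-positive rate.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0, 10, n)
        y = slope * x + rng.normal(0, sigma, n)
        w = model_weights(x, y)
        hits += max(w, key=w.get) == "linear"
    return hits / trials

if __name__ == "__main__":
    # Compare small and larger N at low and high noise (assumed values).
    for sigma in (0.5, 10.0):
        for n in (8, 25):
            print(f"sigma={sigma}, N={n}: {selection_rate(n, sigma):.2f}")
```

Calling `selection_rate` with `slope=0.0` gives a rough check on false positives under the same assumed settings. The quadratic model has k = 4 here, so the AICc correction term requires N of at least 6.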

Highlights

Limbo: (1) A place or state of neglect, oblivion, or uncertainty; (2) A dance or contest that involves bending over backwards to pass under a low horizontal bar.
All researchers seek to avoid their work being cast into the first definition of limbo, often by increasing sample size (N) and by applying increasingly sophisticated analytical techniques.
Low sample size contributes to problems of reproducibility, including false positives and false negatives, and apparently contributes to uncertainty in biology and medical sciences [8,9,11,12,14,16].

Summary

Introduction

All researchers seek to avoid their work being cast into the first definition of limbo, often by increasing sample size (N) and by applying increasingly sophisticated analytical techniques. In an era of big data, this may seem to be a problem of the past. It remains vital because multiple disciplines use data that are hard to acquire and/or aggregated. For example, it is difficult to collect data on species diversity among multiple islands with different areas. A similar problem occurs where data are aggregated, as in meta-analyses, systematic or quantitative reviews, and meta-regressions to evaluate general patterns across multiple studies (e.g., [1,2,3,4]). A regression computed with those aggregated data is called a meta-regression, and it rests on the same fundamental principles and assumptions as a regression of the island diversity data.

Methods

Results

Conclusion
