Abstract

Ecological and epidemiological studies often rely on group-averaged data to make inferences about individual patterns. However, using correlations based on averages to estimate correlations of individual scores is subject to the "ecological fallacy". The purpose of this article is to construct the distributions of Pearson R correlation values computed from group-averaged (aggregate) data using Monte Carlo simulations and random sampling. We show that, as the group size increases, the distributions can be approximated by a distribution based on the generalized hypergeometric function. The expectation of the constructed distribution slightly underestimates the individual Pearson R value, but the difference becomes smaller as the number of groups increases. The approximate normal distribution resulting from Fisher's transformation can be used to build confidence intervals that approximate the Pearson R value based on individual scores from the Pearson R value based on the aggregated scores.

Highlights

  • The relationship between the Pearson R and regression coefficients computed from individual scores and those computed from group-averaged (aggregate) scores has been the subject of many papers [4, 6, 8]

  • We show through Monte Carlo simulations that the distribution of Rx,ȳ can be approximated with a function based on the generalized hypergeometric function

  • We have constructed the partition distribution of the Pearson Rx,ȳ coefficients generated from group averaged values using random sampling and Monte Carlo simulations

Summary

Introduction

The relationship between the Pearson R and regression coefficients computed from individual scores and those computed from group-averaged (aggregate) scores has been the subject of many papers [4, 6, 8]. Knapp [9] gives an equation relating the within-aggregate (i.e., within-group) correlation Rw, the correlation coefficient Rx,ȳ based on the group averages, and the correlation coefficient Rindividual based on the individual scores. Piantadosi, Byar and Green [12] analyze individual and aggregate correlations and regression slopes, and state that regression slopes based on aggregates are likely to approximate individual-level regression slopes more accurately than aggregate correlation coefficients approximate individual-level correlations. Our objective is to describe the distribution of the Pearson Rx,ȳ coefficients generated from group-averaged values under random sampling. Assuming random sampling is used when constructing the group averages, the normal distribution can be used to construct confidence intervals that approximate the individual coefficient Rindividual from Rx,ȳ.
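The simulation idea described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the choice of a standard bivariate normal population, the default number of simulations, and the use of the number of groups as the sample size in Fisher's standard error are all assumptions made for illustration.

```python
import numpy as np


def simulate_group_average_correlations(rho, n_groups, group_size,
                                        n_sims=2000, seed=0):
    """Monte Carlo draws of the Pearson R computed from group averages.

    Assumption: individual scores come from a standard bivariate normal
    population with correlation rho, and groups are formed at random.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    n = n_groups * group_size
    r_values = np.empty(n_sims)
    for i in range(n_sims):
        # Draw n independent (x, y) pairs.
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        # Rows are i.i.d., so consecutive blocks of rows act as a
        # random partition into n_groups groups of equal size.
        means = xy.reshape(n_groups, group_size, 2).mean(axis=1)
        # Pearson R between the group averages of x and of y.
        r_values[i] = np.corrcoef(means[:, 0], means[:, 1])[0, 1]
    return r_values


def fisher_interval(r, n_points, z_crit=1.959963984540054):
    """Approximate 95% interval from Fisher's z-transformation.

    Assumption: n_points is the number of (x, y) points entering the
    correlation, taken here to be the number of groups.
    """
    z = np.arctanh(r)                   # Fisher transform of r
    se = 1.0 / np.sqrt(n_points - 3)    # approximate standard error of z
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)


if __name__ == "__main__":
    r_sim = simulate_group_average_correlations(rho=0.5, n_groups=30,
                                                group_size=10)
    print("mean simulated R from averages:", r_sim.mean())
    print("interval around R = 0.5:", fisher_interval(0.5, n_points=30))
```

Under these assumptions, the mean of the simulated values can be compared with the chosen rho to look at the slight underestimation mentioned in the abstract, and the interval width shrinks as the number of groups grows.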

Sampling from a bivariate distribution
Bivariate distribution of averages
Defining a partition
Confidence intervals for large group sizes m
Assessing the accuracy of the confidence intervals
Findings
Conclusion