Inference using complex data from surveys and experiments.

D Roland Thomas

doi:10.1037/h0078864

Abstract

This paper focusses on methods for analyzing complex data, i.e., data that do not conform to the assumptions of independence and homoscedasticity on which many classical procedures are based. Primary attention will be given to regression analysis, with ANOVA as a special case, though reference to related work on loglinear models and logit analysis will also be made.

Complex survey data typically arise from surveys involving stratification and several levels of unit selection, i.e., several levels of clustering involving, in area surveys for example, city blocks, dwellings within blocks, and individuals within dwellings. Since individuals within a cluster are likely to be more similar, one to another, than to individuals in different clusters, a simple statistical model based on independent observations is not appropriate. An additional complexity often encountered in large surveys is that the first-level clusters, or primary sampling units (psu's), may be selected from the target population with unequal probability. Complex data also arise in experimental setups, for example when more than one animal from a litter is included in the experiment, or when an experiment includes measurements of both of a subject's eyes (Rosner, 1982) or both of a subject's ears (Coren and Hakstian, 1990). Major advances have been made over the last decade and a half in understanding the effects on classical statistical analyses of ignoring data complexity. Ignoring clustering can result in inflated Type I errors for test statistics (Scott and Holt, 1982; Rao and Scott, 1981, 1984; Rao and Thomas, 1988; Zumbo and Zimmerman, 1991). Ignoring the survey selection mechanism, i.e., the survey design, can, in some cases, result in biased estimates of regression parameters (Nathan and Holt, 1980; Holt, Smith and Winter, 1980). Succinct reviews of these issues have been given by Nathan (1988) and by Nathan and Smith (1989).
Various methods for analyzing complex data that take account of the complexity have now been developed, several of which are described in detail by Skinner, Holt and Smith (1989). These methods are not yet well known to psychologists and other behavioural researchers, and it is hoped that this paper will encourage these practitioners to familiarize themselves with the new analytic tools that are becoming available.

The paper is organized around three sub-themes. First, the problems associated with using standard methods and software on complex data are discussed. A simple example explaining and illustrating the dangers of ignoring clustering is given in Section 2. The second sub-theme is that much of the work on alternative strategies for complex data analysis is based on an inferential framework (design-based inference) that is fundamentally different from the model-based inference familiar to most psychologists. Sections 3, 4 and 5 of the paper provide an introduction to some aspects of design-based (or finite population) inference, and contrast it with the more familiar model-based approach. Examples are given. The third sub-theme relates to the analysis of complex experimental data. Though model-based inference is by far the most popular approach to analyzing experiments in psychology, the randomization approach is increasingly being advocated as an alternative (see the paper by May in this issue). In Section 6, it will be argued that design-based inference provides a third approach to analyzing some experimental setups involving clustered data. An example involving rat litters is described.

The Effect of Ignoring Sample Structure

This section concentrates on the dangers of ignoring clustering, a common feature of complex survey and experimental data. Table 1 provides a hypothetical data set containing 12 observations of a single character y.
The hypothesis to be tested is that the mean of the population from which the y values are drawn is equal to two. The second column of Table 1 presents the data with no information about sample structure, in which case the analyst can do little but assume independence and homoscedasticity of the observations, and try a one-sample t-test (here we ignore distributional subtleties). …
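
The inflation of Type I error that results from ignoring clustering can be illustrated with a small simulation. The sketch below is not based on the data in Table 1, which are not reproduced here. It assumes, purely for illustration, 12 normally distributed observations arranged in 4 clusters of 3, generated from a random-intercept model with unit total variance and a true mean of two, so the null hypothesis holds. The function name, the cluster layout and the intra-cluster correlation values are all hypothetical choices.

```python
import numpy as np

# Two-sided 5% critical value of Student's t with 11 degrees of freedom.
T_CRIT_DF11 = 2.201


def rejection_rate(icc, n_clusters=4, cluster_size=3, mu=2.0,
                   n_sims=20000, seed=0):
    """Share of simulated samples in which a naive one-sample t-test
    (clustering ignored) rejects the true hypothesis: mean == mu."""
    n = n_clusters * cluster_size
    if n != 12:
        raise ValueError("the hard-coded critical value assumes 12 observations")
    rng = np.random.default_rng(seed)
    # Assumed random-intercept model: a shared cluster effect plus
    # individual noise, scaled so that the total variance is 1 and the
    # intra-cluster correlation equals icc.
    cluster_effect = rng.normal(0.0, np.sqrt(icc), size=(n_sims, n_clusters, 1))
    noise = rng.normal(0.0, np.sqrt(1.0 - icc),
                       size=(n_sims, n_clusters, cluster_size))
    y = (mu + cluster_effect + noise).reshape(n_sims, n)
    # Naive t statistic: treats all 12 observations as independent.
    t = (y.mean(axis=1) - mu) / (y.std(axis=1, ddof=1) / np.sqrt(n))
    return float(np.mean(np.abs(t) > T_CRIT_DF11))


if __name__ == "__main__":
    for icc in (0.0, 0.3, 0.6):
        rate = rejection_rate(icc)
        print(f"intra-cluster correlation {icc:.1f}: rejection rate {rate:.3f}")
```

With an intra-cluster correlation of zero the rejection rate should sit near the nominal 5% level. As the correlation grows, the naive standard error understates the true variability of the sample mean, and the test rejects the true hypothesis more often than the nominal level suggests.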
