Taking Issue
Large Data Sets Can Be Dangerous!
Robert E. Drake, M.D., Ph.D., and Gregory J. McHugo, Ph.D., Dartmouth Medical School, Hanover, New Hampshire
Psychiatric Services, Volume 54, Issue 2, February 2003, page 133. Published online 1 February 2003. https://doi.org/10.1176/appi.ps.54.2.133

Researchers generally believe in the advantages of having more data, often as an antidote to problems with recruiting, retention, and statistical power. Yet the increasing availability of large administrative databases and computerized clinical records, combined with the ease of manipulating data in statistical packages, has created a different set of problems that journal reviewers now encounter more commonly. Among these problems are poor data quality, statistical significance without meaningfulness, the use of multiple tests that capitalize on chance, and post hoc interpretations.

First, data collected for purposes other than research (for example, for billing or for clinical records) are, as a general rule, rarely of research quality. To complicate matters, researchers often have little information about the reliability and validity of such data. The danger is that invalid data are used for invalid analyses that lead to invalid conclusions, a common occurrence.

Second, very large samples yield numerous statistically significant but meaningless associations for a variety of well-documented reasons, such as similar biases that apply across the measures. Statistically significant findings are unimportant when they reflect measurement error or represent tiny differences that do not approach clinical significance. Without studying measurement accuracy and specifying a meaningful difference a priori, researchers sometimes synthesize a pattern of trivial findings into a publishable paper.

Third, with computers and large data sets, the temptation to sift through numerous associations and pick out the ones that seem to fit the investigators' hypotheses, or, even worse, the ones that seem to cohere according to post hoc explanations, is ever present. Many investigators do not report all the tests they have run or all the variables they have examined, and they do not correct for multiple tests. The inevitable result is a proliferation of type 1 errors.

Fourth, large existing data sets encourage investigators to look for research questions that fit the data, usually imperfectly, rather than to find data that can answer a meaningful question. For example, investigators are tempted to use whatever comparison group exists rather than a group that makes sense on the basis of logic and a priori hypotheses.

What is to be done? Researchers can emphasize research ethics, oversight by senior researchers, the criterion of common sense in research training, more quality and less quantity of publications, and adherence to scientific standards. Mental health journals are necessarily adopting new standards for disclosure and review, such as the use of effect sizes and corrections for multiple tests.
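The editorial's second point, that a large enough sample makes even a trivial difference statistically significant, can be illustrated with a short simulation. The sketch below is not part of the original editorial; it assumes Python with NumPy and SciPy, and the group sizes and the 0.02-standard-deviation "true" difference are hypothetical values chosen only for illustration.

```python
# Minimal simulation (illustrative only): with very large groups, a negligible
# true difference yields a tiny p value even though the effect size is trivial.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000                                          # hypothetical database-sized groups
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.02, scale=1.0, size=n)    # true difference: 2% of a standard deviation

result = stats.ttest_ind(group_a, group_b)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p value   = {result.pvalue:.2g}")   # typically far below 0.05
print(f"Cohen's d = {cohens_d:.3f}")        # around 0.02, far below any clinical threshold
```

Reporting the effect size alongside the p value, as the editorial recommends, makes the triviality of such a "finding" immediately visible.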
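The third point, that unreported multiple testing produces a proliferation of type 1 errors, can be illustrated the same way. Again, this is a sketch rather than anything from the editorial: it assumes 100 outcome variables with no true group differences, and the Bonferroni correction stands in for the corrections for multiple tests that the authors mention.

```python
# Minimal simulation (illustrative only): testing many outcomes with no true
# effects still produces "significant" results by chance; correcting for the
# number of tests removes most of them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n, alpha = 100, 500, 0.05

# 100 outcome variables, none of which truly differs between the two groups.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
    for _ in range(n_tests)
])

print("uncorrected 'findings':", int(np.sum(p_values < alpha)))            # about 5 expected by chance
print("Bonferroni-corrected:  ", int(np.sum(p_values < alpha / n_tests)))  # usually 0
```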