Automating Data Analysis Methods in Epidemiology

George Choueiry, Pascale Salameh

doi:10.6339/jds.201901_17(1).0003

Abstract

Technological advances in software development have taken care of many technical details, which made life easier for data analysts but also allowed non-experts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could otherwise be prevented, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that can help practitioners run non-subjective, fast, accurate and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, in order to avoid their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for cutting numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests used to check linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset about smoking in teenagers. Because of the open-source nature of our work, these algorithms can be used in future research and projects.
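The Matthews correlation coefficient quoted above summarizes a binary classifier (here, "normal" vs. "not normal") in a single number between -1 and 1, computed from the four cells of a confusion matrix. The following is a minimal sketch of that formula only, not the authors' implementation; the function name and the counts are hypothetical and chosen purely for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives and false negatives.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional value when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, NOT taken from the paper.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)    # every call correct
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)   # no better than guessing
partial = matthews_corrcoef(tp=40, tn=30, fp=20, fn=10)  # somewhere in between

print(perfect, chance, round(partial, 3))
```

A value near 1 means the predictions agree with the true labels, a value near 0 means they are no better than random, and negative values indicate systematic disagreement.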

Highlights

Statistical errors are abundant in medical literature, and it can be proven that most claimed research findings are false (Ioannidis, 2005).
The adjusted odds ratios for substance correlates, with their 95% confidence intervals and p-values produced by our software, are shown in Figures 4 and 5.
We found that data analysis can be rendered faster and more objective with automation, by combining programming by specific instructions with machine learning techniques.

Summary

Introduction

Statistical errors are abundant in medical literature, and it can be proven that most claimed research findings are false (Ioannidis, 2005). The community has built more than 12,000 packages for R that help solve a large variety of problems and provide data analysts with cutting-edge technology. Huge as this growth has been in recent years, it has only impacted a minority of researchers who know how to code, as R is command-driven and has a steep learning curve (Ozgur, Colliau, Rogers, Hughes, & Myer-Tyson, 2017), which is a serious disadvantage for non-programmers (Khan, 2013).

Tobacco smoking in the form of cigarettes and waterpipe is common among Lebanese students (Bejjani, El Bcheraoui, & Adib, 2012; El-Roueiheb et al., 2008). It has short-term respiratory and non-respiratory effects, causes addiction and leads to other forms of drug use. A review of 19 studies shows that not all studies found a positive relationship between body mass index and smoking among adolescents (Potter, Pederson, Chan, Aubut, & Koval, 2004), and measures of child attachment to parent and parent involvement with the child’s school have a protective effect (Fleming et al., 2002).

Objectives

Methods

Results

Discussion

Conclusion