Good Statistical Practices in Agronomy Using Categorical Data Analysis, with Alfalfa Examples Having Poisson and Binomial Underlying Distributions

Ronald P Mowers,Zhanyou Xu,Yuanyuan Cao,Bruna Bucciarelli,Deborah A Samac

doi:10.3390/crops2020012

Ronald P Mowers, Zhanyou Xu + Show 3 more

Open Access

https://doi.org/10.3390/crops2020012

Copy DOI

Abstract

Categorical data derived from qualitative classifications or countable quantitative data are common in biological scientific work and crop breeding. Categorical data analyses are important for drawing correct inferences from experiments. However, categorical data can introduce unique issues in data analysis. This paper discusses common problems arising from categorical variable analysis and modeling, demonstrates the issues or risks of misapplying analysis, and suggests approaches to address data analysis challenges using two data sets from alfalfa breeding programs. For each data set, we present several analysis methods, e.g., simple t-test, analysis of variance (ANOVA), split plot analysis, generalized linear model (glm), generalized linear mixed model (glmm) using R with R markdown, and with the standard statistical analysis software SAS/JMP. The goal is to demonstrate good analysis practices for categorical data by comparing the potential ‘bad’ analyses with better ones, avoiding too much reliance on reaching a significant p-value of 0.05, and navigating the morass of ever-increasing numbers of potential R functions. The three main aspects of this research focus on choosing the right data distribution to use, using the correct error terms for hypothesis test p-values including the right type of sum of the squares (Type I, II, and III), and proper statistical models for categorical data analysis. Our results show the importance of good statistical analysis practice to help agronomists, breeders, and other researchers apply appropriate statistical approaches to draw more accurate conclusions from their data.

Full Text