An investigation of the impact of imbalance on the analysis of the US crop variety evaluation program data

Zhou Fang, Qian M Zhou, Johnie N Jenkins, Dewayne D Deng

doi:10.1002/csc2.21262

Abstract

Multi‐environment trial data from many crop variety evaluation programs are imbalanced because only a subset of varieties is selected for testing in the following year, which leaves many variety‐by‐year combinations missing. Inspired by the US National Cotton Variety Test trial, we conducted new simulation studies to investigate selection processes that differ from those in the existing literature. We make four main contributions. First, we adopted a framework that uses logistic regression to generate imbalanced data that are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Second, our selection process can depend on multiple traits, whereas all existing studies used only a single trait for selection. Third, besides variance components (VCs), we examined long‐term trends that reflect genetic and non‐genetic development, which are of interest because the simulated data span over 30 years. Last, we evaluated the prediction accuracy for each variety's overall and location‐specific performance. The results show that estimation of the VCs and long‐term trends is worst under MNAR when a single trait is used for selection. Compared to VC estimation, long‐term trend estimation is more strongly influenced by the missing mechanism and the missing rate. However, the prediction accuracy for a variety's performance is mainly driven by the missing rate and is less sensitive to the selection process. If the genetic and non‐genetic long‐term trends are ignored, both estimation and prediction deteriorate. Adding more testing years improves estimation and prediction, despite a higher missing rate.
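The logistic-regression selection framework described above can be illustrated with a minimal sketch. This is not the authors' simulation code: the trait names, coefficients, sample size, and the choice to represent MNAR by letting selection depend on a value that is lost when a variety is dropped are all assumptions made for illustration. It only shows how one selection model can yield MCAR, MAR (single or multiple traits), or MNAR patterns by changing which covariates carry nonzero coefficients.

```python
import numpy as np


def selection_prob(intercept, coefs, covariates):
    """Logistic selection probability: 1 / (1 + exp(-(intercept + X @ coefs)))."""
    eta = intercept + covariates @ coefs
    return 1.0 / (1.0 + np.exp(-eta))


rng = np.random.default_rng(0)
n_varieties = 200  # hypothetical number of varieties tested in the current year

# Hypothetical standardized traits (names are illustrative assumptions).
yield_obs = rng.normal(size=n_varieties)  # observed in the current year
fiber_obs = rng.normal(size=n_varieties)  # second observed trait
# Next year's value, which becomes missing whenever the variety is not selected.
yield_next = 0.7 * yield_obs + rng.normal(scale=0.5, size=n_varieties)

X = np.column_stack([yield_obs, fiber_obs, yield_next])

# Coefficients on (yield_obs, fiber_obs, yield_next); the values are assumptions.
mechanisms = {
    "MCAR": np.array([0.0, 0.0, 0.0]),        # selection ignores all data
    "MAR_single": np.array([1.5, 0.0, 0.0]),  # one observed trait
    "MAR_multi": np.array([1.0, 1.0, 0.0]),   # several observed traits
    "MNAR": np.array([0.0, 0.0, 1.5]),        # depends on the unobserved value
}
intercept = 0.0  # shifts the overall missing rate; 0 gives about 50% selected

selected = {}
for name, coefs in mechanisms.items():
    p = selection_prob(intercept, coefs, X)
    # A variety advances to the next year with probability p.
    selected[name] = rng.random(n_varieties) < p

for name, mask in selected.items():
    print(f"{name}: missing rate = {1.0 - mask.mean():.2f}")
```

Lowering the intercept raises the missing rate without changing the mechanism, which is one way to vary the two factors separately in a simulation of this kind.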

Full Text