Can statistical learning models make early selection among sugarcane families easier and still efficient?

Édimo Fernando Alves Moreira, Luiz Alexandre Peternelli, Marcio Henrique Pereira Barbosa

doi:10.1002/csc2.20334

Abstract

The selection of genotypes at the early stages is one of the main challenges facing sugarcane (Saccharum officinarum L.) breeding programs. The present work aimed to compare classification techniques, namely logistic regression (LR), k‐nearest neighbor (KNN), random forests (RF), and support vector machine (SVM), against the selection among families of sugarcane via artificial neural networks (ANN) and via a procedure based on the weighing of the plots. The data used in this work were obtained from 110 families. In the families, the number of stalks (NS), stalk diameter (SD), and stalk height (SH) were collected, in addition to the actual yield, expressed in tons of cane per hectare (TCH). We considered the NS, SD, and SH as explanatory variables for the training of the classifiers. The response was the indicator Y, with Y = 0 if the family is not selected via TCH and Y = 1 otherwise. To increase the efficiency in training, we produced synthetic data based on the simulation of NS, SD, SH, and TCH values. Two models were considered: a full model with all the predictors and a reduced model without the SH. We used the apparent error rate (AER) and the true positive rate (TPR) to evaluate the classifiers. All classifiers presented low values for the AER and high values for the TPR in both models. The best performance was observed in the SVM. The reduced model should be preferred, since its performance is very close to that of the full model and its operation is more straightforward.
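
The two evaluation measures named above can be illustrated with a minimal sketch. It assumes the usual definitions, which the abstract does not spell out: the AER as the proportion of families whose predicted class differs from the TCH-based class, and the TPR as the proportion of families selected via TCH (Y = 1) that the classifier also selects. The function names and the example labels are hypothetical and are not taken from the paper.

```python
def apparent_error_rate(y_true, y_pred):
    """Proportion of families whose predicted class differs from the reference class.

    Assumed definition: (false positives + false negatives) / total.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def true_positive_rate(y_true, y_pred):
    """Proportion of reference-selected families (Y = 1) that the classifier also selects.

    Assumed definition: true positives / (true positives + false negatives).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have equal length")
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        raise ValueError("TPR is undefined when no family has Y = 1")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives


# Hypothetical labels for eight families (illustrative only, not data from the study).
y_true = [1, 1, 0, 0, 1, 0, 0, 0]  # selection indicator Y based on TCH
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]  # selection predicted by some classifier

aer = apparent_error_rate(y_true, y_pred)
tpr = true_positive_rate(y_true, y_pred)
print(f"AER = {aer:.3f}, TPR = {tpr:.3f}")
```

A low AER together with a high TPR means the classifier rarely disagrees with the TCH-based selection and recovers most of the families that would have been selected by weighing.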

Full Text