In recent years, datasets on the phenotyping of sunflower genotypes have become increasingly accessible. However, a key challenge remains: predicting phenotypes from genotypes efficiently and accurately under climate change. Analyzing phenotypes at different levels of organization and detecting connections between phenotypes and genotypes requires integrating and processing large, diverse, and often noisy datasets. Machine learning offers a broad arsenal of methods for identifying predictive patterns in such data. This research therefore aimed to develop a methodology for the systematization of sunflower genotypes based on seed phenotypic characteristics, using vector quantization and neural networks. The study determined the phenotypic characteristics of sunflower seeds from various genotypes selected by the Institute of Oilseed Crops of NAAS and grown in the southern Steppe of Ukraine: seed length, width, thickness, seed mass, kernel mass, and seed coat cracking force. For this purpose, appropriate laboratory equipment was developed, including two modules for determining the morphological and rheological properties of seeds. The developed methodology includes the following steps: measuring the characteristics of sunflower seeds from various samples (parental components); studying the mutual correlation of the characteristics; conducting hierarchical cluster analysis of the data using Ward's method; determining the optimal number of groups; performing k-means clustering using the vector quantization method; determining the range of each characteristic that corresponds to each group; training a neural network to group the data by sample and by the created groups; and verifying the adequacy of the neural network on test data.
The methodology was tested, and an MLP 30-15-3 neural network for grouping data by sample and by the created groups of sunflower seeds was built in the Statistica software package. The network's performance was 99.4% on the training set, and 95.6% and 96.7% on the test and validation sets, respectively.
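The k-means (vector quantization) step and the matching of characteristic ranges to groups can be illustrated with a minimal sketch. It is not the authors' Statistica workflow. The seed measurements are synthetic, the feature names, group centers, and noise level are invented for illustration, and the farthest-point initialization is an assumption chosen only to keep the example deterministic.

```python
import random
import statistics

# Hypothetical feature names; the study also measured masses and cracking force.
FEATURES = ("length_mm", "width_mm", "thickness_mm")


def make_synthetic_seeds(n_per_group=30, seed=0):
    """Generate made-up seed measurements around three invented centers."""
    rng = random.Random(seed)
    centers = [(9.0, 4.0, 2.5), (11.5, 5.5, 3.5), (14.0, 7.0, 4.5)]  # illustrative only
    return [
        tuple(rng.gauss(mu, 0.15) for mu in center)
        for center in centers
        for _ in range(n_per_group)
    ]


def standardize(rows):
    """Z-score each column so that no single trait dominates the distance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stdevs = [statistics.pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stdevs)) for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100):
    """Lloyd's k-means with farthest-point initialization (an assumption here)."""
    centroids = [rows[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each seed goes to its nearest centroid (code vector).
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = tuple(statistics.fmean(c) for c in zip(*members))
    return labels, centroids


def group_ranges(raw_rows, labels, k):
    """Return the (min, max) of each raw trait within each group."""
    ranges = {}
    for j in range(k):
        members = [r for r, lab in zip(raw_rows, labels) if lab == j]
        ranges[j] = {
            name: (min(col), max(col)) for name, col in zip(FEATURES, zip(*members))
        }
    return ranges


if __name__ == "__main__":
    seeds = make_synthetic_seeds()
    labels, _ = kmeans(standardize(seeds), k=3)
    for group, traits in group_ranges(seeds, labels, 3).items():
        print(group, traits)
```

In this sketch the number of groups is fixed at three by hand. In the methodology it follows from the hierarchical cluster analysis, and the resulting group labels would then serve as training targets for the neural network.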