A cost-effective method for combining the power of genetic and epigenetic selection in animal production

Abstract

Traditional breeding programs have largely focused on genetics, often overlooking environmental and epigenetic influences on phenotypic variability. Current methods for developing epigenetic biomarkers (EBs) with machine learning (ML) algorithms require extensive data, making them costly and time-intensive. In this study, using a fish species as a model, we analysed ~500,000 CpG loci in samples from 60 different families to develop EBs for broodstock selection. To address limited sample sizes at the sequencing stage, we combined careful sample selection, statistical filtering, and various feature selection and ML algorithms. As a result, we identified three heritable CpG sites in sire sperm associated with three key performance indicators in their offspring: biomass, fast-growing females, and resistance to the masculinizing effects of high temperature. We then built a model that successfully predicted the best sire broodstock based on the DNA methylation levels of these EBs. The model was validated across three independent trials, including one involving an external cohort of fish with a differentiated genetic background, thereby confirming its robustness beyond the training population. Yield increased up to 1.4-fold when epigenetic selection was included in the genetic selection program, as compared with genetic selection alone. In summary, we present a cost-effective strategy for integrating epigenetic and genetic selection in the context of animal production. Furthermore, this method can also be applied to assess the impact of environmental factors on broodstock, and to samples where obtaining information is challenging, such as in the study of the epigenetic basis of rare diseases and the application of epigenetic markers in conservation biology.
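The small-sample workflow the abstract describes (a statistical filter over CpG loci, followed by feature ranking against an offspring trait) can be sketched generically. The methylation matrix, trait, and planted loci below are simulated stand-ins, not the study's data or its actual algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data: 60 sires x 1,000 CpG loci (the study used ~500,000).
n_sires, n_cpgs = 60, 1000
methylation = rng.uniform(0.0, 1.0, size=(n_sires, n_cpgs))
trait = rng.integers(0, 2, size=n_sires)  # e.g. high vs. low offspring biomass

# Plant a signal in three loci so selection has something to find.
informative = [10, 200, 999]
for j in informative:
    methylation[:, j] = 0.3 + 0.4 * trait + rng.normal(0, 0.05, n_sires)

# Stage 1: statistical filter -- drop near-invariant loci.
keep = np.where(methylation.std(axis=0) > 0.02)[0]

# Stage 2: rank remaining loci by |point-biserial correlation| with the trait.
centered_t = trait - trait.mean()
corr = np.abs(
    (methylation[:, keep] * centered_t[:, None]).mean(axis=0)
    / (methylation[:, keep].std(axis=0) * trait.std() + 1e-12)
)
top3 = keep[np.argsort(corr)[::-1][:3]]
print(sorted(top3.tolist()))
```

The two-stage structure matters for small samples: the cheap filter shrinks the candidate pool before any model fitting, which reduces the chance of overfitting to noise loci.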

Similar Papers
  • Research Article
  • Cited by 5
  • 10.1080/23279095.2024.2382823
Machine and deep learning algorithms for classifying different types of dementia: A literature review
  • Jul 31, 2024
  • Applied Neuropsychology: Adult
  • Masoud Noroozi + 16 more

The cognitive impairment known as dementia affects millions of individuals across the globe. The use of machine learning (ML) and deep learning (DL) algorithms has shown great promise as a means of early identification and treatment of dementia. Dementias such as Alzheimer’s Dementia, frontotemporal dementia, Lewy body dementia, and vascular dementia are all discussed in this article, along with a literature review on using ML algorithms in their diagnosis. Different ML algorithms, such as support vector machines, artificial neural networks, decision trees, and random forests, are compared and contrasted, along with their benefits and drawbacks. As discussed in this article, accurate ML models may be achieved by carefully considering feature selection and data preparation. We also discuss how ML algorithms can predict disease progression and patient responses to therapy. However, overreliance on ML and DL technologies should be avoided until further proof is available. It is important to note that these technologies are meant to assist in diagnosis and should not be used as the sole criterion for a final diagnosis. The research implies that ML algorithms may help increase the precision with which dementia is diagnosed, especially in its early stages. The efficacy of ML and DL algorithms in clinical contexts must still be verified, and ethical issues around the use of personal data must be addressed; both require further study.

  • Research Article
  • Cited by 58
  • 10.3390/rs12132110
Leaf Area Index Estimation Algorithm for GF-5 Hyperspectral Data Based on Different Feature Selection and Machine Learning Methods
  • Jul 1, 2020
  • Remote Sensing
  • Zhulin Chen + 10 more

Leaf area index (LAI) is an essential vegetation parameter that represents light energy utilization and vegetation canopy structure. As the only in-operation hyperspectral satellite launched by China, GF-5 is potentially useful for accurate LAI estimation. However, no research has focused on evaluating GF-5 data for LAI estimation. Hyperspectral remote sensing data contain abundant information about the reflective characteristics of vegetation canopies, but this abundance of data can also lead to the curse of dimensionality. Therefore, feature selection (FS) is necessary to reduce data redundancy and achieve more reliable estimations. Currently, machine learning (ML) algorithms have been widely used for FS. Moreover, the same ML algorithm is usually used for both FS and regression in LAI estimation. However, no evidence suggests that this is the optimal solution. Therefore, this study focuses on evaluating the capacity of GF-5 spectral reflectance for estimating LAI and the performance of different combinations of FS and ML algorithms. Firstly, the PROSAIL model, which couples the PROSPECT leaf optical properties model with the scattering by arbitrarily inclined leaves (SAIL) model, was used to generate simulated GF-5 reflectance data under different vegetation and soil conditions. Then three FS methods, including random forest (RF), K-means clustering (K-means), and mean impact value (MIV), and three ML algorithms, including random forest regression (RFR), back-propagation neural network (BPNN), and K-nearest neighbor (KNN), were used to develop nine LAI estimation models. The FS process was conducted twice using different strategies: first, each FS method was used to find the lowest number of dimensions that maintained the estimation accuracy of all bands; then, the sequential backward selection (SBS) method was used to eliminate the bands with minimal impact on LAI estimation accuracy.
Finally, the three best estimation models were selected and evaluated using reference LAI. The results showed that although the RF_RFR model (RF used for feature selection and RFR used for regression) achieved reliable LAI estimates (coefficient of determination (R2) = 0.828, root mean square error (RMSE) = 0.839), the poor performance (R2 = 0.763, RMSE = 0.987) of the MIV_BPNN model (MIV used for feature selection and BPNN used for regression) suggested that conducting feature selection and regression with the same ML algorithm cannot always ensure an optimal estimation. Moreover, RF selection preserved the most informative bands for LAI estimation, so that each ML regression method could achieve satisfactory estimation results. Finally, the results indicated that the RF_KNN model (RF used for feature selection and KNN used for regression) with seven GF-5 spectral band reflectances achieved better estimation results than the others when validated against simulated data (R2 = 0.834, RMSE = 0.824) and actual reference LAI (R2 = 0.659, RMSE = 0.697).
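The sequential backward selection (SBS) step described above can be sketched generically: repeatedly drop the feature whose removal hurts the score least, stopping when every removal would reduce it. The scoring function below is a toy stand-in for an LAI estimation model, not the paper's method:

```python
def sequential_backward_selection(features, score, min_features=1):
    """Greedily drop the feature whose removal hurts the score least,
    stopping once any further removal would reduce the score."""
    selected = list(features)
    best = score(selected)
    while len(selected) > min_features:
        # Try removing each remaining feature in turn.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        scored = [(score(subset), subset) for subset in candidates]
        top_score, top_subset = max(scored, key=lambda t: t[0])
        if top_score < best:  # every removal hurts -> stop
            break
        best, selected = top_score, top_subset
    return selected, best

# Toy score: only features 0 and 2 matter; extra features add a small penalty,
# mimicking the cost of redundant spectral bands.
def toy_score(subset):
    gain = sum(1.0 for f in subset if f in (0, 2))
    return gain - 0.01 * len(subset)

subset, best = sequential_backward_selection([0, 1, 2, 3, 4], toy_score)
print(sorted(subset), round(best, 2))
```

In a real pipeline the score would be cross-validated regression accuracy, which makes each iteration expensive; this is why SBS is typically applied only after a cheaper filter has reduced the band count.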

  • Research Article
  • Cited by 35
  • 10.3390/math9212813
Android Malware Detection Using Machine Learning with Feature Selection Based on the Genetic Algorithm
  • Nov 5, 2021
  • Mathematics
  • Jaehyeong Lee + 3 more

Since the discovery that machine learning can be used to effectively detect Android malware, many studies on machine learning-based malware detection techniques have been conducted. Several methods based on feature selection, particularly genetic algorithms, have been proposed to increase performance and reduce costs. However, because they have yet to be compared with other methods and many of their features have not been sufficiently verified, such methods have certain limitations. This study investigates whether genetic algorithm-based feature selection helps Android malware detection. We applied nine machine learning algorithms with genetic algorithm-based feature selection to 1,104 static features across 5,000 benign applications and 2,500 malware samples included in the Andro-AutoPsy dataset. Comparative experimental results show that the genetic algorithm performed better than the information gain-based method, which is generally used as a feature selection method. Moreover, machine learning with the proposed genetic algorithm-based feature selection has a clear advantage in terms of time compared to machine learning without feature selection. The results indicate that incorporating genetic algorithms into Android malware detection is a valuable approach. Furthermore, to improve malware detection performance, it is useful to apply genetic algorithm-based feature selection to machine learning.
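Genetic-algorithm feature selection of the kind evaluated above can be sketched as evolution over feature bitmasks. The fitness function below is a toy stand-in for classifier accuracy, and the "useful" feature set is invented for illustration; neither reflects the Andro-AutoPsy pipeline:

```python
import random

random.seed(42)

N_FEATURES = 12
USEFUL = {1, 4, 7}  # toy ground truth: only these features aid "detection"

def fitness(mask):
    # Toy stand-in for classifier accuracy: reward useful features,
    # penalise subset size (fewer features = cheaper extraction).
    hits = sum(1 for i in USEFUL if mask[i])
    return hits - 0.05 * sum(mask)

def evolve(pop_size=30, generations=40, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print([i for i, bit in enumerate(best) if bit])
```

Because the top half of the population is carried over unchanged, the best fitness never decreases; mutation and crossover only explore around it.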

  • Research Article
  • Cited by 2
  • 10.1093/eurheartj/ehab849.176
Machine learning to predict in-hospital mortality risk among heterogenous STEMI patients with diabetes
  • Feb 4, 2022
  • European Heart Journal
  • S Kasim + 3 more

Funding acknowledgements: Public grant(s) – national budget only. Main funding source: Technology Development Fund.

Background: Diabetes has become a major public health concern in Asia. In Malaysia, the prevalence of diabetes has escalated in adults above the age of 18, affecting 3.9 million individuals. Patients with diabetes and coronary heart disease have worse outcomes compared with patients without diabetes who have coronary heart disease. Conventional risk scores such as TIMI and GRACE were derived from a Western Caucasian cohort, with limited data from Asian countries, despite Asia hosting 60% of the world’s population.

Purpose: It is important to recognize the significant, population-specific features associated with in-hospital mortality risk in Asian diabetes patients with STEMI to achieve a reliable and effective clinical diagnosis and improved outcomes. Electronic health records contain large amounts of information on patients’ medical history and are becoming invaluable research tools that could be applied to cardiovascular disease risk prediction through machine learning (ML) algorithms. Given the current success of ML over conventional methods in STEMI mortality prediction, we aim to develop ML algorithms for in-hospital mortality risk in Asian patients diagnosed with DM that can be adopted for clinical prediction.

Methods: We used registry data from the Malaysian National Cardiovascular Disease Database of 5,783 patients diagnosed with DM from 2006 to 2016. Fifty parameters, including demographics, cardiovascular risk, medications, and clinical variables, were considered. Four ML algorithms were constructed using 70% of the registry dataset: Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGB), and Logistic Regression (LR). Feature selection was based on the ML algorithms’ feature importance combined with Sequential Backward Selection (SBS). The area under the curve (AUC) was used as the performance evaluation metric. All algorithms were validated using the remaining 30% of the dataset and compared to the conventional TIMI risk score for STEMI.

Results: The best model, SVM (AUC = 0.90), outperformed the other ML algorithms (Figure 1) and the TIMI risk score (AUC = 0.83). The best SVM model consists of 11 predictors: Killip class, fasting blood glucose, age, systolic blood pressure, heart rate, ACE inhibitor, beta-blocker, total cholesterol, diastolic blood pressure, low-density lipoprotein, and diuretic (Figure 2). Common predictors of the SVM model and the TIMI risk score are Killip class, age, systolic blood pressure, and heart rate. We have shown that a population-specific data mining approach for predicting mortality post-STEMI in diabetes patients outperformed the conventional TIMI risk score.

Conclusion: In the Asian multiethnic population, combining ML approaches with feature selection demonstrated promising outcomes in patients with DM and may provide better patient prognostication than the conventional method.

Figure 1: ML Best Model Performance. Figure 2: Selected Predictors for ML.
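The AUC metric used above has a direct rank-based formulation (the Mann–Whitney identity): it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch with toy labels and scores, not the registry data:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity:
    the probability that a random positive outscores a random negative,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Perfect separation -> 1.0; fully reversed -> 0.0; partial overlap in between.
labels = [0, 0, 1, 1]
print(auc(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))   # 0.0
print(auc(labels, [0.1, 0.8, 0.2, 0.9]))   # 0.75
```

This pairwise form is O(P·N) and fine for small sets; production metrics libraries use a sort-based O(n log n) equivalent.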

  • Research Article
  • Cited by 38
  • 10.3390/s23187709
Emerging Technologies for 6G Communication Networks: Machine Learning Approaches.
  • Sep 6, 2023
  • Sensors (Basel, Switzerland)
  • Annisa Anggun Puspitasari + 3 more

The fifth generation achieved tremendous success, which brings high hopes for the next generation, as evidenced by the sixth generation (6G) key performance indicators, which include ultra-reliable low latency communication (URLLC), extremely high data rates, high energy and spectral efficiency, ultra-dense connectivity, integrated sensing and communication, and secure communication. Emerging technologies such as intelligent reflecting surfaces (IRS), unmanned aerial vehicles (UAVs), non-orthogonal multiple access (NOMA), and others can provide communications for massive numbers of users, though at the cost of high overhead and computational complexity, which must be addressed to meet the demanding 6G requirements. However, optimizing system functionality with these new technologies has proven hard for conventional mathematical solutions. Therefore, ML algorithms and their derivatives could be the right solution. The present study aims to offer a thorough and organized overview of the various machine learning (ML), deep learning (DL), and reinforcement learning (RL) algorithms concerning the emerging 6G technologies, motivated by the lack of research on the significance of these algorithms in this specific context. This study examines the potential of ML algorithms and their derivatives in optimizing emerging technologies to align with the visions and requirements of the 6G network, which is crucial to ushering in a new era of communication marked by substantial advancements. The study also highlights potential challenges for wireless communications in 6G networks and suggests ML algorithms and their derivatives as possible solutions. Finally, the survey concludes that integrating ML algorithms and emerging technologies will play a vital role in developing 6G networks.

  • Research Article
  • 10.1049/icp.2023.0546
Classification of Benign and Malignant Breast Tumor Using Machine Learning and Feature Selection Algorithms
  • Feb 23, 2023
  • IET Conference Proceedings
  • Z Ha Shehab + 2 more

Breast cancer is among the most frequent kinds of cancer; it may be detected early and treated with a high probability of complete recovery before the disease progresses. The only way to save lives and decrease breast cancer mortality is early detection, identification, and efficient treatment. The proper categorization of breast tumors is critical in the practice of medical diagnosis. This study builds different breast tumor classification models based on machine learning algorithms. Support Vector Machine (SVM), k-Nearest Neighbours (KNN), and Random Forest (RF) classifiers are used to build a set of breast tumor classification models. Each classifier is evaluated individually, before and after applying a set of feature selection methods to a public breast tumor dataset. Moreover, a set of hybrid machine learning models is created using a stacking approach. Results show that machine learning algorithms with feature selection techniques can be used effectively to build breast tumor classification models. The highest F-measure value is 97.30%, which was obtained by combining the SVM classifier with CLAE as the feature selection method.

  • Research Article
  • Cited by 21
  • 10.1097/jce.0000000000000359
Ovarian Cancer Classification Using Serum Proteomic Profiling and Wavelet Features A Comparison of Machine Learning and Features Selection Algorithms
  • Oct 1, 2019
  • Journal of Clinical Engineering
  • Ali Mohammad Alqudah

Ovarian cancer is one of the common cancers in women; such a pathological disease within an organ might lead to noticeable changes in the proteomic patterns in serum. Mass spectrometry is the most important tool for understanding these proteomic changes; it extracts complex and informative functional data, whose most significant features are the peaks. This article presents a comparison of 4 widely used machine learning (ML) algorithms and 2 feature selection algorithms. The ML algorithms were applied to low-resolution surface-enhanced laser desorption/ionization time-of-flight data sets for ovarian cancer diagnosis, by extracting wavelet features from spectrometer data and feeding them to the classifiers. The comparison was done by pairing the features selected by the different algorithms with each classifier, and then measuring classification test accuracy, sensitivity, and specificity. Results show that all the presented ML algorithms performed well with the different feature selection algorithms, all exceeding 90% accuracy.

  • Research Article
  • Cited by 29
  • 10.1371/journal.pone.0301541
Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data.
  • Apr 18, 2024
  • PLOS ONE
  • Haohui Lu + 1 more

Many individual studies in the literature observed the superiority of tree-based machine learning (ML) algorithms. However, the current body of literature lacks statistical validation of this superiority. This study addresses this gap by employing five ML algorithms on 200 open-access datasets from a wide range of research contexts to statistically confirm the superiority of tree-based ML algorithms over their counterparts. Specifically, it examines two tree-based ML (Decision tree and Random forest) and three non-tree-based ML (Support vector machine, Logistic regression and k-nearest neighbour) algorithms. Results from paired-sample t-tests show that both tree-based ML algorithms reveal better performance than each non-tree-based ML algorithm for the four ML performance measures (accuracy, precision, recall and F1 score) considered in this study, each at p<0.001 significance level. This performance superiority is consistent across both the model development and test phases. This study also used paired-sample t-tests for the subsets of the research datasets from disease prediction (66) and university-ranking (50) research contexts for further validation. The observed superiority of the tree-based ML algorithms remains valid for these subsets. Tree-based ML algorithms significantly outperformed non-tree-based algorithms for these two research contexts for all four performance measures. We discuss the research implications of these findings in detail in this article.
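The paired-sample t-test used above reduces each comparison to per-dataset performance differences between two algorithms. A minimal sketch of the test statistic; the accuracy lists below are toy values, not the study's 200-dataset results:

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired-sample t-test on two equal-length result
    lists (e.g. per-dataset accuracies of two algorithms)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1              # (t, deg. of freedom)

# Toy per-dataset accuracies: "tree" slightly but consistently ahead.
tree     = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]
non_tree = [0.89, 0.86, 0.90, 0.88, 0.86, 0.90]
t, df = paired_t_statistic(tree, non_tree)
print(round(t, 2), df)
```

The pairing is what gives the test its power here: a small but consistent per-dataset advantage yields a large t even when the two accuracy distributions overlap heavily, which is why the study can report p < 0.001 for modest mean differences.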

  • Conference Article
  • 10.1109/latincom53176.2021.9647813
Predicting Video Bitrate from Encrypted Streaming Traffic in SDN-based 5G Networks with ML
  • Nov 17, 2021
  • Diego Figueiredo + 4 more

Video streaming applications currently dominate mobile network traffic. This predominance motivates network operators to optimize the network aiming at the quality of experience (QoE) of video streaming users. However, due to the widely adopted end-to-end traffic encryption, several key video-QoE indicators (KQI) that are useful to infer QoE are not readily available to the network operator. This work proposes a method to predict the video bitrate KQI from encrypted video streaming traffic. We compare the following machine learning (ML) algorithms for this task: Random Forest, Multilayer Perceptron, and Long Short-Term Memory networks. Since the only information available to the network operator are key performance indicators (KPIs), such as throughput, we use only information obtained from KPIs in our evaluations. We implement an open-source emulation setup for networking experiments to generate the data to train ML algorithms and a feature engineering procedure to obtain statistical features from the raw KPI data. Furthermore, we evaluate the proposed method in two ML learning cases: offline and incremental, and discuss issues regarding the generalization capability of the ML algorithms.

  • Book Chapter
  • Cited by 2
  • 10.5772/9153
On The Combination of Feature and Instance Selection
  • Feb 1, 2010
  • Jerffeson Teixeira de Souza + 2 more

In the last decades, huge amounts of data have become omnipresent in diverse areas of knowledge, such as business, astronomy, biology, and so on. Machine Learning and Knowledge Discovery in Databases (KDD) are fields in Computer Science that focus on the task of transforming these data into useful knowledge. In (Fayyad et al., 1996), KDD is defined as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. Feature and Instance Selection belong to the practice of data preparation (or pre-processing), a preliminary process that transforms raw data into a format convenient for the data mining (or machine learning) algorithm. Usually, data is stored in a table-like format: the columns of these tables are the attributes, or features, which describe the data, and the rows, or lines, are the records, or instances, which are examples of the concept stored in the data. Feature and instance selection allow applications, such as classification or clustering, to focus only on the attributes and records that are important (or relevant) to the specific concept under study. As important machine learning problems, Feature and Instance Selection have been studied systematically over the last decades, during which several algorithms for solving them individually have been proposed. Such selection problems play a fundamental role in the pre-processing step of any learning task. By removing noisy, irrelevant, and redundant features and instances, and reducing the overall dimensionality of a dataset, feature and instance selection have been demonstrated to improve the performance of most machine learning algorithms, speed up the output of models, and allow algorithms to deal with gigantic datasets.
Even though the specialized literature has exhibited remarkable results in solving both the feature and instance selection problems individually, little work has been done to make these solutions work together in order to solve the two related problems simultaneously, or even to understand the relationship between features and instances. This chapter initially discusses the feature and instance selection problems and their relevance to machine learning, giving an accurate definition of both problems. Next, it surveys different approaches for dealing with feature selection and instance selection separately, and some works that have tried to integrate the solutions for these two problems.
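The joint reduction the chapter describes, keeping only relevant columns (features) and rows (instances), can be sketched in a few lines. The masks here are hypothetical hand-picked stand-ins for the output of real selection algorithms:

```python
def select(table, feature_mask, instance_mask):
    """Apply feature selection (columns) and instance selection (rows)
    jointly: keep only the rows and columns flagged as relevant."""
    return [
        [value for value, keep_f in zip(row, feature_mask) if keep_f]
        for row, keep_i in zip(table, instance_mask) if keep_i
    ]

# Toy dataset: 4 instances x 3 features. Drop the noisy middle feature
# and one redundant duplicate instance.
table = [[1, 9, 0], [2, 3, 1], [2, 3, 1], [4, 7, 0]]
reduced = select(table, [True, False, True], [True, True, False, True])
print(reduced)   # [[1, 0], [2, 1], [4, 0]]
```

In practice the two masks interact (a row that looks redundant under all features may be informative under the selected ones), which is exactly the coupling the chapter argues is under-studied.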

  • Research Article
  • Cited by 2
  • 10.1001/jamanetworkopen.2024.32990
Availability of Evidence for Predictive Machine Learning Algorithms in Primary Care
  • Sep 12, 2024
  • JAMA Network Open
  • Margot M Rakers + 10 more

The aging and multimorbid population and health personnel shortages pose a substantial burden on primary health care. While predictive machine learning (ML) algorithms have the potential to address these challenges, concerns include transparency and insufficient reporting of model validation and effectiveness of the implementation in the clinical workflow. To systematically identify predictive ML algorithms implemented in primary care from peer-reviewed literature and US Food and Drug Administration (FDA) and Conformité Européene (CE) registration databases and to ascertain the public availability of evidence, including peer-reviewed literature, gray literature, and technical reports across the artificial intelligence (AI) life cycle. PubMed, Embase, Web of Science, Cochrane Library, Emcare, Academic Search Premier, IEEE Xplore, ACM Digital Library, MathSciNet, AAAI.org (Association for the Advancement of Artificial Intelligence), arXiv, Epistemonikos, PsycINFO, and Google Scholar were searched for studies published between January 2000 and July 2023, with search terms that were related to AI, primary care, and implementation. The search extended to CE-marked or FDA-approved predictive ML algorithms obtained from relevant registration databases. Three reviewers gathered subsequent evidence involving strategies such as product searches, exploration of references, manufacturer website visits, and direct inquiries to authors and product owners. The extent to which the evidence for each predictive ML algorithm aligned with the Dutch AI predictive algorithm (AIPA) guideline requirements was assessed per AI life cycle phase, producing evidence availability scores. The systematic search identified 43 predictive ML algorithms, of which 25 were commercially available and CE-marked or FDA-approved. The predictive ML algorithms spanned multiple clinical domains, but most (27 [63%]) focused on cardiovascular diseases and diabetes. 
Most (35 [81%]) were published within the past 5 years. The availability of evidence varied across different phases of the predictive ML algorithm life cycle, with evidence being reported the least for phase 1 (preparation) and phase 5 (impact assessment) (19% and 30%, respectively). Twelve (28%) predictive ML algorithms achieved approximately half of their maximum individual evidence availability score. Overall, predictive ML algorithms from peer-reviewed literature showed higher evidence availability compared with those from FDA-approved or CE-marked databases (45% vs 29%). The findings indicate an urgent need to improve the availability of evidence regarding the predictive ML algorithms' quality criteria. Adopting the Dutch AIPA guideline could facilitate transparent and consistent reporting of these quality criteria, fostering trust among end users and facilitating large-scale implementation.

  • Preprint Article
  • 10.5194/epsc2020-963
Investigating Machine Learning as a Basis for Asteroid Taxonomies in the 3-Micron Spectral Region
  • May 2, 2024
  • Matthew Richardson + 2 more

As part of a larger study to elucidate the presence of hydrated minerals on asteroid surfaces, we are developing a robust taxonomic classification system using spectroscopic observations in the vicinity of 3 µm. We have constructed a Python algorithm to identify band centers and band depths near 3 µm for a set of normalized, thermally corrected asteroid spectra, to serve as inputs to Python's Scikit-Learn library of Machine Learning (ML) algorithms. We anticipate a thorough investigation of both Principal Component Analysis and ML (supervised, unsupervised, and Artificial Neural Network) techniques to assess which technique is likely to be better suited for classifying the 3-µm data. At this writing, we have run tests using Python's agglomerative clustering ML algorithm to examine possible clustering scenarios. These initial steps have given us some familiarity with the mechanics of using ML on the 3-µm dataset and have served to identify some possible pitfalls or cul-de-sacs. Presented here are the preliminary results we have obtained.

Introduction: Although various techniques have been used, asteroid classification has typically been done via Principal Component Analysis (PCA: [1,2]). PCA is a statistical technique that reduces the dimensionality of a dataset by identifying the most important parameters within a dataset based on their variance. Parameters that exhibit the greatest amount of variance are considered to be of greater importance, while parameters with the least amount of variance are considered to be of lower importance. While the PCA technique produces better visualizations of the data by reducing its dimensionality, it comes with some drawbacks. Disadvantages such as its dependence on scale, and information loss due to the orthogonality of PCA, can make the interpretation of PCA results a more critical and time-consuming process. Therefore, exploring other means of classification may prove to be worthwhile. Machine Learning (ML) algorithms have had a significant impact on the way in which data is analyzed and interpreted, and have already proven to be a powerful, reliable resource in the field of planetary science. Accordingly, the application of ML to an asteroid taxonomy has the potential to be more efficient, objective, and easier to implement than PCA. ML algorithms can be supervised, in which the program "learns" from training data and is able to classify new inputs, or unsupervised, in which the program analyzes the dataset to determine patterns such as clusters. [3] used an Artificial Neural Network (ANN, a subset of ML) to classify asteroids, work followed up by [4]. Recent explorations of supervised ML for asteroid taxonomy are promising, and have applied training sets from existing databases to new visible and/or NIR photometric data (e.g. [5,6,7]). We seek to explore the benefits of ML algorithms, and to compare and contrast them with the PCA technique, in the production of an asteroid taxonomy. Our initial exploration has utilized a set of normalized, thermally corrected asteroid spectra in the vicinity of 3 µm. We have identified band centers and band depths and supplied this parameter space as input to Python's agglomerative clustering ML algorithm.

Methodology: Thermal corrections of the asteroid spectra were performed via a forward model that uses a modified version of the Standard Thermal Model (STM: [8]). The forward model treats the beaming parameter as a free parameter, adjusting its value for each iteration of the STM until it converges onto a value that yields the expected long-wavelength continuum behavior. Spectra were then normalized to unity at a wavelength of 2.3 µm, followed by identification of band centers and band depths near 3 µm using both polynomial and Gaussian fits. In addition, band depths were measured at wavelengths of 2.9 µm and 3.2 µm to gather more information on asteroid band shapes. Lastly, the aforementioned calculated spectral features were input into Python's agglomerative clustering algorithm to determine which asteroid spectra shared similar features.

Summary: As part of a larger investigation to better understand hydrated mineralogies as they apply to asteroids, we have begun work towards developing a quantitative taxonomic framework derived from asteroid spectra in the wavelength range 2.0–4.0 µm. Our exploration thus far of Python's agglomerative clustering algorithm has proven fruitful. Minor changes to the parameterization of this algorithm can yield very different results, which naturally can lead to different interpretations. The agglomerative clustering algorithm is one of the many powerful ML algorithms we will explore against the PCA technique, all of which we will discuss in our presentation.
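Agglomerative clustering of band parameters, as used above, can be sketched with a single-linkage variant on toy one-dimensional "band centre" values; the numbers below are illustrative, not the asteroid spectra:

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering: start with singleton
    clusters, then repeatedly merge the closest pair of clusters
    (closest = smallest distance between any two of their members)
    until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

# Toy 1-D "band centres" (in microns) with two obvious groups.
bands = [2.72, 2.75, 2.74, 3.05, 3.08, 3.06]
clusters = agglomerative(bands, 2)
print(clusters)
```

Swapping the `min` in the linkage for `max` (complete linkage) or a mean (average linkage) changes which pairs merge first, which is one concrete way "minor changes to the parameterization" can yield very different clusterings.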

  • Research Article
  • Cited by 9
  • 10.3389/fnut.2022.740898
Are Machine Learning Algorithms More Accurate in Predicting Vegetable and Fruit Consumption Than Traditional Statistical Models? An Exploratory Analysis
  • Feb 17, 2022
  • Frontiers in Nutrition
  • Mélina Côté + 8 more

Machine learning (ML) algorithms may help better understand the complex interactions among factors that influence dietary choices and behaviors. The aim of this study was to explore whether ML algorithms are more accurate than traditional statistical models in predicting vegetable and fruit (VF) consumption. A large array of features (2,452 features from 525 variables) encompassing individual and environmental information related to dietary habits and food choices in a sample of 1,147 French-speaking adult men and women was used for the purpose of this study. Adequate VF consumption, which was defined as 5 servings/d or more, was measured by averaging data from three web-based 24 h recalls and used as the outcome to predict. Nine classification ML algorithms were compared to two traditional statistical predictive models, logistic regression and penalized regression (Lasso). The performance of the predictive ML algorithms was tested after the implementation of adjustments, including normalizing the data, as well as in a series of sensitivity analyses such as using VF consumption obtained from a web-based food frequency questionnaire (wFFQ) and applying a feature selection algorithm in an attempt to reduce overfitting. Logistic regression and Lasso predicted adequate VF consumption with an accuracy of 0.64 (95% confidence interval [CI]: 0.58–0.70) and 0.64 (95%CI: 0.60–0.68) respectively. Among the ML algorithms tested, the most accurate algorithms to predict adequate VF consumption were the support vector machine (SVM) with either a radial basis kernel or a sigmoid kernel, both with an accuracy of 0.65 (95%CI: 0.59–0.71). The least accurate ML algorithm was the SVM with a linear kernel with an accuracy of 0.55 (95%CI: 0.49–0.61). Using dietary intake data from the wFFQ and applying a feature selection algorithm had little to no impact on the performance of the algorithms. 
In summary, ML algorithms and traditional statistical models predicted adequate VF consumption with similar accuracies among adults. These results suggest that additional research is needed to explore further the true potential of ML in predicting dietary behaviours that are determined by complex interactions among several individual, social and environmental factors.

  • Research Article
  • Cited by 24
  • 10.1016/j.isprsjprs.2023.05.015
Utilization of synthetic minority oversampling technique for improving potato yield prediction using remote sensing data and machine learning algorithms with small sample size of yield data
  • May 24, 2023
  • ISPRS Journal of Photogrammetry and Remote Sensing
  • Hamid Ebrahimy + 2 more


  • Research Article
  • Cited by 84
  • 10.3906/elk-1611-235
Improvement of heart attack prediction by the feature selection methods
  • Jan 1, 2018
  • TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES
  • Hidayet Takci

Prediction of a heart attack is very important since it is one of the leading causes of sudden death, especially in low-income countries. Although cardiologists use traditional clinical methods such as electrocardiography and blood tests for heart attack prediction, computer-aided diagnosis systems that use machine learning methods are also in use for this task. In this study, we used machine learning and feature selection algorithms together. Our aim was to determine the best machine learning method and the best feature selection algorithm for predicting heart attacks. For this purpose, many machine learning methods with optimum parameters and several feature selection methods were used and evaluated on the Statlog (Heart) dataset. According to the experimental results, the best machine learning algorithm is the support vector machine with a linear kernel, while the best feature selection algorithm is the reliefF method. This pair gave the highest accuracy value, 84.81%.
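The reliefF selector that performed best here generalizes the original Relief algorithm, which weights each feature by how well it separates a sample from its nearest neighbour of the other class (nearest miss) versus its nearest neighbour of the same class (nearest hit). A minimal sketch of basic binary-class Relief; the dataset below is illustrative, not the Statlog (Heart) data:

```python
import math
import random

def relief(X, y, n_iterations=100, seed=0):
    """Basic Relief for binary classes: reward features that differ across
    classes (nearest miss) and agree within a class (nearest hit)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    weights = [0.0] * m

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    for _ in range(n_iterations):
        i = rng.randrange(n)
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        near_hit = min(hits, key=lambda j: dist(X[i], X[j]))
        near_miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(m):
            weights[f] += (abs(X[i][f] - X[near_miss][f])
                           - abs(X[i][f] - X[near_hit][f]))
    return [w / n_iterations for w in weights]

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 0.5], [0.2, 0.9], [0.15, 0.1],
     [0.9, 0.4], [0.8, 0.8], [0.85, 0.2]]
y = [0, 0, 0, 1, 1, 1]
w = relief(X, y)
print(w[0] > w[1])   # True
```

reliefF extends this scheme to multiple classes and to k nearest hits and misses, which makes the weights far more robust to noisy data.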
