Drivers of potential policyholders’ uptake of insurance in Kenya using Random Forest
The low adoption of insurance by potential policyholders in developing countries such as Kenya is a concern for insurers, regulators, and other marketing stakeholders. Designing targeted marketing strategies to boost insurance adoption requires identifying the factors that affect insurance uptake among potential policyholders. In this study, data from the 2021 FinAccess Survey, which interviewed sampled individuals above 16 years of age in Kenya, were analyzed with machine learning techniques, including Random Forest, XGBoost, and Logistic Regression, to uncover the factors driving insurance uptake and the reasons for low adoption among potential policyholders. Random Forest was the most robust of the three classifiers based on Kappa score, recall, F1 score, precision, and area under the receiver operating characteristic curve (approaching 1). The paper explores eight reasons why people currently do not hold insurance policies. The results indicated that affordability was the primary driver of uptake, with 68.67% of respondents expressing a desire to possess insurance but being unable to afford it; highest level of education was the next most significant factor. Cultural and religious beliefs and mistrust of insurance providers were found to have minimal impact on uptake. These findings imply that offering affordable insurance products and conducting awareness campaigns are critical to increasing insurance adoption.
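A minimal sketch of the model comparison described above, using synthetic data in place of the FinAccess Survey; scikit-learn's GradientBoostingClassifier stands in for XGBoost, and all settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (cohen_kappa_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for survey respondents (class imbalance is illustrative).
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "Gradient Boosting (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "kappa": cohen_kappa_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }
for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The same five metrics reported in the abstract are computed for each classifier, so the most robust model can be read off directly.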
- Preprint Article
- 10.5194/egusphere-egu24-1200
- Nov 27, 2024
In recent years, the exploration of exoplanets has gained momentum due to the increasing volume of data collected from missions like Kepler. Machine learning (ML) techniques have proven to be valuable tools for efficiently analyzing and classifying exoplanet candidates. This study focuses on the application of ML models, specifically Random Forest and Gaussian methods, to identify exoplanets using the light curves obtained from Kepler's archived data. The research aims to develop accurate and robust models capable of distinguishing exoplanets from other celestial objects. Feature engineering techniques are employed to extract relevant information from the light curves, including transit depth, transit duration, and periodicity patterns. These features serve as inputs for both the Random Forest and Gaussian models, enabling them to learn and generalize from the training data. The Random Forest model, known for its ensemble-based approach, demonstrates exceptional performance in exoplanet identification. Its ability to capture complex relationships among features and make accurate predictions results in high precision and recall scores. On the other hand, the Gaussian method, which relies on probabilistic modeling, exhibits competitive results through a different classification approach. The performance of the Random Forest and Gaussian models is compared using comprehensive evaluation metrics such as accuracy, precision, recall, and F1 score. The results indicate that the Random Forest model outperforms the Gaussian method in terms of precision and recall. This highlights the effectiveness of ensemble-based ML techniques for exoplanet identification tasks. In conclusion, this study successfully demonstrates the utilization of ML models, specifically Random Forest and Gaussian methods, for exoplanet identification using Kepler's archived data and light curves.
The Random Forest model emerges as the superior choice, achieving higher accuracy and recall rates in distinguishing exoplanets from other celestial objects. These findings contribute to the advancement of exoplanet research and pave the way for the development of more precise and efficient identification methods in the future.
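A toy sketch of the pipeline described above: synthetic light curves stand in for Kepler data, crude feature proxies stand in for the engineered transit features, and Gaussian Naive Bayes is assumed as the "Gaussian method" (the abstract does not specify which one):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def synthetic_light_curve(has_planet):
    """Flat flux with noise; a 'planet' adds periodic box-shaped dips."""
    flux = 1.0 + rng.normal(0, 0.001, 1000)
    if has_planet:
        depth = rng.uniform(0.005, 0.02)
        period, width = rng.integers(80, 200), rng.integers(5, 15)
        for start in range(int(rng.integers(0, 50)), 1000, int(period)):
            flux[start:start + width] -= depth
    return flux

def extract_features(flux):
    """Crude proxies for transit depth, duration, and variability."""
    dip = np.median(flux) - flux.min()
    below = flux < np.median(flux) - 3 * flux.std()
    return [dip, below.sum(), flux.std()]

labels = rng.integers(0, 2, 400)
X = np.array([extract_features(synthetic_light_curve(l)) for l in labels])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

for model in (RandomForestClassifier(random_state=0), GaussianNB()):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__,
          "precision=%.2f recall=%.2f" % (precision_score(y_te, pred),
                                          recall_score(y_te, pred)))
```

Real Kepler analyses would replace the synthetic curves with detrended archive photometry and the feature proxies with proper transit fits.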
- Research Article
- 10.15294/sji.v11i3.11068
- Oct 22, 2024
- Scientific Journal of Informatics
Purpose: This study evaluates the effectiveness of Random Forest (RF) compared to Classification and Regression Trees (CART) in predicting hotel star ratings. The objective is to identify the algorithm that provides the most reliable and accurate classification outcomes based on diverse hotel attributes, in accordance with the standard categorization of star hotel categories. This is necessary because accurate star ratings play an important role in guiding consumer choices and enhancing competitive positioning in the hospitality industry. Method: This study compiled a comprehensive dataset on hotels in Banyumas Regency, including location, facilities, room size, room type, room price, and customer reviews, which was used to train both the RF and CART algorithms. Both algorithms were evaluated using accuracy, precision, recall, and F1 score. Additionally, both algorithms underwent the same preprocessing, and hyperparameter tuning was performed to improve the efficacy of each model. Result: The results showed that RF achieved better overall accuracy and robustness than CART across all tests conducted. Furthermore, RF also outperformed CART in classification effectiveness among classes, including enhanced precision and recall scores across multiple star rating categories, signifying increased generalization and consistency in classification tasks. The RF classifier consistently surpassed the CART classifier in both accuracy and F1 score across all random states and test sizes, with a highest score of 0.9932 at a random state of 100 and a test size of 0.4. The most reliable results were obtained using RF with a random state of 42 and a test size of 0.2, yielding an accuracy of 0.9909, precision of 1.0, recall of 1.0, and F1 score of 1.0. Under the same settings, CART achieved 0.9818, 1.0, 1.0, and 1.0, respectively.
This consistent performance, regardless of fluctuations, illustrates the robustness and suitability of RF for classification tasks compared to CART. Novelty: This study offers new insights into the application of machine learning to hotel star rating prediction using the RF and CART algorithms; the hotel dataset collected for this study is also novel. A detailed comparative analysis is provided, contributing to the existing literature by showing the effectiveness of RF over CART for this specific application. Future studies could explore the integration of additional machine learning methods to further enhance prediction accuracy and operational efficiency in the hospitality industry.
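The RF-versus-CART sweep over random states and test sizes can be sketched as follows; the wine dataset is a stand-in for the (non-public) Banyumas hotel data, and the two settings mirror those named in the abstract:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)  # stand-in for the hotel attribute dataset

results = []
for random_state in (42, 100):          # the two random states in the abstract
    for test_size in (0.2, 0.4):        # the two test sizes in the abstract
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=y)
        for name, clf in (("RF", RandomForestClassifier(random_state=random_state)),
                          ("CART", DecisionTreeClassifier(random_state=random_state))):
            pred = clf.fit(X_tr, y_tr).predict(X_te)
            results.append((name, random_state, test_size,
                            accuracy_score(y_te, pred),
                            f1_score(y_te, pred, average="macro")))

for row in results:
    print("%-4s rs=%-3d test=%.1f acc=%.3f f1=%.3f" % row)
```

Repeating the split across random states and test sizes is what lets the study claim robustness "regardless of fluctuations" rather than performance on a single split.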
- Conference Article
- 10.1109/tensymp54529.2022.9864490
- Jul 1, 2022
Agriculture is the backbone of India's economy. There is an increasing need to predict future crop yields to match crop demand, and farmers want to know in advance which crop to plant and its approximate yield. However, unpredictable rainfall trends, seasonal production trends, and multiple climatic aspects make it challenging to recommend crops and predict yield. Machine learning techniques can address this problem. In this paper, we approach it with two models. The first model predicts crop yield in advance by analyzing factors such as district, season, geoclimatic conditions, soil, and crop type; this will help farmers and the government make agricultural risk management and pricing decisions to maximize profit. In both models, data pre-processing includes eliminating null values, feature selection and elimination, choosing independent and dependent variables, encoding the categorical variables, and finally splitting the dataset. Random Forest Regressor and Decision Tree Regressor are used for yield prediction, evaluated with accuracy, R2, adjusted R2, and residual standard deviation. Naive Bayes, Decision Tree, KNN, Random Forest, Gradient Boosting, and XGBoost classifiers are used for the crop suggestion model, evaluated with accuracy, precision, recall, and F1 score. Finally, the Random Forest Regressor is chosen for crop yield prediction with an accuracy of 89%, and the Random Forest Classifier for crop suggestion with an accuracy of 98%.
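The regression metrics named above can be computed as in this sketch on synthetic data (the crop features are stand-ins); adjusted R2 is 1 - (1 - R2)(n - 1)/(n - p - 1) for n test samples and p predictors:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for district/season/soil features and crop yield.
X, y = make_regression(n_samples=600, n_features=8, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

n, p = X_te.shape
r2 = r2_score(y_te, pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)      # adjusted R2
residuals = y_te - pred
rsd = np.sqrt(np.sum(residuals**2) / (n - p - 1))  # residual standard deviation
print("R2=%.3f adjusted R2=%.3f RSD=%.2f" % (r2, adj_r2, rsd))
```

Adjusted R2 penalizes adding predictors, so it is always at most R2 and is the safer figure when comparing models with different feature counts.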
- Research Article
- 10.3390/s24196177
- Sep 24, 2024
- Sensors (Basel, Switzerland)
This study develops a hybrid machine learning (ML) algorithm integrated with IoT technology to improve the accuracy and efficiency of soil monitoring and tomato crop disease prediction in Anakapalle, a south Indian station. An IoT device collected, at one-minute intervals, critical soil parameters—humidity, temperature, pH, nitrogen (N), phosphorus (P), and potassium (K)—during the vegetative growth stage; these are essential for assessing soil health and optimizing crop growth. Kendall’s correlations were computed to rank these parameters for utilization in hybrid ML techniques. Various ML algorithms including K-nearest neighbors (KNN), support vector machines (SVM), decision tree (DT), random forest (RF), and logistic regression (LR) were evaluated. A novel hybrid algorithm, ‘Bayesian optimization with KNN’, was introduced to combine multiple ML techniques and enhance predictive performance. The hybrid algorithm demonstrated superior results with 95% accuracy, precision, and recall, and an F1 score of 94%, while individual ML algorithms achieved varying results: KNN (80% accuracy), SVM (82%), DT (77%), RF (80%), and LR (81%) with differing precision, recall, and F1 scores. This hybrid ML approach proved highly effective in predicting tomato crop diseases in natural environments, underscoring the synergistic benefits of IoT and advanced ML techniques in optimizing agricultural practices.
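The Kendall-correlation ranking step can be sketched as follows; the soil readings and the disease indicator here are synthetic, and the effect sizes are purely illustrative:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n = 300
# Hypothetical soil readings in plausible agronomic ranges.
soil = {
    "humidity": rng.uniform(30, 90, n),
    "temperature": rng.uniform(15, 40, n),
    "pH": rng.uniform(4.5, 8.5, n),
    "N": rng.uniform(0, 140, n),
    "P": rng.uniform(0, 145, n),
    "K": rng.uniform(0, 205, n),
}
# Synthetic disease indicator weakly driven by humidity and pH.
disease = (soil["humidity"] * 0.03 - soil["pH"] * 0.2
           + rng.normal(0, 1, n) > 0).astype(int)

# Rank parameters by the magnitude of Kendall's tau with the target.
ranking = sorted(
    ((name, abs(kendalltau(values, disease)[0]))
     for name, values in soil.items()),
    key=lambda t: t[1], reverse=True)
for name, tau in ranking:
    print("%-11s |tau| = %.3f" % (name, tau))
```

Kendall's tau is rank-based, so the ranking is insensitive to monotone rescalings of the raw sensor readings, which suits heterogeneous IoT parameters.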
- Dissertation
- 10.31979/etd.re7j-zjrm
- Feb 24, 2021
The application of machine learning (ML) techniques to simulated cosmological data aids in the development of predictive theories of galaxy formation, evolution, and the nature of dark matter (DM) in the Universe. We present the results of a simple binary classification model for predicting the dark matter fraction (DMF) of simulated galaxies using ML techniques such as principal component analysis and random forest (RF) classifier algorithms. The source of the data was The Next Generation Illustris (IllustrisTNG) simulations, which is a series of gravo-magneto-hydrodynamical simulations of the mock Universe. The data consisted of a class distribution imbalanced dataset of 2446 high mass satellite galaxies (i.e., stellar masses ≥ 10⁹ M☉) from the twenty-two most massive simulated galaxy clusters (i.e., total cluster masses > 10¹⁴ M☉) in IllustrisTNG. The RF classifier model was trained on simulated galaxy properties (e.g., masses, metallicities, color) and makes predictions on DMF classification labels for classifying galaxies as either DM rich or DM poor (based on a DMF threshold value of 0.8). The RF classifier had an overall accuracy and ROC-AUC score of 92.15% and ∼90%, respectively. The RF predictions for the DM rich majority class had a precision, recall, and F1 score of 93%, 97%, and 95%, respectively. The DM poor minority class, on the other hand, had a precision, recall, and F1 score of 91%, 83%, and 87%, respectively. Thus, the results show that ML classifiers can be employed as novel analytical tools to “measure” hidden galaxy properties, such as the DMF, from simple observable properties with satisfactory results. Furthermore, employing more complex ML algorithms and data sources (e.g., observational data, EAGLE simulations, additional galaxy properties) could help improve the predictive power of the RF model and help gain insights into the DM stripping pathways in galaxy cluster environments.
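The threshold-labeling and imbalanced-classification setup can be sketched as follows; the galaxy properties and the DMF relation below are synthetic stand-ins for the IllustrisTNG catalog, with only the sample size and the 0.8 threshold taken from the abstract:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2446  # matches the study's galaxy count; the features are synthetic
stellar_mass = rng.normal(10, 0.5, n)   # log10 stellar mass (illustrative)
metallicity = rng.normal(0, 0.3, n)
color = rng.normal(0.5, 0.2, n)
# Synthetic dark matter fraction, loosely anti-correlated with mass.
dmf = np.clip(0.95 - 0.05 * (stellar_mass - 9) + rng.normal(0, 0.08, n), 0, 1)
y = (dmf >= 0.8).astype(int)            # 1 = DM rich, 0 = DM poor

X = np.column_stack([stellar_mass, metallicity, color])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["DM poor", "DM rich"]))
print("ROC-AUC = %.3f" % auc)
```

Reporting per-class precision/recall/F1, as the dissertation does, matters here because overall accuracy is dominated by the DM-rich majority class.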
- Research Article
- 10.31559/glm2025.15.2.1
- Jun 1, 2025
- General Letters in Mathematics
This study aims to compare the performance of four popular machine learning algorithms for the classification task of lung cancer diagnosis, namely Decision Tree, Random Forest, Support Vector Machines (SVM), and Logistic Regression. The importance of this study comes from the need to improve the accuracy of lung cancer diagnosis, one of the most common and dangerous cancers, using machine learning techniques that can handle complex and multidimensional data. A dataset containing information about patients, such as age, smoking habits, chemical exposure, and other influencing factors, was used. The performance of the algorithms was evaluated using multiple metrics, including Accuracy, Precision, Recall, and F1 Score. The results showed that the Random Forest algorithm achieved the highest accuracy and best performance in dealing with complex data, while Logistic Regression showed a good ability to interpret influential factors and provide analytical insights. Based on these results, the study recommends the use of the Random Forest algorithm in lung cancer diagnosis applications that require high accuracy, considering the role of logistic regression in analyzing influencing factors. It also recommends exploring additional improvements to the algorithms to increase their effectiveness on larger and more complex datasets.
- Research Article
- 10.1038/s41374-021-00662-x
- Mar 1, 2022
- Laboratory Investigation
Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models
- Research Article
- 10.3389/fpsyt.2023.1266548
- Dec 21, 2023
- Frontiers in Psychiatry
Bipolar disorder (BD) is a chronically progressive mental condition, associated with a reduced quality of life and greater disability. Patient admissions are preventable events with a considerable impact on global functioning and social adjustment. While machine learning (ML) approaches have demonstrated predictive ability in other diseases, little is known about their utility for predicting patient admissions in this pathology. This study aimed to develop prediction models for hospital admission/readmission within 5 years of diagnosis in patients with BD using ML techniques. The study utilized data from patients diagnosed with BD in a major healthcare organization in Colombia. Candidate predictors were selected from Electronic Health Records (EHRs) and included sociodemographic and clinical variables. ML algorithms, including Decision Trees, Random Forests, Logistic Regressions, and Support Vector Machines, were used to predict patient admission or readmission. Survival models, including a penalized Cox Model and Random Survival Forest, were used to predict time to admission and first readmission. Model performance was evaluated using accuracy, precision, recall, F1 score, area under the receiver operating characteristic curve (AUC) and concordance index. The admission dataset included 2,726 BD patients, with 354 admissions, while the readmission dataset included 352 patients, with almost half being readmitted. The best-performing model for predicting admission was the Random Forest, with an accuracy score of 0.951 and an AUC of 0.98. The variables with the greatest predictive power in the Recursive Feature Elimination (RFE) importance analysis were the number of psychiatric emergency visits, the number of outpatient follow-up appointments and age. Survival models showed similar results, with the Random Survival Forest performing best, achieving an AUC of 0.95.
However, the prediction models for patient readmission had poorer performance, with the Random Forest model being again the best performer but with an AUC below 0.70. ML models, particularly the Random Forest model, outperformed traditional statistical techniques for admission prediction. However, readmission prediction models had poorer performance. This study demonstrates the potential of ML techniques in improving prediction accuracy for BD patient admissions.
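The RFE importance analysis mentioned above can be sketched as follows; the feature names are hypothetical EHR-style predictors (the study's actual variable set is not reproduced here), and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical EHR-style predictor names, for illustration only.
feature_names = ["emergency_visits", "outpatient_visits", "age",
                 "sex", "comorbidities", "med_count", "prior_admissions",
                 "substance_use"]
X, y = make_classification(n_samples=800, n_features=len(feature_names),
                           n_informative=3, random_state=0)

# RFE repeatedly drops the least important feature until 3 remain;
# Random Forest supplies the importances at each elimination round.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=3).fit(X, y)
selected = [n for n, keep in zip(feature_names, rfe.support_) if keep]
print("Top predictors by RFE:", selected)
```

RFE works with Random Forest because the estimator exposes `feature_importances_`, which the eliminator uses to decide which feature to drop at each step.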
- Research Article
- 10.12732/ijam.v38i3s.699
- Oct 13, 2025
- International Journal of Applied Mathematics
This study applies machine learning (ML) techniques to model and predict Assam’s agricultural Gross State Domestic Product (GSDP). Three predictive models—multiple linear regression, random forest regression, and gradient boosting—are evaluated. The random forest model achieved the best fit, exhibiting the highest R² and the lowest mean squared error (MSE) and Akaike information criterion (AIC), along with statistically significant coefficients. Ensemble methods (random forest and gradient boosting) markedly improve forecast accuracy of agricultural growth trends compared to traditional regression, yielding more reliable predictions of productivity and GSDP contributions. The findings underscore the vital role of agricultural productivity in driving economic growth, strengthening GSDP, and supporting food security and employment. Integrating advanced ML techniques with statistical analysis provides insights for policymakers to make data-driven decisions that foster sustainable agricultural development and economic prosperity in Assam. Objectives: Predict Assam’s agricultural sector performance using selected machine learning models. Evaluate and compare the effectiveness of these models in assessing the state’s agricultural economy. Methods: Data preprocessing involved handling outliers (using interquartile range and min-max scaling) and feature selection via correlation heatmaps. Predictive models (multiple linear regression, random forest regression, and gradient boosting) were implemented in Python. Results: The gradient boosting model emerged as the most effective, achieving the highest accuracy and generalization (testing R² = 0.9867). Farm area, labour, maize yield, and autumn rice yield were the most significant positive contributors to GSDP. The random forest model performed similarly well (R² = 0.9867), while the multiple linear regression model was least accurate (R² = 0.9521), likely due to its inability to capture nonlinear relationships.
Conclusions: Machine learning models offer transformative potential for Assam’s agricultural sector. Leveraging data-driven insights from these models can empower policymakers to design targeted interventions, promoting inclusive and sustainable economic growth in the region.
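The R², MSE, and AIC comparison used above can be sketched for the linear model with NumPy alone; the predictors and coefficients are purely illustrative, and the Gaussian-likelihood AIC is n·ln(RSS/n) + 2k for k fitted parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
# Hypothetical yearly predictors of agricultural GSDP (illustrative only).
farm_area = rng.uniform(20, 30, n)
labour = rng.uniform(5, 9, n)
rice_yield = rng.uniform(1.5, 2.5, n)
gsdp = 3 * farm_area + 10 * labour + 40 * rice_yield + rng.normal(0, 5, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), farm_area, labour, rice_yield])
beta, *_ = np.linalg.lstsq(X, gsdp, rcond=None)
pred = X @ beta

rss = np.sum((gsdp - pred) ** 2)
tss = np.sum((gsdp - gsdp.mean()) ** 2)
k = X.shape[1]                          # number of fitted parameters
r2 = 1 - rss / tss
mse = rss / n
aic = n * np.log(rss / n) + 2 * k      # Gaussian-likelihood AIC
print("R2=%.4f MSE=%.2f AIC=%.1f" % (r2, mse, aic))
```

Because AIC penalizes extra parameters, it can rank models differently from raw R², which is why the study reports both.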
- Research Article
- 10.3389/fmtec.2022.855208
- Jul 22, 2022
- Frontiers in Manufacturing Technology
Electrical, metal, plastic, and food manufacturing are among the major energy-consuming industries in the U.S. Since 1981, the U.S. Department of Energy Industrial Assessments Centers (IACs) have conducted audits to track and analyze energy data across several industries and provided recommendations for improving energy efficiency. In this article, we used statistical and machine learning techniques to draw insights from this IAC dataset with over 15,000 samples collected from 1981 to 2013. We developed predictive models for energy consumption using machine learning techniques such as Multiple Linear Regression, Random Forest Regressor, Decision Tree Regressor, and Extreme Gradient Boost Regressor. We also developed classifier models using Support Vector Machines, Random Forest, K-Nearest Neighbor (KNN), and deep learning. Results using this data set indicate that Random Forest Regressor is the best prediction technique with an R2 of 0.869, and the Random Forest classifier is the best technique with precision, recall, F1 score, and accuracy of 0.818, 0.884, 0.844, and 0.883, respectively. Deep learning also performed competitively with an accuracy of about 0.88 in training and testing after 10 epochs. The machine learning models could be useful in benchmarking the energy consumption of factories and identifying opportunities to improve energy efficiency.
- Research Article
- 10.3390/oral4030032
- Sep 13, 2024
- Oral
Purpose: The purpose of this study is to assess the effectiveness of the best performing interpretable machine learning models in the diagnoses of leukoplakia and oral squamous cell carcinoma (OSCC). Methods: A total of 237 patient cases were analysed that included information about patient demographics, lesion characteristics, and lifestyle factors, such as age, gender, tobacco use, and lesion size. The dataset was preprocessed and normalised, and then separated into training and testing sets. The following models were tested: K-Nearest Neighbours (KNN), Logistic Regression, Naive Bayes, Support Vector Machine (SVM), and Random Forest. The overall accuracy, Kappa score, class-specific precision, recall, and F1 score were used to assess performance. SHAP (SHapley Additive ExPlanations) was used to interpret the Random Forest model and determine the contribution of each feature to the predictions. Results: The Random Forest model had the best overall accuracy (93%) and Kappa score (0.90). For OSCC, it had a precision of 0.91, a recall of 1.00, and an F1 score of 0.95. The model had a precision of 1.00, recall of 0.78, and F1 score of 0.88 for leukoplakia without dysplasia. The precision for leukoplakia with dysplasia was 0.91, the recall was 1.00, and the F1 score was 0.95. The top three features influencing the prediction of leukoplakia with dysplasia are buccal mucosa localisation, ages greater than 60 years, and larger lesions. For leukoplakia without dysplasia, the key features are gingival localisation, larger lesions, and tongue localisation. In the case of OSCC, gingival localisation, floor-of-mouth localisation, and buccal mucosa localisation are the most influential features. Conclusions: The Random Forest model outperformed the other machine learning models in diagnosing oral cancer and potentially malignant oral lesions with higher accuracy and interpretability. The machine learning models struggled to identify dysplastic changes. 
Using SHAP improves the understanding of the importance of features, facilitating early diagnosis and possibly reducing mortality rates. The model notably indicated that lesions on the floor of the mouth were highly unlikely to be dysplastic, instead showing one of the highest probabilities for being OSCC.
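The study interprets its Random Forest with SHAP; as a dependency-light sketch of the same idea (attributing predictions to features), the snippet below uses scikit-learn's permutation importance instead, on a clinical stand-in dataset rather than the study's 237 oral-lesion cases:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Clinical stand-in dataset; the study used its own oral-lesion data.
data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Shuffle each feature on held-out data and measure the accuracy drop.
imp = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
top = sorted(zip(data.feature_names, imp.importances_mean),
             key=lambda t: t[1], reverse=True)[:3]
for name, score in top:
    print("%-25s %.4f" % (name, score))
```

Permutation importance gives global feature rankings only; SHAP, as used in the study, additionally explains individual predictions, which is what supports per-lesion reasoning like the floor-of-mouth finding.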
- Research Article
- 10.14569/ijacsa.2022.0130270
- Jan 1, 2022
- International Journal of Advanced Computer Science and Applications
The Internet of Medical Things was widely deployed in healthcare systems during the COVID-19 pandemic to monitor patients' conditions remotely in critical care units while keeping medical staff safe from infection. However, healthcare systems were severely affected by ransomware attacks that may override data or lock systems from caregivers' access. In this work, after obtaining the required approval, we obtained a real medical dataset from actual critical care units. For research purposes, a portion of the data was used, transformed, and manipulated using a laboratory-made ransomware payload, and successfully labeled. The detection mechanism adopted the supervised machine learning techniques K-Nearest Neighbor, Support Vector Machine, Decision Trees, Random Forest, and Logistic Regression, in contrast with the deep learning technique of Artificial Neural Networks. The KNN, SVM, and DT methods successfully detected the ransomware's signature with an accuracy of 100%, while the ANN detected it with an accuracy of 99.9%. The results of this work were validated using precision, recall, and F1 score metrics.
- Research Article
- 10.1007/s11517-023-02802-5
- Apr 3, 2023
- Medical & Biological Engineering & Computing
Noise and artifacts strongly affect the quality of the electrocardiogram (ECG) in long-term ECG monitoring (LTM), making some of its segments impractical for diagnosis. The clinical severity of noise defines a qualitative quality score that reflects the way clinicians interpret the ECG, in contrast to assessing noise from a quantitative standpoint. Clinical noise thus refers to a scale of qualitative severity levels that aims to elucidate which ECG fragments are valid for diagnosis from a clinical point of view, unlike the traditional approach, which assesses noise in terms of quantitative severity. This work proposes the use of machine learning (ML) techniques to categorize qualitative noise severity using, as gold standard, a database annotated according to a clinical noise taxonomy. A comparative study is carried out using five representative ML methods, namely, K-nearest neighbors, decision trees, support vector machine, single-layer perceptron, and random forest. The models are fed with signal quality indexes characterizing the waveform in the time and frequency domains, as well as from a statistical viewpoint, to distinguish clinically valid ECG segments from invalid ones. A solid methodology to prevent overfitting to both the dataset and the patient is developed, taking into account class balance, patient separation, and patient rotation in the test set. All the proposed learning systems demonstrated good classification performance, attaining a recall, precision, and F1 score of up to 0.78, 0.80, and 0.77, respectively, on the test set with a single-layer perceptron approach. These systems provide a classification solution for assessing the clinical quality of the ECG taken from LTM recordings. Graphical abstract: Clinical Noise Severity Classification Based on Machine Learning Techniques towards Long-Term ECG Monitoring.
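The signal quality indexes that feed the classifiers can be sketched as simple time- and frequency-domain statistics; the indexes and the synthetic "ECG" below are illustrative stand-ins, not the study's actual SQI set:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 250                                   # sampling frequency, Hz
t = np.arange(0, 10, 1 / fs)

def sqi_features(sig, fs):
    """Illustrative time/frequency signal-quality indexes."""
    spec = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    inband = spec[(freqs >= 0.5) & (freqs <= 40)].sum()  # typical ECG band
    total = spec[freqs > 0].sum()
    return {
        "rms": float(np.sqrt(np.mean(sig ** 2))),
        "kurtosis_proxy": float(np.mean(((sig - sig.mean()) / sig.std()) ** 4)),
        "inband_power_ratio": float(inband / total),
    }

clean = np.sin(2 * np.pi * 1.2 * t)             # crude periodic ECG proxy
noisy = clean + rng.normal(0, 1.0, t.size)      # heavy broadband noise

clean_sqi = sqi_features(clean, fs)
noisy_sqi = sqi_features(noisy, fs)
print("clean:", clean_sqi)
print("noisy:", noisy_sqi)
```

Feature vectors like these, computed per ECG segment, are what the five classifiers consume; the in-band power ratio drops as broadband noise spreads energy outside the ECG band.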
- Conference Article
- 10.1115/gtindia2023-118382
- Dec 7, 2023
The gas turbine is a mechanical system that has been used for power generation for decades. Its components are highly stressed and exposed to very high temperatures, and it operates under varied environmental conditions that produce numerous loading patterns. In all these operating conditions, gas turbines must operate with high efficiency to meet changing power requirements. Each component must be assessed for different mechanical failures and its life predicted during design and development. Gas turbine rotors, which comprise several discs, are among the most critical components of a gas turbine; the rotor is subjected to high temperature and centrifugal load during operation, so predicting failures becomes the utmost priority. This study aimed to evaluate the effectiveness of Machine Learning (ML) techniques in predicting failure in a gas turbine rotor disc. A simplified 3D geometrical model of the turbine disc was used, including cooling holes, hub geometry with sealing arm, and lifting features. Four input variables, i.e., cooling hole mass flow, temperature of cooling air, purge hole mass flow, and temperature of purge air, were used as input features to the machine learning model. Steady-state thermo-mechanical analyses were performed to evaluate the metal temperature and subsequently the stress and life of the component under various load cases. A machine learning based surrogate model was developed from the data extracted from the 3D thermo-mechanical FEA assessment. The generated dataset was randomly divided in a 75:25 ratio for training and testing of the ML models, respectively. Multiple models based on different algorithms were created for predicting disc LCF life and were evaluated on the test set using various evaluation metrics. Machine learning techniques such as Logistic Regression, Random Forest, and Support Vector Machine algorithms were compared using precision, recall, and F1 score.
The results were validated through confusion matrices and ROC (Receiver Operating Characteristic) curves. This study demonstrates that ML techniques have potential for predicting the failures/life of a component. It will be helpful for assessments performed during later stages of the product life cycle, for example, overhaul, lifetime extension, or when manufacturing deviations occur. These assessments are typically critical and time dependent, and conventional assessment methods take significant time and cost. Therefore, this study and its implementation would make current industrial practices more efficient and cost effective.
- Research Article
- 10.52756/ijerr.2024.v45spl.005
- Nov 30, 2024
- International Journal of Experimental Research and Review
Analyzing user interface (UI) bugs is an important step taken by testers and developers to assess the usability of a software product. UI bug classification helps in understanding the nature and cause of software failures, but manually classifying thousands of bugs is an inefficient and tedious job for both testers and developers. The objective of this research is to develop a classification model for UI-related bugs using supervised Machine Learning (ML) algorithms and Natural Language Processing (NLP) techniques, and to assess the effect of different sampling and feature vectorization techniques on the performance of the ML algorithms. Classification is based on the ‘Summary’ feature of the bug report and utilizes six classifiers: Gaussian Naïve Bayes (GNB), Multinomial Naïve Bayes (MNB), Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting (GB). The dataset is vectorized using two NLP vectorization techniques, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). The ML models are trained after vectorization and data balancing, and the models’ hyperparameters are tuned using a grid search approach to improve their efficacy. This work provides a comparative performance analysis of the ML techniques using Accuracy, Precision, Recall, and F1 Score. The results showed that a UI bug classification model can be built by training a tuned SVM classifier using TF-IDF and SMOTE (Synthetic Minority Oversampling Technique). The SVM classifier provided the highest performance, with Accuracy: 0.88, Precision: 0.86, Recall: 0.85, and F1: 0.85. The results also indicated that the performance of the ML algorithms with TF-IDF is better than with BoW in most cases. This work classifies only bugs related to the user interface.
It also analyzes the effect of two different feature extraction and sampling techniques on the algorithms, adding novelty to the research work.
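The winning TF-IDF + SVM pipeline can be sketched as follows on a few hypothetical bug summaries; `class_weight="balanced"` stands in for the paper's SMOTE oversampling step (SMOTE itself lives in the separate imbalanced-learn package):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical bug-report summaries; the study's dataset is not
# reproduced here. Labels mark UI vs non-UI bugs.
summaries = [
    "button overlaps text field on resize",
    "dropdown menu renders off screen",
    "icon misaligned in toolbar",
    "app crashes when saving file",
    "memory leak during long sessions",
    "database timeout on bulk insert",
]
labels = ["UI", "UI", "UI", "non-UI", "non-UI", "non-UI"]

# TF-IDF vectorization feeding a linear SVM, as in the paper's best model;
# balanced class weights approximate the effect of oversampling.
model = make_pipeline(TfidfVectorizer(),
                      LinearSVC(class_weight="balanced"))
model.fit(summaries, labels)

pred = model.predict(["menu text overlaps the toolbar icon"])[0]
print(pred)
```

In the study, this pipeline would additionally be wrapped in a grid search over the vectorizer and SVM hyperparameters, with SMOTE applied to the training folds only.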