Predictive models for classifying blood plateletpheresis donors at the Hospital Nacional Edgardo Rebagliati Martins, Lima, Peru, 2022
The objective of this investigation was to determine the predictive model that best classifies donors for blood plateletpheresis at the Edgardo Rebagliati Martins-EsSalud National Hospital, Lima, Peru, in 2022, using a database of donors who came to donate platelets through plateletpheresis in the period 2015-2022. A descriptive, retrospective, non-experimental research design was used. The predictive models were built and evaluated in Python on Google Colab, following the usual stages of constructing a machine learning model. The decision tree was the model with the best performance both on unbalanced data (Precision=0.89; F1-Score=0.91; AUC=0.90) and on data balanced with SMOTE (Precision=0.87; F1-Score=0.89; AUC=0.90), showing a better predictive capacity in the classification of platelet donors than the other models.
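The balancing step mentioned above can be illustrated with a minimal, pure-Python sketch of the idea behind SMOTE: each synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. This is not the study's code; the function name `smote_like`, the toy points and the parameters `k` and `seed` are assumptions made for illustration only.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolation (SMOTE-style).

    `minority` is a list of equal-length numeric tuples with at least two points.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical toy data: 4 minority samples, oversampled with 6 synthetic ones.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
balanced_minority = minority_points + smote_like(minority_points, n_new=6, k=2)
```

Because new points are interpolated rather than copied, the minority class grows without exact duplicates, which is the main difference from plain random oversampling.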
- Research Article
- 10.52465/joscex.v3i1.61
- Mar 30, 2022
- Journal of Soft Computing Exploration
This study aims to predict whether a patient should be treated as an inpatient or an outpatient by comparing several machine learning techniques, namely logistic regression, decision tree, neural network, random forest, and gradient boosting. The research method has three stages: data collection, data preprocessing, and data modeling. The program code was implemented in Python on Google Colab. The dataset used as the research sample is the Electronic Health Record Predicting data. Based on the accuracy results of this study, the neural network had the highest accuracy for predicting hospitalization decisions, reaching 74.47%, compared with the other algorithms tested (logistic regression, decision tree, random forest, and gradient boosting).
- Research Article
- 10.34306/att.v7i1.440
- Feb 20, 2025
- Aptisi Transactions on Technopreneurship (ATT)
Variations in travel patterns make it difficult to build a prediction model that accurately captures seasonal patterns and predicts BRT passenger numbers precisely. Addressing the Transjakarta BRT passenger prediction problem requires an approach that integrates sophisticated, high-accuracy prediction algorithms. The most accurate prediction model is sought by comparing 8 deep learning models. The model is used for short-term and long-term predictions, as well as for finding correlations among the prediction results of 13 Transjakarta corridors. The prediction simulations were run in Python with the TensorFlow deep learning framework on Google Colaboratory. The BiLSTM-CNN combination had the best evaluation values (SMAPE = 15.9387, MAPE = 0.598, and MSLE = 0.0425), although it took the longest time (134 seconds). Fluctuations in short-term predictions of passenger numbers occur simultaneously across all corridors. Fluctuations in long-term predictions also occur simultaneously across all corridors, except in February. There is no negative correlation among the 13 prediction results, and 8 corridors have a close positive correlation. The prediction results can be used by transportation operators and the government to optimize resource planning and transportation policies to support sustainable community and economic mobility.
- Research Article
- 10.32628/ijsrst523103192
- Jun 10, 2023
- International Journal of Scientific Research in Science and Technology
A spring-damper system was analysed using the Python programming language on Google Colab. The first stage, carried out before the simulation, is to derive the differential equation from Newton's second law. This setup was chosen because it only needs to run in a browser; users can monitor the training process (or even write code) from a smartphone browser if the smartphone is connected to the same Google Drive. The simulation varies the mass from 5 kg to 50 kg in steps of 5 kg to determine the effect of mass on changes in position and velocity. Based on the simulation results, a greater mass affects the amplitude: the position graph increases while the velocity graph decreases, and the time needed for both amplitudes to stabilize increases.
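The kind of simulation described above can be sketched as follows, assuming the standard free mass-spring-damper equation from Newton's second law, m·x'' + c·x' + k·x = 0. The abstract does not give the damping coefficient, stiffness, initial conditions or integration scheme, so `damping`, `stiffness`, `x0`, `v0`, the time step and the semi-implicit Euler integrator are all illustrative assumptions rather than the authors' setup.

```python
def simulate(mass, damping=10.0, stiffness=100.0, x0=1.0, v0=0.0,
             dt=0.001, t_end=60.0):
    """Integrate m*x'' + c*x' + k*x = 0 with semi-implicit Euler steps.

    All default parameter values are hypothetical, chosen only for illustration.
    Returns lists of times, positions and velocities.
    """
    steps = int(round(t_end / dt))
    x, v = x0, v0
    times, xs, vs = [0.0], [x], [v]
    for i in range(1, steps + 1):
        a = -(damping * v + stiffness * x) / mass  # Newton's second law
        v += a * dt   # update velocity first ...
        x += v * dt   # ... then position with the new velocity
        times.append(i * dt)
        xs.append(x)
        vs.append(v)
    return times, xs, vs


def settling_time(times, xs, tol=0.02):
    """Last time at which |x| still exceeds tol * |x(0)|."""
    threshold = tol * abs(xs[0])
    last = 0.0
    for t, x in zip(times, xs):
        if abs(x) > threshold:
            last = t
    return last


# Mass sweep from 5 kg to 50 kg in steps of 5 kg, as in the abstract.
for mass in range(5, 55, 5):
    t, xs, vs = simulate(float(mass))
    print(mass, round(settling_time(t, xs), 2), round(max(abs(v) for v in vs), 3))
```

Under these assumed parameters the system is underdamped for every mass in the sweep, so the printed settling time grows with mass while the peak speed falls, which is in line with the trend the abstract reports.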
- Research Article
- 10.37600/tekinkom.v7i2.1535
- Dec 31, 2024
- Jurnal Teknik Informasi dan Komputer (Tekinkom)
This study aims to compare the C4.5 Decision Tree and Naive Bayes algorithms in predicting heart disease to determine the most efficient algorithm. Heart disease is one of the leading causes of global mortality, including in Indonesia, due to vascular damage that disrupts the optimal functioning of the heart. The dataset used comes from the UCI Machine Learning Repository and the Kaggle website's "Heart Failure Prediction," totaling 918 records with 11 clinical attributes and 1 label. Data processing was conducted using Google Colab with the Python programming language. The results show that the C4.5 algorithm achieved an accuracy of 95.18% after feature selection using Particle Swarm Optimization (PSO), while without feature selection, it achieved an accuracy of 81%, precision of 83%, recall of 74%, F1-score of 78%, and an AUC value of 81%. Meanwhile, the Naive Bayes algorithm achieved a maximum accuracy of 90.87% without feature selection and performed best with an accuracy of 84%, precision of 83%, recall of 80%, F1-score of 81%, and an AUC value of 94%. These findings indicate that the Naive Bayes algorithm outperformed the C4.5 algorithm in several evaluation parameters.
- Research Article
- 10.24042/ajpm.v14i2.14276
- Dec 16, 2023
- Al-Jabar : Jurnal Pendidikan Matematika
Background: In the current era of technology, information security is increasingly important. The growth of technology leads to a higher level of threat to the security of data and information dissemination, and cryptography is a valuable protective tool. Aim: The primary objective of this research is to enhance text security through the fusion of the Vigenere cipher and the Rubik's cube algorithm. By leveraging this novel approach, we aim to fortify the confidentiality of textual data against potential eavesdroppers and adversaries. To demonstrate the practicality of this method, we perform a simulation using the Python programming language within the Google Colab environment. Method: This study employs a qualitative research methodology supplemented by empirical simulation. The combination of the Vigenere Cipher and the Rubik's Cube algorithm in a 4×4×4 configuration is implemented to encrypt and decrypt text. The simulation is executed using the Google Colab platform, enabling a practical illustration of the encryption process. Result: The results of our research indicate the feasibility of generating ciphertext through the amalgamation of the Vigenere Cipher and the Rubik's Cube algorithm in the specified 4×4×4 configuration. The simulation conducted in Google Colab serves as concrete evidence of the effectiveness and practicality of this combined encryption method. Conclusion: In conclusion, this research offers a compelling approach to bolstering text security in the modern era of information technology. By combining the Vigenere Cipher with the Rubik's Cube algorithm in a 4×4×4 configuration, we have demonstrated the potential to significantly enhance the confidentiality of sensitive textual data. The empirical simulation conducted in Google Colab reaffirms the practicality and viability of this innovative encryption technique, highlighting its potential as a valuable tool in the realm of information security.
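The Vigenere stage of the scheme described above can be sketched in a few lines of Python. This covers only the classical Vigenere substitution; the 4×4×4 Rubik's Cube step is not reproduced because the abstract does not specify how it permutes the text. The function name `vigenere` and its handling of non-letters (passed through unchanged, without consuming a key letter) are assumptions for illustration.

```python
def vigenere(text, key, decrypt=False):
    """Classical Vigenere cipher over the 26 ASCII letters.

    Non-letters pass through unchanged and do not advance the key.
    """
    shifts = [ord(k) - ord("A") for k in key.upper() if k.isascii() and k.isalpha()]
    if not shifts:
        raise ValueError("key must contain at least one ASCII letter")
    out = []
    i = 0  # index of the next key letter to use
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            shift = shifts[i % len(shifts)]
            if decrypt:
                shift = -shift
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


ciphertext = vigenere("ATTACKATDAWN", "LEMON")        # "LXFOPVEFRNHR"
plaintext = vigenere(ciphertext, "LEMON", decrypt=True)  # "ATTACKATDAWN"
```

In a combined scheme of the kind the abstract describes, a transposition step would presumably be applied to the output of this substitution, so that decryption undoes the transposition first and the Vigenere shift second.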
- Research Article
- 10.36120/2587-3636.v35i1.87-96
- Mar 1, 2024
- Acta et commentationes: Științe ale Educației
The article offers theoretical and practical material on encryption and decryption of text messages using a codeword cipher. An example implementation of the algorithm for the cryptographic transformations of this cipher in the Python programming language is shown. The Python program code was developed and executed in a free interactive cloud environment, Google Colab.
- Preprint Article
- 10.21203/rs.3.rs-4926945/v1
- Oct 17, 2024
- Research Square
Background: Children make up a large percentage of Coronavirus Disease 2019 (COVID-19) hospital admissions, but there is little information available about the features that predict illness severity or mortality in pediatric patients. Logistic regression, support vector machine and ensemble machine learning algorithms were used to develop predictive models and identify prognostic factors for severity and mortality of COVID-19 in hospitalized children. Methods: A total of 183 children with COVID-19 under the age of 18 years hospitalized in a referral hospital in Yazd province, Iran, from March 1, 2020 to August 1, 2021 were considered for this study. Logistic regression and machine learning classifiers, including support vector machine, decision tree, random forest, bagging classifier trees, gradient boosted decision trees, and adaptive boost classifier trees, were employed to predict the development of mild/severe or critical COVID-19 and death during hospitalization. Each model's performance was assessed through five-fold cross-validation, with evaluation metrics and area under the curve. In addition, the best clinical predictive models were used to identify significant factors between severe and non-severe groups, as well as between survivors and non-survivors. Results: Seven predictive models were developed using the medical files of 183 hospitalized children, consisting of 94 in the non-severe group and 89 (48.6%) in the severe group, as well as 159 survivors and 24 (13%) non-survivors. In predicting severity status, the decision tree and random forest algorithms had the highest accuracies in balanced data, 73.3% and 68.7%, respectively. Based on the decision tree, respiratory distress and cough at the time of admission could be regarded as the key factors for estimating the likelihood of severe status.
The results also showed that gradient boosted decision trees and adaptive boost classifier trees had the best performance for mortality prediction in balanced data, with accuracies of 88.8% and 87.7%, respectively. Cough at the time of admission, age group of 1–13 years old, and non-normal WBC could be considered as predictive factors for death. Conclusions: This study indicated that tree-based classifiers were the best machine learning approaches for predicting severity status and mortality in hospitalized children with COVID-19. Clinical symptoms at the time of admission were identified as the most predictive features through the optimal algorithms.
- Research Article
- 10.21272/1817-9215.2022.4-19
- Jan 1, 2022
- Vìsnik Sumsʹkogo deržavnogo unìversitetu
To obtain high-quality forecasts of electricity consumption across different countries and years, the theoretical foundations and terminology of "Decision Tree" models and their ensemble architecture, "Random Forest", were considered. This architecture helped to find the optimal forecast result without such unwanted effects as overtraining or model insufficiency. The MAE and MSE metrics were considered and implemented to assess quality. Together they can show business value: MAE shows only the absolute error, which conveys the quality of the model to decision makers, while MSE can be useful to neural network model engineers for improving quality with gradient descent. The forecast model was implemented in Python using the Numpy, Pandas and Sklearn libraries. The result of the theoretical study is a consistent account of the details and definitions needed to understand what problems decision trees solve and why they can be used to create forecasts in the energy field. The result of the practical implementation is a model with a mean absolute error of 6.90%, which means that the model is adequate and workable; it can be used both as a basis for forecasting and as a self-sufficient model. The study provides an algorithm and demonstrates a sequence of actions for creating a predictive model regardless of its type and architecture, giving insight not only into the details of implementation with specific tools but also into a more abstract description of the steps. It also demonstrates data processing to meet the needs of the models, creation of new variables, and data transformation, which are mandatory practices for obtaining quality results.
The mean absolute error gives general information about the quality of the created model, but specific results can also be informative for a specific country. For example, the forecast for Ukraine for 2021 is -1.90 for the target variable "Net electricity import as share of demand", while the true value is -3.40; the difference between the two figures is even smaller than the expected error.
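The two metrics discussed above can be written out directly. The sketch below is a plain-Python illustration, not the study's Sklearn code; the function names `mae` and `mse` are chosen here for illustration, and the only data point used is the Ukraine 2021 example quoted in the abstract.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Ukraine 2021 example from the abstract: true -3.40, forecast -1.90.
true_values = [-3.40]
forecasts = [-1.90]
abs_error = mae(true_values, forecasts)   # about 1.50 percentage points
sq_error = mse(true_values, forecasts)    # about 2.25
```

MAE stays in the units of the target variable, which is why it is easy to communicate, whereas MSE penalises large misses more heavily and is differentiable everywhere, which suits gradient-based optimisation.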
- Front Matter
- 10.1111/ijcp.13391
- Sep 26, 2019
- International Journal of Clinical Practice
'What I cannot create (and control), I do not understand' (Richard Feynman; modified Bertolero & Bassett, 2019)1 As anyone with an interest in the works of JK Rowling knows, nifflers like shiny treasure and go to extreme lengths to find it. Cardiovascular disease (CVD) physicians are similar in their wish to find deposits of atherosclerosis but are far less accomplished at it. Atherosclerosis is a cryptic disease starting in the vascular wall and only later manifesting within the artery lumen with long-term consequences in the form of plaque rupture or erosion (type 1 lesions) but also vasospasm through secondary endothelial dysfunction (type 2 disease).2 Detecting atherosclerosis is possible using imaging either through the detection of early lesions on ultrasound or in the vessel wall (intima-media thickness) and late-stage calcified plaques (coronary artery calcium).3 The most sophisticated approach is to image atheroma in the wall either in large arteries by magnetic resonance imaging or in coronary arteries by intravascular ultrasound on angiography. Further developments now include three-dimensional imaging techniques applying computerised image reconstruction techniques. However, all of these direct approaches are limited in their application by the expense and size of the machinery required and the logistics of managing patient flows to central sites. Instead, a cheaper and easier approach is pursued by all health systems. The availability of large epidemiological databases and cohort studies now extending in some cases to up to three generations (Framingham)4 means that high-risk individuals can be identified easily from common parameters.
These studies maintain assay standardisation which may not apply to electronic health records (EHRs) linked to standard laboratory assays which evolve with time.5 Landmark analyses starting in 1987 identified certain key CVD risk factors and remarkably quickly these were standardised as age, gender, smoking, blood pressure, diabetes and cholesterol (later divided into total and high-density lipoprotein (HDL) cholesterol).4, 6, 7 A multitude of additional CVD risk factors have since been described but all of these added little to the basic predictive model which is mostly driven by age, gender and ethnicity.8 Risk factor counting and set intervention levels were the basis of defining high-risk patients for intervention. These still persist in modern guidelines, for example, stage 2 hypertension or total cholesterol > 7.5mmol/L and more usefully the concept of two CVD risk factors predicting lifetime risk from age 55.9 Yet these crude cut-offs had the significant limitation that they only identified a small fraction of patients at risk of CVD—that is, high specificity but limited sensitivity. The next development in the 1990s was the beginning of the use of mathematical models based on logistic regression analyses of epidemiological datasets once semiconductor-based scientific notation calculators became available. These could be simplified into paper-based systems or mechanical tools for routine clinical use.10, 11 Now that substantial computing capacity is available through cell phones or internet-based systems these are now universally recommended for assessment of patients with a risk of CVD. The desire to increase convenience has now led to the wish to simplify the process further by abolishing the most logistically difficult (and expensive) aspect which comprises the cholesterol blood tests. 
In fact, the Framingham risk engine can be easily reformatted by substituting body mass index for lipids but surprisingly this has not achieved great popularity for initial risk stratification despite its simplicity.12 The main quest in CVD risk estimation, however, has been to improve sensitivity and specificity. The main methods used have been to use larger more representative datasets based either on aggregating epidemiological cohort studies (eg US atherosclerotic CVD score- ASCVD13) or national EHRs (eg QRISK in the UK14). The best predictive performance of epidemiological datasets is an average area under curve (AUC) for receiver operator characteristic (ROC) curves (ie C-statistic) of approximately 0.70-0.75.11 Adding imaging data from coronary artery calcium increases this to 0.79 with less benefit from the far more convenient ultrasound techniques or biomarkers such as high-sensitivity troponin measurements.15, 16 The next great hope is to exploit the developments in electronic databases and advances in computing. Models to date have relied on deterministic processes guided by humans yet the suspicion has remained that information may have been lost by these decisions so other statistical data interpretation techniques are being explored. Machine learning and Bayesian analysis are the current trendy concepts but many others exist. Neural nets, the best known form of machine learning, were first described in 1943 but it has taken 70 years to make them practical as they require large scale computing to make them practical.17, 18 The techniques of neural net analysis rely on large scale data inputs, intermediate layer (or layers) of nodes linked back to the data and forward to the outputs—in this case CVD events (Figure 1). Nodes are set randomly and then iterate and adjust input weights to optimise the prediction of the outputs.17, 18 Finally, as in classic epidemiological models the outputs are validated in another dataset. 
In contrast to classic calculators, neural nets are multilayered, with many aspects obscured, but if collapsed down to a single layer these can be isolated and described in classic terms. Whether this concept represents an electronic obscurial more than just a black box is the subject of debate. Until now the commonest application of neural nets in medicine has been in the analysis of images as these were data rich and the most problematic for classical methods.19 The problem in CVD for risk prediction has been the availability of large EHR datasets. This is now changing with the rapid computerisation of health systems. In this issue of International Journal of Clinical Practice, Quesada and colleagues describe multi-model analyses of an EHR comprising 38 527 patients with a 5-year follow-up and a likely 5%-10% CVD event rate as is typical in cohorts of this type.20 In their analysis quadratic discriminant analysis and Naïve Bayes ranked above (area under curve [AUC] 0.70) neural nets and classical logistic regression model-derived calculators such as the European Systematic COronary Risk Evaluation (SCORE; CVD mortality alone21) or the US Framingham study-related REGICOR score for CVD events (AUC = 0.63).22 Ten of 15 computer models were better than the classical methods but not by much. This is common in studies which attempt to improve the standard AUC of 0.65-0.75 found for classical CVD risk calculators in populations that match their original derivation and validation cohorts. This study lacked comparisons with recalibrated Spanish cohorts as opposed to generic models so it is unclear how much extra predictive capacity was actually added. Other studies have compared computerised models including neural nets with logistic regression models.
One study using 689 patients from India, but using a validation population of 5209 US patients from the Framingham study, pre-specified classical risk factors and a quantum neural net approach suggested that this model was superior to the classical FRS.23 This is not surprising as the CVD risk factor weighting is different in Indian Asians from US populations. A similar criticism would apply to the Korean National Health and Nutrition Evaluation study (KNHANES-6) using 4244 EHR records with complete pre-specified six CVD risk factor data and a deep belief network (DBN) analysis and a restricted Boltzmann Hopfield network that optimised to six nodes in one layer.24 The statistical DBN gave an AUC of 0.79 compared with 0.72 for logistic regression. This study did not assess their performance against classical or modified (ie recalibrated) CVD risk calculators. These have been investigated in Korean populations where, in a study of 200 010 patients, the ASCVD equation had an AUC of 0.73-0.75 but showed calibration errors, with an excess of 57%-74% in men and a deficit of 28% in women; it was nevertheless useful in enabling a Korean-specific CVD score to be derived.25 In the neural net analysis of the Multi-Ethnic Study of Atherosclerosis (MESA) cohort of 6814 patients followed up for 12 years, with 735 variables derived from biochemistry, questionnaires and imaging, random survival forest analysis was used to derive the top 20 predictors for individual CVD outcomes.15 In this study, nine models were tested including Cox and LASSO-Cox models, and the Akaike information criterion applied to regression analysis as well as random survival forest analysis. Predictably age was the most important predictor of mortality. Coronary artery calcium was the best predictor of coronary heart disease or CVD with glucose and carotid ultrasound for stroke. In contrast to usual expectations, troponin was the strongest predictor of heart failure while NTproB-type natriuretic peptide was the best predictor of CVD.
A UK study used data from 378 256 primary care patients in the Clinical Practice Research Database (CPRD) and 24 970 recorded CVD events (6.6%) to compare various computational methods of CVD risk prediction.26 This study compared the US ASCVD score (interestingly, not UK QRISK) with machine learning models. The standard ASCVD model had an AUC of 0.73, with the random forest model 0.75, logistic regression, gradient boosting or neural networks 0.76. The neural network algorithm predicted 4998 of 7404 cases (sensitivity 68%, positive predictive value (PPV) 18%) and 53 458 of 75 585 non-cases (specificity 72%, negative predictive value (NPV) 96%), predicting 8% more patients who developed CVD compared with the established ASCVD baseline model which predicted 53 106 non-cases from 75 585 non-cases, resulting in a specificity of 70% and NPV of 95%. As is true of all CVD risk prediction models because of their structure of containing many unaffected patients the greatest power is to rule out disease (negative predictive value). The small addition to risk prediction, which is better described in the form of net (or total) reclassification indices (NRI), is not unusual in this type of analysis.27 More recently a comparative study was conducted in a cohort of 109 490 individuals using aggregated and longitudinal features from EHR involving analysis of historical and prospective phases.28 The models tested included logistic regression, random forests, gradient boosting trees, convolutional neural networks (CNN) and recurrent neural networks with long short-term memory (LSTM) units. A further analysis of 10 612 patients used a late-fusion approach to incorporate genetic risk score data. The ASCVD equation achieved a typical ROC AUC of 0.73, while machine learning models using only classical CVD risk factors did no better. Incorporation of EHR features mostly relating to the length of the EHR and variances in biochemical analytes achieved an AUC of 0.77-0.78.
By adding temporal features, logistic regression (LR), gradient boosting trees (GBT) and deep learning models improved the AUC to 0.78-0.79. Both GBT and convolutional neural networks (CNN) achieved an AUC of 0.79 (ie 7.9% improvement from baseline). Most of the studies reviewed in this article use ROC curves which present graphically the trade-off between the true positive rate (TP) (sensitivity) and false positive (FP) (1-specificity) rate for a predictive model using different probability thresholds. In contrast Precision-Recall curves (PRC) and their graphical outputs summarise the trade-off between the true positive (TP) rate and the positive predictive value (PPV; precision) for a predictive model using different probability thresholds.29 Mathematically ROC curves are appropriate when the observations are balanced between each class, whereas PRCs are appropriate for unbalanced datasets as is commonly the case for epidemiological cohort datasets being used to predict events as only a minority develop CVD. In this study Area under PRC (AUPRC) analysis showed that machine learning using temporal features improved predictions founded on baseline data (0.25-0.29 vs 0.19, a 33%-44% improvement) more clearly than that for ROC curves. The top features in all machine learning models include some conventional CVD risk factors such as age, blood pressure (BP) and total cholesterol, as well as several new features not included in standard CVD risk calculators such as body mass index (BMI),30 creatinine,31 glucose.32 However, all of these have been previously identified in the Framingham study or have been used in other CVD scoring systems (eg QRISK).12, 14 Among drug therapies, use of anti-platelet agents was also predictive. Distribution data for laboratory values (eg fasting lipid values) and physical measurements (eg BMI and blood pressure) contributed more than median values to the models.
In the models incorporating longitudinal data such as logistic regression selected biochemical data distribution in two separate sampling periods while random forests selected BMI. The effect of variation in CVD risk factors such as blood pressure, cholesterol, glucose and body mass index (BMI) has previously been linked to risk of CVD in classic epidemiological studies.33 This has been validated for blood pressure and is included in one currently nationally approved CVD risk calculator (QRISK-3).14 It also exists for glucose and cholesterol but this data has not been included in any guideline approved CVD risk calculator to date. Gradient boosting tree (GBT) analysis preferred historic diagnostic codes such as heart valve disorders, lipid disorders and hypertension over other features. One problem of large scale EHRs is the quality of data recording so this historical data may reflect single anomalous values being entered as diagnostic codes (ie a proxy for variance) or the lack of original untreated values in the EHR. Similar considerations apply to anti-platelet therapies such as aspirin-clopidogrel acting as proxies for unrecorded diagnoses of significant CVD (or peripheral arterial disease) or, in the case of aspirin alone, clinical suspicion of high-risk status. Genetic risk scores (GRS) are easily derived given the increasing ease of obtaining large scale genome variation data. Many studies are now investigating the utility of adding GRS to classical CVD risk factors in risk prediction.34 A multiplicity of scores have been investigated using limited panels and whole genome data applied to cohort data sets of up to 300 000 patients but whether any of these are superior to imaging remains unclear.34 GBT using classical CVD risk factors gave similar results to standard methods. Adding longitudinal EHR features to GBT increased AUC to 0.71 vs 0.70; AUPRC of 0.43 vs 0.40 and the genetic risk score (GRS) improved the AUROC and AUPRC by 2% and 9%.
The GRS data included known CVD risk factor genes such as melanoma inhibitory activity protein 3 (MIA3; 2 loci), also known as transport and Golgi organisation protein 1 (TANGO1), which is involved in chylomicron and very low-density lipoprotein transport, and lipoprotein (a) (LPA; 2 loci), as well as C-X-C motif chemokine 12 (CXCL12; stromal cell-derived factor-1), involved in inflammation, and the checkpoint gene cyclin-dependent kinase inhibitor 2A (CDKN2A), involved in angiogenesis. As in the field of CVD risk scores, standardisation of inputs and data transparency are becoming essential to allow comparison of different strategies for the purposes of quality appraisal for evidence-based guidelines. The variety and quality of data set reporting, of analytical and statistical approaches, and of the provision of absolute as opposed to relative effect sizes, together with the lack of specificity and sensitivity data at set points, remain common problems.35, 36 Such approaches are now standard for epidemiological cohorts (CONSORT statement) and diagnostic assays (STARD).35, 36 Reporting standards have been introduced for single-nucleotide polymorphism (SNP) association data and genome-wide association studies to provide greater clarity for journal referees and editors assessing these studies, and for readers to understand them and conduct validation studies. The increasing popularity and complexity of mathematical models applied to CVD and other endpoint data means that similar provisions need to be applied to these studies as well.37 The electronic nature of modern scientific literature means that model derivation structures and data can easily be added as appendices or contributed to public scientific data repositories. A number of publications and review articles have begun to request certain details of mathematical models in addition to the data, and ideally model transparency and availability. A suggested scheme based on data presented in studies reviewed in this field is presented in Table 1.

1. Formal presentation of research questions
2. Data selection: public databases vs electronic health record databases vs registry data
3. Hardware selection
4. Data preparation
5. Feature selection: this should not be necessary. Multi-dimensional datasets may require strategies such as vector embedding to enable features to be passed to other directed learning models
6. Data splitting: design and justify the proportion of training, validation and testing in the dataset (eg 70/10/20, 80/10/10 or 60/20/20) and ideally provide comparison data
7. Modelling selection
8. Technical details for model: specification of technical terms to communicate with data scientists or programmers and allow understanding of the process of model development (learning rate selection, hyperparameter tuning, batch dropout and normalisation, regularisation strategies, loss function selection and network optimisation). Methods used in model structure: logistic regression, Cox regression, random forest or gradient boosting models, neural networks
9. Evaluation of model discrimination and calibration: precision-recall curve (PRC; unbalanced data) or receiver operating characteristic curve (ROC; balanced data) analysis of data with presentation of C-statistics; Brier scores from probabilistic outcomes; presentation of NPV, PPV, sensitivity and specificity at specific set points; comparison with standard statistical approaches (eg multi-variable regression), goodness-of-fit, calibration plots or decision-curve analysis
10. Clinical validation: comparison with expert opinion or published data of other current clinical strategies
11. Publication and transparency: sharing of code with the journal (eg online supplements) or in a public space (eg GitHub, bioRxiv). Directed learning methodologies should be clearly explained in data appendices. Consider strategies for computational anonymisation

It will take consensus conferences between investigators, journal editors, computer modelling specialists, evidence assessment groups and ideally academic funding agencies to finalise and agree the final set of quality metrics. These aurors will then pronounce on the quality of the work submitted. After all, all that is needed to remove the magic is to identify the underlying nature of the work, or to truly see Grindelwald. Otherwise fantastic means mythical as opposed to wonderful.

The authors thank Dr Scooter Morris of the Pharmaceutical Chemistry group in the School of Pharmacy at the University of California, San Francisco, USA for his helpful comments on this manuscript.

None.
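Items 6 and 9 of the scheme suggested in Table 1 (data splitting, and reporting sensitivity, specificity, PPV, NPV and a Brier score at a stated set point) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0.5 set point and the toy labels and probabilities are assumptions made for the example and do not come from any of the reviewed studies.

```python
import random


def split_indices(n, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def metrics_at_set_point(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, PPV and NPV when p >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    train, val, test = split_indices(100)  # hypothetical 70/10/20 split
    print(len(train), len(val), len(test))

    # Toy outcomes and predicted probabilities (made up for illustration).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
    print(metrics_at_set_point(y_true, y_prob, threshold=0.5))
    print(brier_score(y_true, y_prob))
```

Reporting the threshold alongside the four ratios is what makes results comparable between studies; changing the set point changes all four values, whereas the Brier score is computed from the probabilities directly.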
- Research Article
3
- 10.47514/kjcs/2024.1.3.0013
- Sep 30, 2024
- Kasu Journal of Computer Science
Background: Education is a vital component of both societal and individual growth. To create effective learning environments, it is essential to understand the factors that influence student achievement. The application of machine learning algorithms can drive positive change, providing higher education institutions with effective solutions to their challenges. Aim: This research develops a predictive model to identify and analyze the factors affecting student academic performance, aiming to provide institutional administrators and lecturers with a better understanding of the key factors impacting student success. Method: A comprehensive dataset was collected from undergraduate students at Modibbo Adama University (MAU), Yola, comprising student-related, home-related, lecturer-related, and institution-related factors. Model development was carried out using Python in Google Colab, a cloud-based Jupyter notebook environment. Classification algorithms such as K-Nearest Neighbors (KNN), Decision Tree (DT), Gradient Boosting Method (GBM), and Random Forest (RF) were applied to predict student academic performance. Four evaluation metrics, namely accuracy, precision, recall, and F1-score, were used to evaluate the models. Results: The Random Forest model outperformed the other machine learning models, achieving an overall accuracy of 95%. For predicting low-performing students, the model achieved a precision of 0.9677, recall of 0.9474, and F1-score of 0.9574. For high-performing students, the precision was 0.9324, recall was 0.9583, and F1-score was 0.9452. These strong performance metrics across both low- and high-performing student groups demonstrate the effectiveness of the Random Forest model in accurately predicting student academic performance based on the factors considered in the study. 
Feature importance analysis identified lecture attendance, sponsor support, quality of facilities, and lecturer clarification as the most influential factors on student performance. Other features, such as accommodation, employment status, and program preference, were found to have a low impact. These findings emphasize the importance of considering a comprehensive set of student, lecturer, institution, and home-related factors to sustain a conducive learning environment and enhance educational practices.
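The F1-scores quoted in this abstract can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that check; the function name and the dictionary layout are assumptions made for the example, while the numbers are the ones reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract.
reported = {
    "low-performing": (0.9677, 0.9474, 0.9574),
    "high-performing": (0.9324, 0.9583, 0.9452),
}

for group, (precision, recall, f1_reported) in reported.items():
    f1_computed = f1_score(precision, recall)
    print(f"{group}: computed {f1_computed:.4f}, reported {f1_reported:.4f}")
```

For both groups the recomputed value agrees with the reported F1-score to four decimal places, so the three figures are internally consistent.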
- Research Article
- 10.52783/jisem.v10i25s.3925
- Mar 27, 2025
- Journal of Information Systems Engineering and Management
Introduction: Smart cities thrive on innovative technologies, and artificial intelligence (AI) plays a pivotal role in enhancing customer-centric services. In the context of the banking sector, customer retention is vital for maintaining competitiveness, especially in the highly dynamic urban environments of smart cities. Objectives: The main objective of this study is to investigate the application of supervised machine learning algorithms to predict customer churn, a critical factor in developing efficient retention strategies. Methods: Using a dataset of 10,000 customer records, models such as Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting, LightGBM, XGBoost, and Naive Bayes were evaluated. After preprocessing, the models were assessed using accuracy, precision, recall, F1-score, cross-validation, and AUC-ROC. Results: The results reveal that ensemble models, particularly Gradient Boosting, XGBoost, Random Forest, and LightGBM, deliver superior performance on unbalanced data, achieving accuracies of 85.65%, 85.65%, 85.25%, and 85.35%, respectively. On balanced data, LightGBM outperformed the others with an accuracy of 84.21%. Conclusion: These findings highlight the potential of AI-driven predictive models to empower banking institutions in smart cities, fostering better customer retention and contributing to sustainable urban development.
- Research Article
1
- 10.12928/telkomnika.v23i2.26412
- Apr 1, 2025
- TELKOMNIKA (Telecommunication Computing Electronics and Control)
This research aims to develop a predictive model using face recognition-based attendance data and integrating decision support system (DSS) theory with machine learning (ML) techniques to identify high-performing teachers at vocational high schools (SMKs). The novelty of this research lies in integrating theory with the use of face recognition data and ML algorithms to predict and identify high-performing teachers, thereby enhancing decision-making processes and teacher performance management in SMK schools. The dataset consists of SMK teachers' attendance data obtained through a face recognition attendance system, totaling 998 entries. This research employs sensitivity analysis concepts from DSS theory and classification approaches from ML models utilizing support vector machine (SVM), decision trees (DT), and random forest (RF). The models are trained and tested on Google Colab using Python, with data distribution guided by the Pareto principle. The research findings indicate that integrating DSS theory with ML contributes to innovation and benefits in improving decision-making and teacher performance management by successfully predicting high-performing teachers. Evaluation results show the highest accuracy rate of 98% with the RF model, making it the best predictive model compared to the other two models.
- Research Article
4
- 10.15537/smj.2025.46.5.20250080
- May 1, 2025
- Saudi medical journal
To identify the factors associated with post-stroke depression (PSD) and develop a machine learning predictive model using a large dataset, considering sociodemographic, lifestyle, and clinical factors. Our 2025 study used data from the 2023 Behavioral Risk Factor Surveillance System, released in September 2024. Data processing was carried out using Google Colab and Python. We carried out descriptive statistics, logistic regression, and feature importance analyses (mutual information and adjusted mutual information). A total of 4 machine-learning models were trained and evaluated: random forest, decision tree, gradient boosting, and logistic regression. Model performance was assessed using accuracy, precision, recall, the harmonic mean of precision and recall (F1-score), and the area under the receiver operating characteristic curve (AUC-ROC). The best-performing model was fine-tuned using GridSearchCV with 5-fold cross-validation. Increasing age, male gender, being married, higher income, and physical activity were associated with lower odds of PSD. Obesity, smoking, diabetes, and high cholesterol were associated with increased odds of PSD. Age and gender were the most informative features for predicting PSD. Random forest demonstrated the best performance for predicting PSD (accuracy=0.73, precision=0.71, recall=0.77, F1-score=0.74, and AUC-ROC=0.81), which was further improved by hyperparameter optimization. Post-stroke depression's complex etiology involves sociodemographic, lifestyle, and clinical factors, notably age and gender. A random forest model effectively predicts PSD, highlighting the need for comprehensive assessment, early intervention, and management of modifiable risks (obesity, smoking, and inactivity) to improve stroke survivors' outcomes.
- Research Article
3
- 10.31449/inf.v47i6.4691
- Jun 14, 2023
- Informatica
This paper evaluates the effectiveness of different classifiers for predicting the probability of cyber threat or fraudulent intent by an applicant during the Mobile Money Service on-boarding or service activation process, with the goal of determining the best machine learning model for a predictive solution. Experimental work was carried out by formulating cyber threat predictive models using six (6) supervised machine learning algorithms in different configurations: Logistic Regression, Naïve Bayes, Shallow Neural Network (SNN), Deep Neural Network (DNN), Classification and Regression Trees (CART) and Random Forest (RF). Each model was simulated with the Synthetic Minority Over-sampling Technique (SMOTE) and without it (No-SMOTE) on a dataset of 25,000 records of mobile money applicants. Twenty-four (24) different configurations of the formulated predictive models were simulated and evaluated using the Python programming language. The simulation results showed that the Random Forest multiclass configurations with the SMOTE dataset outperformed all other configurations. The results also showed that the multiclass experiments with SMOTE performed better than the binary configurations without SMOTE. The study concluded that using the Random Forest based predictive machine learning model will increase the security level of Mobile Money solutions by detecting and preventing anomalous customer registrations during the mobile money on-boarding process for the unbanked.
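The core idea of SMOTE, which the study applies before training, is to create synthetic minority-class records by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea using only the standard library. It is a simplified illustration, not the implementation used in the paper: the function name `smote_sketch`, the default `k=3` and the toy points are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority points by interpolating towards a near neighbour.

    minority: list of equal-length numeric tuples (minority-class samples only).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # A random point on the line segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


if __name__ == "__main__":
    # Toy two-feature minority class (values made up for illustration).
    minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
    for point in smote_sketch(minority, n_synthetic=5):
        print(point)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling should be applied to the training split only; otherwise interpolated copies of test records can leak into training and inflate the reported performance.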
- Research Article
59
- 10.1109/access.2021.3059018
- Jan 1, 2021
- IEEE Access
Ground vibrations caused by blasting operations in cement canisters are among the main mining issues that cause significant disruptions to nearby buildings and infrastructure. This research was performed in a limestone quarry situated southeast of Helwan City, Egypt, to investigate the impact of ground motion vibration due to cement blast action in limestone rocks. To reduce the environmental impact of quarry blasting, continuous monitoring and accurate assessment of the medium's peak particle velocity (PPV) are required. Recently, machine learning (ML) models have been employed in diverse applications. The default hyperparameters of such models must be modified to fit the problem concerned. Hyperparameter optimization for ML models affects the performance and efficiency of the employed model. In this research, different regression models are implemented for predicting the PPV values. A dataset representing 1438 blast incidents in the Helwan area was built and utilized to evaluate the considered ML models. This dataset incorporates the relationship of the ground vibration amplitude to both the explosive charge weight per delay and the distance from the blast. The predictive models' output performance has been evaluated using the root-mean-squared error (RMSE) and the coefficient of determination ($R^{2}$). The PPV dataset has been divided into training and testing data to produce statistically significant results and to make the dataset more representative to avoid overfitting. The utilized test PPV dataset acts as a proxy for any new PPV data prediction. There was evidence of higher performance in the developed Decision Trees model, with the lowest RMSE and the highest $R^{2}$ on training and testing data. The decision tree is, therefore, an acceptable algorithm for the construction of a predictive PPV model for other quarry blasting areas with conditions identical to those in Helwan. 
Finally, comparative experimental results have shown that optimized models can predict PPV values with lower errors and greater prediction accuracy.
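The two evaluation measures named in this abstract, RMSE and the coefficient of determination $R^{2}$, can be computed directly from observed and predicted values. The sketch below shows one straightforward way to do so; the function names and the toy numbers are assumptions for illustration and are not PPV measurements from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy observed and predicted values (made up for illustration).
    observed = [2.0, 4.0, 6.0, 8.0]
    predicted = [2.5, 3.5, 6.5, 7.5]
    print(rmse(observed, predicted))
    print(r_squared(observed, predicted))
```

A lower RMSE and an $R^{2}$ closer to 1 both indicate a better fit, which is the sense in which the abstract ranks the decision tree first; computing both on held-out test data, rather than on training data alone, is what guards against rewarding an overfitted model.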