Combining Random Forest and Monte Carlo Method to Determine the Driving Factors and Uncertainty of Forest Age Prediction in Northwest China
- Conference Article
- 10.1063/5.0109130
- Jan 1, 2022
Classification modeling is growing and is used in many fields. Much research has been conducted to determine the best classification method for predicting the class of an observation, and most of it finds that the bagging and random forest methods perform best. However, most classification methods run into problems when used to model unbalanced data. The number of school dropouts is small relative to the number of currently active students, so dropout prediction can serve as a case study of unbalanced data. The purpose of this research is to compare the performance of the bagging and random forest methods before and after handling unbalanced data with the Synthetic Minority Oversampling Technique (SMOTE). Performance is compared using the sensitivity, balanced accuracy, and F1 score of each classification method. The results show that the random forest method performs better than the bagging method, both before and after handling unbalanced data.
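The three scores named in this abstract can be computed directly from confusion-matrix counts. The sketch below is illustrative only: the counts are hypothetical and are not taken from the paper, and it assumes every denominator is non-zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (sensitivity, balanced accuracy, F1) from confusion-matrix counts.

    Assumes tp + fn, tn + fp, and tp + fp are all non-zero.
    """
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, balanced_accuracy, f1


# Hypothetical counts: 20 dropouts (positive class), 180 active students.
sens, bal_acc, f1 = classification_metrics(tp=12, fn=8, fp=18, tn=162)
plain_accuracy = (12 + 162) / 200
print(sens, bal_acc, f1, plain_accuracy)
```

With these made-up counts, plain accuracy is 0.87 while sensitivity is only 0.6, which is why studies of unbalanced data tend to report sensitivity, balanced accuracy, and F1 rather than accuracy alone.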
- Research Article
69
- 10.1021/ci050049u
- Apr 21, 2005
- Journal of Chemical Information and Modeling
The random forest and classification tree modeling methods are used to build predictive models of the skin sensitization activity of a chemical. A new two-stage backward elimination algorithm for descriptor selection in the random forest method is introduced. The predictive performance of the random forest model was maximized by tuning voting thresholds to reflect the unbalanced size of classification groups in available data. Our results show that random forest with a proposed backward elimination procedure outperforms a single classification tree and the standard random forest method in predicting Local Lymph Node Assay based skin sensitization activity. The proximity measure obtained from the random forest is a natural similarity measure that can be used for clustering of chemicals. Based on this measure, the clustering analysis partitioned the chemicals into several groups sharing similar molecular patterns. The improved random forest method demonstrates the potential for future QSAR studies based on a large number of descriptors or when the number of available data points is limited.
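Tuning a voting threshold, as mentioned above, amounts to changing the fraction of trees that must vote for a class before it is assigned. The following is a minimal sketch of that idea and not the authors' procedure; the function name `classify_by_votes`, the vote list, and the threshold values are all assumptions made for illustration.

```python
def classify_by_votes(tree_votes, threshold=0.5):
    """Assign class 1 when the fraction of trees voting 1 reaches `threshold`.

    tree_votes: non-empty list of 0/1 votes, one per tree.
    """
    fraction = sum(tree_votes) / len(tree_votes)
    return 1 if fraction >= threshold else 0


# Hypothetical forest of 10 trees; 3 vote for the rarer class (1).
votes = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
default_label = classify_by_votes(votes)                  # majority vote
tuned_label = classify_by_votes(votes, threshold=0.25)    # lowered threshold
print(default_label, tuned_label)
```

Lowering the threshold for the smaller class trades some specificity for sensitivity; in practice the value would be chosen on validation or out-of-bag data.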
- Research Article
3
- 10.1002/chin.200539199
- Aug 31, 2005
- ChemInform
The random forest and classification tree modeling methods are used to build predictive models of the skin sensitization activity of a chemical. A new two-stage backward elimination algorithm for descriptor selection in the random forest method is introduced. The predictive performance of the random forest model was maximized by tuning voting thresholds to reflect the unbalanced size of classification groups in available data. Our results show that random forest with a proposed backward elimination procedure outperforms a single classification tree and the standard random forest method in predicting Local Lymph Node Assay based skin sensitization activity. The proximity measure obtained from the random forest is a natural similarity measure that can be used for clustering of chemicals. Based on this measure, the clustering analysis partitioned the chemicals into several groups sharing similar molecular patterns. The improved random forest method demonstrates the potential for future QSAR studies based on a large number of descriptors or when the number of available data points is limited.
- Research Article
14
- 10.54076/jumpa.v3i2.305
- Sep 30, 2023
- Jurnal Matematika Dan Ilmu Pengetahuan Alam LLDikti Wilayah 1 (JUMPA)
In the current era, cars that use renewable energy, such as electric cars, are strongly supported by the government, which has an impact on the used-car market; this problem calls for analysis. Determining whether the buying or selling price of a used car is appropriate is one of the obstacles people face when deciding to buy or sell a vehicle. Therefore, most people choose an alternative by buying a used car that is still good and usable. One way to make price predictions is to use machine learning. In this study the authors used the random forest and decision tree methods to predict car prices. The two methods gave different results: the random forest method reached an accuracy of 72.13%, whereas the decision tree method reached 67.21%. It can be concluded that the random forest method is more accurate than the decision tree method.
- Research Article
- 10.31891/csit-2024-2-4
- Jun 27, 2024
- Computer systems and information technologies
A huge amount of data is collected and generated in modern sports. This data can be used to improve athletes' performance, make more informed coaching and strategic decisions, and increase fan engagement. However, processing, analyzing, and interpreting this data can be challenging. This article is devoted to the development of an information system for data processing in the sports sector using the random forest method. The system aims to ensure efficient collection, processing, and analysis of large amounts of data generated during sports competitions, training, and interaction with fans and other stakeholders. Research methods. This article proposes an information system (IS) for data processing in the sports industry using the Random Forest (RF) method. As one of the machine learning methods, it is well suited for working with large amounts of data and complex classification and prediction tasks. The proposed IS consists of three main components. The data collection module accumulates data from various sources such as sensors, GPS trackers, websites, and social networks. The data processing module cleans, normalizes, and transforms the data to prepare it for analysis. The data analysis module uses the RF method to analyze data, predict outcomes, identify patterns, and make decisions. The conducted research has shown that the proposed IS can be an effective tool for predicting the results of sports competitions with high accuracy, identifying patterns in the data that can be useful for coaches and athletes to improve their training and strategy, personalizing training programs and recommendations for athletes, increasing the level of fan engagement by providing them with personalized content and forecasts. The proposed IS based on the random forest method is a powerful tool for processing and analyzing data in the sports industry. Its use can lead to improved athletes' performance, more informed coaching and strategic decisions, and increased fan engagement. 
One of the most powerful and accurate machine learning methods, the random forest method, allows for reliable analysis and forecasting based on various types of data, including player statistics, match results, physiological indicators, and fan behavior data. The article describes the stages of creating an information system: from data collection to data processing, storage, and analysis.
- Book Chapter
- 10.1007/978-981-99-1428-9_64
- Jan 1, 2023
Objective: To construct a common traditional Chinese medicine composite syndrome model based on multiple information processing methods. Methods: 1132 cases of colorectal cancer were collected by epidemiological investigation, and the case information of colorectal cancer patients was modeled by cluster analysis, BP neural network, SVM support vector machine and random forest method. Results: Among the syndrome models constructed by BP neural network, support vector machine and random forest, random forest had the best effect, and the recognition rate of each syndrome type was respectively: spleen deficiency and qi stagnation (65.1%), spleen and kidney yang deficiency (83.3%), kidney essence deficiency (92.3%), accumulation of damp and heat (97.7%), and deficiency of both qi and blood (96.3%). Conclusion: The common TCM complex syndrome model was successfully constructed, and the random forest method has the highest accuracy in judging syndrome types. The application of random forest modeling method can provide new ideas and methods for the standardization of TCM syndrome research.
- Research Article
5
- 10.33200/ijcer.1192590
- Mar 31, 2023
- International Journal of Contemporary Educational Research
The research aims to determine the factors affecting PISA 2018 reading skills using the Random Forest and MARS methods and to compare their prediction abilities. This study used information from 5713 students, 2838 (49.7%) male and 2875 (50.3%) female, in the PISA 2018 Turkey data. The analysis shows that the MARS method performed better than the Random Forest method. The most significant factor affecting reading skills in Turkey is “the number of books in the house” in both methods. The variables the MARS method finds significant are “students’ perception of difficulty, motivation for reading skills, father’s educational status, reading pleasure, bullying experience of the student, mother’s educational status, attitude towards school, classical artifacts at home, supplementary school books at home, competition at school, competitive power, cooperation perception at school, reading frequency, self-efficacy, poetry books at home, anxiety about reading skills and teacher support.” The other variables had no relation to prediction. This study is expected to serve as an example of a data mining application in educational research.
- Research Article
29
- 10.1371/journal.pone.0106117
- Aug 28, 2014
- PLoS ONE
Purpose: To diagnose glaucoma based on spectral domain optical coherence tomography (SD-OCT) measurements using the ‘Random Forests’ method. Methods: SD-OCT was conducted in 126 eyes of 126 open angle glaucoma (OAG) patients and 84 eyes of 84 normal subjects. The Random Forests method was then applied to discriminate between glaucoma and normal eyes using 151 OCT parameters including thickness measurements of circumpapillary retinal nerve fiber layer (cpRNFL), the macular RNFL (mRNFL) and the ganglion cell layer-inner plexiform layer combined (GCIPL). The area under the receiver operating characteristic curve (AROC) was calculated using the Random Forests method adopting leave-one-out cross validation. For comparison, AROCs were calculated based on each one of the 151 OCT parameters. Results: The AROC obtained with the Random Forests method was 98.5% [95% Confidence interval (CI): 97.1–99.9%], which was significantly larger than the AROCs derived from any single OCT parameter (maxima were: 92.8 [CI: 89.4–96.2] %, 94.3 [CI: 91.1–97.6] % and 91.8 [CI: 88.2–95.4] % for cpRNFL-, mRNFL- and GCIPL-related parameters, respectively; P<0.05, DeLong’s method with Holm’s correction for multiple comparisons). The partial AROC above specificity of 80%, for the Random Forests method was equal to 18.5 [CI: 16.8–19.6] %, which was also significantly larger than the AROCs of any single OCT parameter (P<0.05, Bootstrap method with Holm’s correction for multiple comparisons). Conclusions: The Random Forests method, analyzing multiple SD-OCT parameters concurrently, significantly improves the diagnosis of glaucoma compared with using any single SD-OCT measurement.
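The area under the ROC curve reported here can be read as the probability that a randomly chosen diseased eye receives a higher score than a randomly chosen normal eye. The sketch below computes it by pairwise comparison, which is the standard rank-based definition and not necessarily the software the authors used. The scores are invented for illustration.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count as half).

    pos_scores: classifier scores for positive cases (non-empty).
    neg_scores: classifier scores for negative cases (non-empty).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores, e.g. the fraction of trees voting "glaucoma".
glaucoma_scores = [0.9, 0.8, 0.4]
normal_scores = [0.7, 0.3, 0.2]
area = auc(glaucoma_scores, normal_scores)
print(area)
```

Under leave-one-out cross validation, each score would come from a model trained on all other eyes, so that no eye is scored by a model that has seen it.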
- Research Article
9
- 10.1167/iovs.14-14009
- Apr 17, 2014
- Investigative Ophthalmology & Visual Science
To combine multiple Heidelberg Retina Tomograph (HRT) parameters using the Random Forests classifier to diagnose glaucoma, both in highly and physiologically myopic (highly myopic) eyes and emmetropic eyes. Subjects consisted of healthy subjects and age-matched patients with open-angle glaucoma in emmetropic (-1.0 to +1.0 diopters [D], 63 and 59 subjects, respectively) and highly myopic eyes (-10.0 to -5.0 D, 56 and 64 subjects, respectively). First, area under the receiver operating characteristic curve (AUC) was derived using 84 HRT global and sectorial parameters and the representative HRT raw parameter (largest AUC) was identified. Then, the Random Forests method was carried out using age, refractive error, and 84 HRT parameters. The AUCs were also derived using the following: (1) Frederick S. Mikelberg discriminant function (FSM) score, (2) Reinhard O.W. Burk discriminant function (RB) score, (3) Moorfields regression analysis (MRA) score, and (4) glaucoma probability score (GPS). In combined emmetropic and highly myopic population, AUC with Random Forests method (0.96) was significantly larger than AUCs with the representative HRT raw parameter (vertical cup-to-disc ratio [global], 0.89), FSM (0.90), RB (0.83), MRA (0.87), and GPS (0.81) (P < 0.001). Similarly, AUC with the Random Forests method was significantly (P < 0.05) larger than these other parameters, both in emmetropic and highly myopic groups. Also, the Random Forests method achieved partial AUCs above 80%/90% significantly (P < 0.05) larger than any other HRT parameters in all populations. Evaluating multiple HRT parameters using the Random Forests classifier provided accurate diagnosis of glaucoma, both in emmetropic and highly myopic eyes.
- Research Article
- 10.62051/ijcsit.v2n3.15
- May 28, 2024
- International Journal of Computer Science and Information Technology
Landslides are the most important type of geological disaster in China. Heavy rainfall and earthquakes often cause mass landslides. Rapid extraction of postdisaster landslide information is an important method for realizing scientific and efficient emergency investigations and can provide a key basis for disaster assessment and emergency response decision-making. Using Fujian as an example, based on basic topography, geology and other data, the multiscale segmentation method is used for image segmentation, and different features are selected. The applicability of the rule-based extraction method and of support vector machine, random forest and other machine learning methods in small-scale landslide extraction in the study area is compared. The results showed that the extraction accuracy of the rule-based extraction method was 90.86%, the extraction accuracy of the support vector machine was 65.95%, and the extraction accuracy of the random forest was 93.43%. A comparison of the three methods revealed that the random forest method was most suitable for landslide extraction in the study area.
- Research Article
10
- 10.1029/2023jg007492
- Jul 1, 2023
- Journal of Geophysical Research: Biogeosciences
Forest age is one of the most important ecosystem characteristics for accurately estimating the magnitude and potential of carbon sink in forest ecosystems. During the past 40 years, national ecological restoration projects have led to the near doubling of the forest cover area in China, which has also substantially affected the dynamics of forest age. Therefore, there is an urgent need to generate long‐term forest age maps for China. This study reconstructed China forest age datasets (CFAD) from 1980 to 2015 at five year intervals at a 1 km spatial resolution by merging a satellite‐based forest age map in 2010 and forest cover dynamic maps from 1980 to 2015. The random forest method was used to reconstruct the forest age where forest age could not be inferred from the forest age base map in 2010 directly. CFAD showed a good agreement with the province‐level mean forest age derived from the several national forest inventories (R² ranged from 0.66 to 0.86). In general, the younger forests are mainly distributed in southern and eastern China. The older forests are mainly distributed in the mountain areas of northeast, northwest and southwest China. The average age of China's forests increased from 18.2 to 44.0 years old from 1980 to 2015. Based on the current forest age and future afforestation planning, the average forest age in China is predicted to reach 71.6 years old in 2060. The CFAD provides an alternative data set to obtain improved estimates of local and national forest carbon sinks in China.
- Research Article
257
- 10.1109/tits.2015.2405759
- Oct 1, 2015
- IEEE Transactions on Intelligent Transportation Systems
This paper adopts different supervised learning methods from the field of machine learning to develop multiclass classifiers that identify the transportation mode, including driving a car, riding a bicycle, riding a bus, walking, and running. Methods that were considered include K-nearest neighbor, support vector machines (SVMs), and tree-based models that comprise a single decision tree, bagging, and random forest (RF) methods. For training and validating purposes, data were obtained from smartphone sensors, including accelerometer, gyroscope, and rotation vector sensors. K-fold cross-validation as well as out-of-bag error was used for model selection and validation purposes. Several features were created from which a subset was identified through the minimum redundancy maximum relevance method. Data obtained from the smartphone sensors were found to provide important information to distinguish between transportation modes. The performance of different methods was evaluated and compared. The RF and SVM methods were found to produce the best performance. Furthermore, an effort was made to develop a new additional feature that entails creating a combination of other features by adopting a simulated annealing algorithm and a random forest method.
- Research Article
48
- 10.1007/s12403-019-00335-7
- Nov 25, 2019
- Exposure and Health
High-arsenic (As) groundwater was first discovered in the Yanchi region, Northwest China, which is an arid or semiarid area, and the groundwater quality seriously affects the health of local residents. A comprehensive understanding of the spatiotemporal distribution characteristics, water quality, and health risk of high-As groundwater is indispensable for the sustainable utilization of groundwater sources and resident health. Seventy-nine groundwater samples were collected from different aquifers and seasons. The hazard quotient (HQ) and carcinogenic risk (CR) of As for adults and children were assessed. Moreover, the effects of groundwater sampling site and seasonal change on As concentration were investigated. Then, the random forest method was used to evaluate the importance of the indicators and the influence of these important indicators on groundwater classification. Thirty-three percent of the groundwater samples had HQ values > 1, and the CR values of all groundwater samples were > 1.00 × 10⁻⁶ for children, representing a serious health risk. Twenty-one percent of the groundwater samples posed a health risk for adults. High-As groundwater is present at depths less than 60 m, and groundwater As concentrations are slightly affected by seasonal changes. The random forest shows that the most important indicators that affect groundwater quality are Na, TDS, TH, and F, and the least important is As. Furthermore, the optimal set of indicators contained all four of the most important indicators obtained by the random forest model, which achieved a classification accuracy of 88.21% for groundwater quality.
- Research Article
32
- 10.1007/s11629-018-4898-1
- Oct 1, 2018
- Journal of Mountain Science
Mountainous rangelands play a pivotal role in providing forage resources for livestock, particularly in summer, and maintaining ecological balance. This study aimed to identify environmental variables affecting range plant species distribution, to analyze ecologically the relationship between these variables and the distribution of plants, and to model and map plant habitat suitability by the Random Forest Method (RFM) in rangelands of the Taftan Mountain, Sistan and Baluchestan Province, southeastern Iran. In order to determine the environmental variables and estimate the potential distribution of plant species, the presence points of plants were recorded using a systematic random sampling method (90 points of presence) and soils were sampled in 5 habitats by a random method at 0–30 and 30–60 cm depths. The layers of environmental variables were prepared using the Kriging interpolation method and Geographic Information System facilities. The distribution of the plant habitats was finally modelled and mapped by the RFM. Continuous maps of the habitat suitability were converted to binary maps using the Youden Index (J) in order to evaluate the accuracy of the RFM in estimating the distribution of species potential habitat. Based on the values of the area under curve (AUC) statistics, the accuracy of the predictive models of all habitats was at a good level. The agreement between the predicted map generated by each model and the actual maps generated from field-measured data was high for all habitats, except for the Amygdalus scoparia habitat. This study concluded that the RFM is a robust model to analyze the relationships between the distribution of plant species and environmental variables as well as to prepare potential distribution maps of plant habitats that are of higher priority for conservation on the local scale in arid mountainous rangelands.
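Converting a continuous suitability map to a binary one with the Youden Index means choosing the cut-off that maximizes J = sensitivity + specificity - 1. The sketch below illustrates that selection on invented data. The function name, the suitability values, and the presence labels are assumptions and do not come from the study.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: continuous predictions; labels: 1 for presence, 0 for absence.
    A score >= threshold is treated as predicted presence.
    Assumes both classes occur at least once in `labels`.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # on ties, the lowest threshold is kept
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical habitat-suitability scores and observed presence/absence.
suitability = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
presence = [0, 0, 0, 1, 0, 1, 1]
threshold, j_value = youden_threshold(suitability, presence)
binary_map = [1 if s >= threshold else 0 for s in suitability]
print(threshold, j_value, binary_map)
```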
- Research Article
2
- 10.1155/2021/3230343
- Jan 1, 2021
- Advances in Civil Engineering
Liquefaction evaluation on the sands induced by earthquake is of significance for engineers in seismic design. In this study, the random forest (RF) method is introduced and adopted to evaluate the seismic liquefaction potential of soils based on the shear wave velocity. The RF model was developed using the Andrus database as a training dataset comprising 225 sets of liquefaction performance and shear wave velocity measurements. Five training parameters are selected for RF model including seismic magnitude (Mw), peak horizontal ground surface acceleration (amax), stress‐corrected shear wave velocity of soil (Vs1), sandy‐layer buried depth (ds), and a new introduced parameter, stress ratio (k). In addition, the optimal hyperparameters for the random forest model are determined based on the minimum error rate for the out‐of‐bag dataset (ERROOB) such as the number of classification trees, maximum depth of trees, and maximum number of features. The established random forest model was validated using the Kayen database as testing dataset and compared with the Chinese code and the Andrus methods. The results indicated that the random forest method established based on the training dataset was credible. The random forest method gave a success rate for liquefied sites and even a total success rate for all cases higher than 80%, which is completely acceptable. By contrast, the Chinese code method and the Andrus methods gave a high success rate for liquefaction but very low for nonliquefaction which led to the increase of engineering cost. The developed RF model can provide references for engineers to evaluate liquefaction potential.