APPLICATION OF THE THRESHOLD SEGMENTATION METHOD FOR ASSESSING FOREST CHARACTERISTICS BASED ON HIGH-RESOLUTION RESURS-P1 SATELLITE DATA
This article presents the results of a study examining the potential of threshold segmentation of inter-crown areas in forest canopy images, using domestic ultra-high-resolution imagery from the Resurs-P1 (Geoton-L) satellite, to identify relationships between segmentation parameters and the biometric characteristics of pine stands, with the forests of the Curonian Spit National Park as a case study. The proposed method identifies shaded segments of the inter-crown space within stand boundaries for a specified brightness range, then merges adjacent pixels by spectral proximity at a new specified brightness threshold. For each threshold, the areas and average brightness values of shadow segments within the stand boundaries, their standard deviations, and median values are determined. From these values, a threshold canopy closure is calculated for each stand, taking into account only shaded inter-crown spaces. The statistical characteristics of average brightness and the threshold canopy closure serve as variables for regression modeling of the biometric characteristics (height, diameter, and age) of pine stands. The regression analysis was conducted with the Random Forest (RF) ensemble method of decision tree construction. The coefficient of determination R² for the pine forest characteristics ranges from 0.29 to 0.37. Validation results on the test set are virtually identical to those on the training set, demonstrating the robustness of the RF model. Regression modeling of pine stand characteristics with the RF algorithm (demonstrated on pure pine stands in the Curonian Spit National Park), with predictors derived from threshold segmentation of forest canopy images in Geoton-L panchromatic imagery, yields stable results, with root-mean-square errors of approximately 4 m for average height, 6 cm for diameter, and 20 years for age.
Threshold segmentation of tree canopy images is useful for preliminary assessment of stand characteristics in cases where radiometric correction of spectral data is insufficient for calculating standard textural characteristics.
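As a rough illustration of the pipeline described above, the sketch below thresholds a toy brightness grid, merges adjacent shadow pixels into 4-connected segments, and derives the segment statistics and a threshold canopy closure of the kind used as regression predictors. This is a minimal pure-Python sketch of the general idea, not the authors' implementation; the function names and the closure formula (one minus the shadow fraction) are assumptions.

```python
from statistics import mean, median, stdev

def shadow_segments(image, low, high):
    """Label 4-connected components of pixels whose brightness falls
    inside [low, high] (the assumed shaded inter-crown range)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    segments = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] == 0 and low <= image[r][c] <= high:
                # flood-fill one segment of spectrally similar pixels
                stack, pixels = [(r, c)], []
                labels[r][c] = len(segments) + 1
                while stack:
                    y, x = stack.pop()
                    pixels.append(image[y][x])
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and labels[ny][nx] == 0
                                and low <= image[ny][nx] <= high):
                            labels[ny][nx] = len(segments) + 1
                            stack.append((ny, nx))
                segments.append(pixels)
    return segments

def stand_predictors(image, low, high):
    """Statistics of the shaded inter-crown pixels within one stand."""
    segs = shadow_segments(image, low, high)
    shadow = [b for s in segs for b in s]
    if not shadow:
        return {"closure": 1.0, "mean": 0.0, "median": 0.0, "stdev": 0.0}
    n_total = len(image) * len(image[0])
    return {
        "closure": 1 - len(shadow) / n_total,  # assumed: 1 - shadow fraction
        "mean": mean(shadow),
        "median": median(shadow),
        "stdev": stdev(shadow) if len(shadow) > 1 else 0.0,
    }
```

A stand would then be represented by one such dictionary per brightness threshold, and those values fed to the regressor.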
- Conference Article
1
- 10.1145/3297730.3297747
- Aug 25, 2018
Random forest is a stochastic ensemble: the forest comprises many decision trees, and in a random forest there is no correlation between the individual trees. Each tree is built from a bootstrap sample drawn with replacement, and classification or prediction proceeds by voting across the trees. The algorithm overcomes the performance bottleneck of a single classifier and is therefore widely used. It nonetheless has room for improvement: the random forest algorithm runs inefficiently on imbalanced data sets, and this paper proposes approaches that rebalance the data during the training process while improving prediction speed and shortening running time, based on the characteristics of the random forest construction process. Drawing on the domestic and foreign literature, this paper studies the optimization of random forests from two aspects. The random forest algorithm is an ensemble learning method in the field of machine learning: it aggregates the classification results of multiple decision trees into a global classifier. Compared with other classification algorithms, it has many advantages: high classification accuracy, small generalization error, the ability to handle high-dimensional data, and a training process that is quick and easy to parallelize. Owing to these advantages, the random forest algorithm has been widely adopted and has become one of the methods of choice for classification problems.
However, when the class distribution is imbalanced, that is, when the number of samples in one class is far smaller than in the others, the random forest algorithm becomes ineffective and its generalization error on the minority class grows. So far there has been little research on the imbalanced-data problem for random forest classification, and no direct, effective method exists; existing work simply combines general techniques for imbalanced data, such as sampling or cost-sensitive methods. Improving the classification of imbalanced data at the level of the random forest algorithm itself is therefore a significant research problem. Building on this, this paper analyzes the key steps that affect random forest classification performance and designs a solution for imbalanced data. We propose an improved random forest algorithm for imbalanced classification, focusing on two aspects: subspace selection and model integration in random forests. The influence of balanced sampling on the algorithm is also examined through experiments. Finally, the improved random forest algorithm is validated on public data sets: compared with the original random forest algorithm, it shows clear improvement on most indicators (cross-validation accuracy, AUC, Kappa coefficient, and F1-measure), demonstrating the importance of subspace selection and model optimization for the random forest algorithm. The research has academic significance and practical value for guiding the classification of imbalanced data, and can be applied to spam detection, anomaly detection, medical diagnosis, DNA sequence identification, and other fields.
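One common remedy that the imbalanced-data literature combines with random forests is balanced sampling per tree. The sketch below shows such a balanced bootstrap, downsampling the majority class to the minority-class size in each tree's training sample. It illustrates the generic technique mentioned above, not this paper's specific subspace-selection and model-integration method; the function name is an assumption.

```python
import random

def balanced_bootstrap(X, y, minority_label, seed=0):
    """Draw a per-tree bootstrap sample in which the minority and
    majority classes contribute equally many instances, each sampled
    with replacement from its own class."""
    rng = random.Random(seed)
    minority = [i for i, lbl in enumerate(y) if lbl == minority_label]
    majority = [i for i, lbl in enumerate(y) if lbl != minority_label]
    # equal counts from each class, sized by the minority class
    idx = [rng.choice(minority) for _ in minority] + \
          [rng.choice(majority) for _ in minority]
    return [X[i] for i in idx], [y[i] for i in idx]
```

Each tree in the forest would call this with a different seed, so trees see different balanced views of the data.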
- Research Article
1
- 10.33220/1026-3365.136.2020.165
- Jun 25, 2020
- Forestry and Forest Melioration
In the forest fund of the state forestry enterprises of Zhytomyr region, which are typical of Central Polissya, pine stands account for 31.5 to 87.7 % of the forest-covered land area. Fresh and moist subir and suhrud site types predominate. Pure pine stands make up on average 23.9 % of the forest-covered land area and 37.3 % of the pine stand area, while the area potentially suitable for creating pure pine stands is 7.7 % and 11.2 %, respectively. These calculations provide grounds for increasing the resilience of the region's pine stands by expanding the area of mixed stands on sites with suitable forest growth conditions. At the same time, possible shifts in hygrotopes resulting from the climate change of recent years should be taken into account.
- Research Article
16
- 10.1109/access.2017.2656618
- Jan 1, 2017
- IEEE Access
Random Forests are powerful classification and regression tools that are commonly applied in machine learning and image processing. In the majority of random classification forest algorithms, the Gini index and the information gain ratio are used for node splitting. However, these two node-split methods may pay little attention to the intrinsic structure of the attribute variables and fail to find attributes with strong discriminative ability as a group yet weak ability as individuals. In this paper, we propose an innovative method for splitting tree nodes based on cooperative game theory, from which attributes with good discriminative ability as a group can be learned. This new random forests algorithm is called Cooperative Profit Random Forests (CPRF). Experimental comparisons with several other existing random classification forest algorithms are carried out on several real-world data sets, including remote sensing images. The results show that CPRF outperforms other existing Random Forests algorithms in most cases. In particular, CPRF achieves promising results in ocean front recognition.
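For context, the Gini-based node splitting that CPRF replaces can be sketched in a few lines: a candidate split is scored by the decrease in Gini impurity it produces. This is a generic illustration of the standard criterion, not the cooperative-game criterion proposed in the paper.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, left, right):
    """Impurity decrease of a binary split; a Gini-based forest
    chooses the split that maximises this at each node."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) \
                        - (len(right) / n) * gini(right)
```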
- Conference Article
- 10.1109/bigdata52589.2021.9671872
- Dec 15, 2021
The United Nations collects data at gigabyte scale from member countries for the Sustainable Development Goals, and recognizes the impact big data collection has on opportunities and risks for sustainable development. The United Nations (UN) Sustainable Development Goals (SDG) open data hub provided a dataset that details the "proportion of time spent on unpaid domestic chores and care work" worldwide. This primary dataset is only a portion of the data available in the research area, and it contains many related variables, such as the population's age, sex, and location metrics. To better understand the impact of unpaid domestic work, the dataset was analyzed in conjunction with another dataset from the UN Statistics Division detailing the rate of divorce/separation in the world population. Our analysis included methods such as principal component analysis, the k-means clustering algorithm, the random forest algorithm, and a neural network. The principal components were clustered with the k-means algorithm to group the variables in a manner that explains the variance present in the dataset. The analysis found that age, sex, and location demographics are key variables that explain the variation between countries in the percentage of time spent on unpaid domestic work. Machine learning algorithms enabled confirmation of this relationship. Using the key variables identified, a random forest of decision trees and a neural network were generated to classify the percentage of time spent on unpaid domestic work. Similarly, the random forest algorithm and a neural network were also implemented to classify geographical regions.
These models were compared to determine the strength of the relationship between age, sex, location metrics, the percentage of time spent, and geographical regions. The analysis detailed in this work strives to identify the social factors that classify the percentage of time spent on unpaid domestic work in accordance with UN SDG 5.4, which is to "recognize and value unpaid care and domestic work through the provision of public services, infrastructure, and social protection policies and the promotion of shared responsibility within the household and the family as nationally appropriate."
- Research Article
- 10.47065/bits.v6i3.6476
- Dec 30, 2024
- Building of Informatics, Technology and Science (BITS)
Kidney failure is one of the most common chronic diseases worldwide. It is a serious condition that occurs when the kidneys lose their ability to filter waste and excess fluid from the blood, with kidney function declining significantly or stopping altogether. Kidney failure has wide-ranging effects on patients' physical, mental, and social health, so early treatment and a holistic approach are needed to minimize its impact. In the health sector, technological advances have enabled more effective processing of medical data through data mining, the process of exploring and analyzing large amounts of data to find previously unknown patterns, relationships, or valuable information. Classification in data mining is the process of grouping data into classes or labels based on its attributes or features, and it includes algorithms such as K-Nearest Neighbor (KNN) and Random Forest (RF), two algorithms widely used in classification tasks. This study therefore compares the performance of the K-Nearest Neighbor and Random Forest algorithms, evaluating which is the more effective and efficient for this problem across various evaluation metrics. Overall, both algorithms achieve accuracy above 90%, but the Random Forest algorithm performs better, reaching an accuracy of 99.75%. The model produced by the Random Forest algorithm will therefore be used to assist in the process of diagnosing kidney failure.
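For reference, the KNN side of such a comparison reduces to a majority vote among the nearest training points. A minimal sketch, using squared Euclidean distance (which preserves the nearest-neighbor ordering); the function name is an assumption, not this study's implementation:

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points."""
    # squared Euclidean distance; same ordering as true Euclidean
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(range(len(train_X)),
                     key=lambda i: dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```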
- Conference Article
14
- 10.1109/ijcnn.2016.7727772
- Jul 1, 2016
Random forests are a class of ensemble methods for classification and regression with randomizing mechanisms in bagging instances and selecting feature subspaces. For high-dimensional data, the performance of random forests degenerates because of the random sampling of the feature subspace at each node during the construction of decision trees. To address this issue, in this paper we propose a new Principal Component Analysis and Stratified Sampling based method, called PCA-SS, for feature subspace selection in random forests on high-dimensional data. For each decision tree in the forest, we first create the training data by bagging instances and partition the feature set into several feature subsets. Principal Component Analysis (PCA) is applied to each feature subset to obtain transformed features, and all principal components are retained in order to preserve the variability information of the data. Secondly, according to a principal-component criterion, the transformed features are partitioned into informative and less informative parts. When constructing each node of a decision tree, a feature subspace is selected by stratified sampling from the two parts. The PCA-SS based random forests algorithm, named PSRF, ensures enough informative features at each tree node, and it also increases the diversity between trees to a certain extent. Experimental results demonstrate that the proposed PSRF significantly improves the performance of random forests on high-dimensional data, compared with state-of-the-art random forests algorithms.
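The stratified subspace-sampling step can be sketched as follows: given features already split into informative and less informative parts (for example, by a PCA variance criterion), each node draws a fixed share from each part. The 80/20 ratio and the function name here are illustrative assumptions, not values from the paper.

```python
import random

def stratified_subspace(informative, less_informative, m, ratio=0.8, seed=0):
    """Pick an m-feature subspace for one tree node, drawing share
    `ratio` from the informative part and the rest from the less
    informative part (a sketch of the stratified-sampling idea)."""
    rng = random.Random(seed)
    k = min(round(m * ratio), len(informative))
    sub = rng.sample(informative, k) + rng.sample(less_informative, m - k)
    rng.shuffle(sub)
    return sub
```

Guaranteeing a minimum number of informative features per node is what keeps individual trees strong, while the random remainder preserves diversity between trees.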
- Research Article
64
- 10.1109/tcyb.2020.2972956
- Aug 21, 2020
- IEEE Transactions on Cybernetics
The original random forests (RFs) algorithm has been widely used and has achieved excellent performance on classification and regression tasks. However, research on the theory of RFs lags far behind its applications. In this article, to narrow the gap between the applications and the theory of RFs, we propose a new RFs algorithm, called random Shapley forests (RSFs), based on the Shapley value. The Shapley value is one of the well-known solutions in cooperative game theory, which can fairly assess the power of each player in a game. In the construction of RSFs, the Shapley value is used to evaluate the importance of each feature at each tree node by computing the dependency among the possible feature coalitions. In particular, inspired by the existing consistency theory, we have proved the consistency of the proposed RFs algorithm. Moreover, to verify the effectiveness of the proposed algorithm, experiments on eight UCI benchmark datasets and four real-world datasets have been conducted. The results show that RSFs perform better than, or at least comparably to, the existing consistent RFs, the original RFs, and a classic classifier, support vector machines.
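The Shapley value underlying RSFs can be computed exactly for small games by averaging each player's marginal contribution over all join orders. The sketch below does this for a toy characteristic function; it is exponential in the number of players and is meant only to illustrate the definition, not the feature-importance computation inside RSFs.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley value of each player for a characteristic
    function value(coalition) -> payoff (coalitions as frozensets),
    averaging marginal contributions over all join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: v / len(orders) for p, v in phi.items()}
```

In a symmetric game where any two players together earn the full payoff, each of three players receives exactly one third, matching the fairness axioms the abstract alludes to.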
- Research Article
- 10.54646/bije.2022.09
- Jan 1, 2022
- BOHR International Journal of Engineering
The Random Forest (RF) algorithm, originally proposed by Breiman et al. (1), is a widely used machine learning algorithm that gains its merit from its fast learning speed as well as high classification accuracy. However, despite its widespread use, the different mechanisms at work in Breiman’s RF are not yet fully understood, and there is still on-going research on several aspects of optimizing the RF algorithm, especially in the big data environment. To optimize the RF algorithm, this work builds new ensembles that optimize the random portions of the RF algorithm using genetic algorithms, yielding Random Genetic Forests (RGF), Negatively Correlated RGF (NC-RGF), and Preemptive RGF (PFS-RGF). These ensembles are compared with Breiman’s classic RF algorithm in Hadoop’s big data framework using Spark on a large, high-dimensional network intrusion dataset, UNSW-NB15.
- Research Article
1
- 10.54646/bije.009
- Jan 1, 2022
- BOHR International Journal of Engineering
The Random Forest (RF) algorithm, originally proposed by Breiman [7], is a widely used machine learning algorithm that gains its merit from its fast learning speed as well as high classification accuracy. However, despite its widespread use, the different mechanisms at work in Breiman’s RF are not yet fully understood, and there is still on-going research on several aspects of optimizing the RF algorithm, especially in the big data environment. To optimize the RF algorithm, this work builds new ensembles that optimize the random portions of the RF algorithm using genetic algorithms, yielding Random Genetic Forests (RGF), Negatively Correlated RGF (NC-RGF), and Preemptive RGF (PFS-RGF). These ensembles are compared with Breiman’s classic RF algorithm in Hadoop’s big data framework using Spark on a large, high-dimensional network intrusion dataset, UNSW-NB15.
- Conference Article
3
- 10.1109/icufn.2018.8436590
- Jul 1, 2018
Along with the steady growth of wired and wireless networks, various new attacks targeting networks are constantly emerging and transforming. As an efficient way to cope with various attacks, the Random Forest (RF) algorithm has frequently been used as the core engine of intrusion detection because of its faster learning speed and higher attack detection accuracy. However, the RF algorithm requires the number of trees composing the forest as an input parameter. In this paper, we propose a new algorithm that limits the number of trees composing the forest using the McNemar test. To evaluate the performance of the proposed RF algorithm, we compared its learning time, accuracy, and memory usage with the original RF algorithm and another algorithm using the KDDcup99 dataset. At the same detection accuracy, the proposed RF algorithm improves on the original RF algorithm by as much as 97.76% in learning time, 91.86% in test time, and 99.02% in memory usage on average.
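The McNemar test used to decide when additional trees stop helping compares two classifiers through their discordant predictions. A minimal sketch of the chi-squared statistic with continuity correction follows; how exactly the paper applies it during forest growth is not reproduced here.

```python
def mcnemar_statistic(b, c):
    """McNemar chi-squared statistic (with continuity correction)
    from the discordant counts: b = cases only model A got right,
    c = cases only model B got right. Values above ~3.84 reject
    equal error rates at the 5% level (1 degree of freedom)."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

If adding trees no longer produces a significant statistic against the current forest, forest growth can stop, which is what bounds the tree count.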
- Research Article
17
- 10.1007/s10661-015-4914-7
- Oct 20, 2015
- Environmental Monitoring and Assessment
Knowledge of the spatial extent of forested wetlands is essential to many studies, including wetland functioning assessment, greenhouse gas flux estimation, and identification of suitable wildlife habitat. For discriminating forested wetlands from adjacent land cover types, researchers have resorted to image analysis techniques applied to numerous remotely sensed data. While these have had some success, there is still no consensus on the optimal approach for mapping forested wetlands. To address this problem, we examined two machine learning approaches, the random forest (RF) and K-nearest neighbor (KNN) algorithms, and applied them in both pixel-based and object-based classification frameworks. The RF and KNN algorithms were constructed using predictors derived from Landsat 8 imagery, Radarsat-2 advanced synthetic aperture radar (SAR), and topographical indices. The results show that the object-based classifications performed better than per-pixel classifications using the same algorithm (RF) in terms of overall accuracy, and the difference in their kappa coefficients is statistically significant (p<0.01). There were noticeable omissions of forested and herbaceous wetlands in the per-pixel classifications using the RF algorithm. For the object-based image analysis, there was also a statistically significant difference (p<0.01) in kappa coefficient between the results of the RF and KNN algorithms. The object-based classification using RF provided a more visually adequate distribution of the land cover types of interest, while the object-based classification using the KNN algorithm showed noticeable commissions of forested wetlands and omissions of agricultural land. This research shows that object-based classification with RF using optical, radar, and topographical data improved the mapping accuracy of land covers and provides a feasible approach to discriminating forested wetlands from other land cover types in forested areas.
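The kappa coefficient used above to compare classifications measures agreement beyond what class frequencies alone would produce. A minimal sketch from a square confusion matrix (the matrix values in the test are illustrative, not the study's):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (list of lists):
    observed agreement corrected for chance agreement."""
    n = sum(sum(row) for row in confusion)
    po = sum(confusion[i][i] for i in range(len(confusion))) / n
    pe = sum(sum(confusion[i]) * sum(r[i] for r in confusion)
             for i in range(len(confusion))) / n ** 2
    return (po - pe) / (1 - pe)
```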
- Research Article
- 10.31548/forest2019.03.062
- Sep 25, 2019
- Ukrainian Journal of Forest and Wood Science
The priority task of forestry science and forest-cultural practice is to study the influence of introduced woody plant species on the growth, productivity, and quality of pine stands. The purpose of this research was therefore to identify the features of the growth and productivity dynamics of pine stands with an artificial red oak admixture, created in the last century at the Boyarka Forest Research Station. All 16 sample plots were established in fresh pine sites. The stands are mostly pure in composition; the proportion of red oak does not exceed 20 %. The pine stands represent different age groups: young, mid-aged, maturing, and mature. The biometric indices of pine stands with an artificial red oak admixture and the features of their growth and productivity dynamics have been determined. The understory is represented by red oak, which in some cases reaches the height of the main canopy. Highly productive pine stands are characterized by Ia and Ib site index classes. Modeling the dynamics of mean heights, diameters, and basal area established that the height growth of pine stands is described by a power equation with a high coefficient of determination: R2 = 0.914. Analysis of the height growth dynamics of pine stands with a red oak understory demonstrates a difference from the growth patterns of pure pine stands. The high thinning intensity of the researched stands at a young age reduces their height growth compared with other pine stands. At the age of 60 years the heights become equal, and at maturity, stands with an understory canopy exceed pure pine stands in height growth by 4.4 %. At maturity, the difference in diameters between pine stands with an understory canopy and pure pine stands of site index class Ia reaches 6.7 %. These growth peculiarities leave their imprint on the actual productivity of the pine stands.
The productivity of pine stands with a red oak understory rises steadily from the mid-aged group onward and reaches site index class Ib at maturity. This confirms the viability of introducing red oak under the canopy of pine stands.
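The power-equation growth modeling mentioned above can be illustrated by fitting H = a * A**b with ordinary least squares on log-log axes. The data in the test are synthetic; the coefficients of the study's actual height curve are not reproduced here.

```python
import math

def fit_power_law(ages, heights):
    """Least-squares fit of H = a * A**b via linear regression of
    log(H) on log(A); returns (a, b)."""
    xs = [math.log(a) for a in ages]
    ys = [math.log(h) for h in heights]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b
```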
- Research Article
1
- 10.1088/1742-6596/2907/1/012014
- Dec 1, 2024
- Journal of Physics: Conference Series
The main goal of blasting in open pit mines is to create adequate rock breakage while reducing adverse outcomes like flyrock, ground vibration, and back break. Of these, back break (BB) is a serious consequence of blasting in open pit mines, as it frequently diminishes economic advantages and has a negative impact on mine safety. As a result, accurate BB prediction is critical for mine blast design and other production operations. In this study, the grey wolf optimizer (GWO) and random forest (RF) algorithms were implemented to predict BB. 61 sets of data collected from the A and B mines of the Sangan iron ore complex, Iran, were considered. Seven parameters affecting BB, i.e., the ratio of row spacing to burden (S/B), blasthole length (L), specific drilling (SD), sub-drilling (U), specific charge (P), stemming (T), and average charge in each blasthole (Q), and their corresponding BB values were measured. To implement the suggested methods, in the first stage, 48 data sets were utilized as training phase data and the remaining data sets were used as test phase data. Then, the coefficient of determination (R2) was employed on the training and testing data to evaluate the efficiency of the suggested models. The precision of the GWO and RF algorithms was further evaluated in comparison to multiple linear regression (MLR) analysis. The coefficient of determination values for GWO, RF, and MLR for the training phase were 0.922, 0.948, and 0.643 respectively, while for the testing phase they were 0.959, 0.966, and 0.733, indicating that both of the artificial intelligence approaches, GWO and RF, are more efficient than MLR. Also, the calculated values of the VAF and RMSE indicators reveal that the GWO and RF algorithms can accurately predict BB values. Finally, the sensitivity analysis performed on the input parameters showed that the average charge in each blasthole (Q) has the greatest impact and specific drilling (SD) has the least impact on BB.
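The evaluation indicators used above (R2, RMSE, and VAF) can be sketched directly from their definitions. The percent convention for VAF is an assumption based on common usage in this literature.

```python
from statistics import mean, pvariance

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean(actual)) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root-mean-square error."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)]) ** 0.5

def vaf(actual, predicted):
    """Variance accounted for, in percent."""
    resid = [a - p for a, p in zip(actual, predicted)]
    return (1 - pvariance(resid) / pvariance(actual)) * 100
```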
- Research Article
4
- 10.3390/rs17020267
- Jan 13, 2025
- Remote Sensing
Nitrogen and phosphorus are limiting nutrients in freshwater ecosystems, and the remote estimation of total phosphorus (TP) and total nitrogen (TN) in eutrophic waters is of great significance. This study utilized machine learning algorithms based on Sentinel-2 satellite imagery for remote estimation of TP and TN concentrations in Lakes Xingkai, Chagan, and Songhua. Results indicate that the random forest (RF) and XGBoost regression algorithms performed best. The performance of the GBDT algorithm was slightly lower than that of the RF and XGBoost regression algorithms, the BP algorithm overfitted, and the SVR algorithm fit poorly. Results showed that the TN concentration inversion model based on the RF algorithm had the highest accuracy (R2 = 0.98, RMSE = 0.09, MAPE = 19.74%). The Extreme Gradient Boosting (XGB) model also performed well, though slightly less accurately than RF (R2 = 0.97, RMSE = 0.14, MAPE = 20.67%). For TP concentration, the XGB model’s performance (R2 = 0.82, RMSE = 0.08, MAPE = 24.89%) was comparable to that of the RF model (R2 = 0.82, RMSE = 0.07, MAPE = 29.55%). The RF algorithm was applied to all cloud-free Sentinel-2 satellite images of these typical lakes in northeastern China during the non-glacial period from 2017 to 2023, generating spatiotemporal distribution maps of TP and TN concentrations. Between 2017 and 2023, TP concentrations in Lakes Xingkai, Chagan, and Songhua showed increasing, decreasing, and initially decreasing then increasing patterns, respectively. A positive correlation between temperature and TP concentration was observed, as higher temperatures enhance biological activity. In contrast, a negative correlation was found with TN concentration, as higher temperatures promote phytoplankton growth and reproduction. This study not only offers a new method for monitoring eutrophication in lakes but also provides valuable support for sustainable water resource management and ecological protection goals.
- Conference Article
22
- 10.1109/icecit54077.2021.9641120
- Sep 14, 2021
Skin disease is a widespread and serious issue in today's world, and skin disorder classification is crucial for diagnosis. Several new data mining algorithms have been developed to classify and interpret medical images. This article describes the functionality of the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms, along with an analysis of their results. Furthermore, this study demonstrates a high-performing approach that saves both effort and money. The proposed model is designed based on KNN and the Random Forest algorithm. Patients can use the model to classify their skin disease as a primary detection, and doctors can also confirm their judgment with it. Traditional skin disease diagnosis is an expensive and time-consuming procedure. The classification model proposed in this paper identifies ten different skin diseases. The Random Forest algorithm has a testing accuracy of 94.22 percent, and K-Nearest Neighbors (KNN) has a testing accuracy of 95.23 percent. The KNN algorithm has an F1 score of 95.98 percent, whereas the Random Forest (RF) algorithm has an F1 score of 95.94 percent. Accuracy could be increased by expanding the dataset and extracting more features. This approach may benefit individuals with skin illness who are looking to save money and time, and may help avoid skin cancer by identifying it at an early stage.
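The reported F1 scores follow directly from precision and recall. A minimal sketch from raw confusion counts (the counts in the test are illustrative, not this study's results):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from true
    positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```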