What kind of questions do developers ask on Stack Overflow? A comparison of automated approaches to classify posts into question categories
On question and answer sites, such as Stack Overflow (SO), developers use tags to label the content of a post and to support developers in question searching and browsing. However, these tags mainly refer to technological aspects instead of the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of discussed topics in posts on SO. In this paper, we aim to automate the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then manually classified 1,000 SO questions according to our new taxonomy. In addition to the question category, we marked the phrases that indicate a question category for each of the posts. We then use this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the phrases to find patterns. Based on regular expressions, we implemented a classifier for each of the categories that determines whether a post belongs to that category. These regular expressions are derived by analyzing patterns in the phrases. In the second approach, we use the curated data set to train classification models of supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations regarding the preprocessing of the text and the representation of the input data. Then, we compared the performance of the regex approach with the performance of the best configuration that uses machine learning algorithms on a validation set of 110 posts. The results show that using the regular expression approach, we can classify posts into the correct question category with an average precision and recall of 0.90, and an MCC of 0.68.
Additionally, we applied the regex approach on all questions of SO that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also occur together frequently. Our approach can be used to support developers in browsing SO discussions or researchers in building recommender systems based on SO.
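The regex-based, per-category classification and the co-occurrence count described above can be sketched roughly as follows. This is a minimal illustration only: the patterns are hypothetical and invented for this example, since the abstract names the categories API usage, Conceptual, and Discrepancy but does not give the actual regular expressions.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical patterns, invented for illustration; the paper's real
# regular expressions are not given in the abstract.
CATEGORY_PATTERNS = {
    "API usage": re.compile(r"\bhow (?:do|can|to)\b", re.IGNORECASE),
    "Conceptual": re.compile(r"\b(?:what is|why does|difference between)\b", re.IGNORECASE),
    "Discrepancy": re.compile(r"\b(?:does not work|not working|unexpected)\b", re.IGNORECASE),
}


def classify(post):
    """Return every category whose pattern matches the post (multi-label)."""
    return [name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(post)]


def co_occurrences(posts):
    """Count how often each pair of categories is assigned to the same post."""
    counts = Counter()
    for post in posts:
        for pair in combinations(sorted(classify(post)), 2):
            counts[pair] += 1
    return counts


posts = [
    "How do I parse JSON? My code is not working.",
    "What is the difference between a Service and a Thread?",
    "How can I read a file? It throws an unexpected error.",
]
pair_counts = co_occurrences(posts)
```

Running one independent pattern per category keeps the classifier multi-label, which is what makes a co-occurrence analysis possible in the first place.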
- Conference Article
61
- 10.1145/3196321.3196333
- May 28, 2018
Software developers frequently solve development issues with the help of question and answer web forums, such as Stack Overflow (SO). While tags exist to support question searching and browsing, they are more related to technological aspects than to the question purposes. Tagging questions with their purpose can add a new dimension to the investigation of topics discussed in posts on SO. In this paper, we aim to automate such a classification of SO posts into seven question categories. As a first step, we have manually created a curated data set of 500 SO posts, classified into the seven categories. Using this data set, we apply machine learning algorithms (Random Forest and Support Vector Machines) to build a classification model for SO questions. We then experiment with 82 different configurations regarding the preprocessing of the text and representation of the input data. The results of the best performing models show that our models can classify posts into the correct question category with an average precision and recall of 0.88 and 0.87 when using Random Forest and the phrases indicating a question category as input data for the training. The obtained model can be used to aid developers in browsing SO discussions or researchers in building recommenders based on SO.
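For readers unfamiliar with the reported metrics, the sketch below shows one way to compute per-category precision and recall and average them over categories. It assumes a single label per post and an unweighted (macro) average; the abstract does not state which averaging the authors used, and the category names in the example are placeholders.

```python
def precision_recall(gold, predicted, category):
    """Precision and recall for one category, treating it as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_average(gold, predicted):
    """Unweighted mean of per-category precision and recall (an assumption)."""
    categories = sorted(set(gold))
    scores = [precision_recall(gold, predicted, c) for c in categories]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# Placeholder labels "A" and "B" stand in for question categories.
gold = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
avg_precision, avg_recall = macro_average(gold, predicted)
```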
- Conference Article
4
- 10.1109/sera51205.2021.9509047
- Jun 20, 2021
Stack Overflow is a question-answer community that provides rich information about computer programming and technology for software developers. Users can ask and answer questions on a wide range of programming topics, as well as search for problems that other users have faced and find solutions that other users have suggested. From the viewpoint of a technology product owner, Stack Overflow can report various issues that product users have, and this serves as valuable input to the product improvement process. This paper proposes an automated approach to classifying questions that are posted on Stack Overflow with regard to a certain kind of product, namely database products. The categories of questions are defined at two levels, i.e., problem and subproblem. The problem level includes development, installation, and performance tuning, while the subproblem level consists of design, limitation, and discussion. By cross-combining the two levels, questions can be classified into nine problem-subproblem classes. Natural language processing and text classification are used with several machine learning algorithms, i.e., Naïve Bayes, Decision Tree, Extra Trees, Random Forest, Logistic Regression, Stochastic Gradient Descent, Deep Learning Neural Network, and Convolutional Neural Network. The best classifiers for all classes are used further in a web-based tool that can classify each question by a problem-subproblem tag and also report the number of problems that users of a database product have posted. This information can benefit the owner of a database product in planning product maintenance and evolution.
- Research Article
72
- 10.1038/s41598-021-94422-y
- Jul 28, 2021
- Scientific Reports
Urban area mapping is an important application of remote sensing which aims at both estimation and change in land cover under the urban area. A major challenge being faced while analyzing Synthetic Aperture Radar (SAR) based remote sensing data is that there is a lot of similarity between highly vegetated urban areas and oriented urban targets with that of actual vegetation. This similarity between some urban areas and vegetation leads to misclassification of the urban area into forest cover. The present work is a precursor study for the dual-frequency L and S-band NASA-ISRO Synthetic Aperture Radar (NISAR) mission and aims at minimizing the misclassification of such highly vegetated and oriented urban targets into vegetation class with the help of deep learning. In this study, three machine learning algorithms Random Forest (RF), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM) have been implemented along with a deep learning model DeepLabv3+ for semantic segmentation of Polarimetric SAR (PolSAR) data. It is a general perception that a large dataset is required for the successful implementation of any deep learning model but in the field of SAR based remote sensing, a major issue is the unavailability of a large benchmark labeled dataset for the implementation of deep learning algorithms from scratch. In current work, it has been shown that a pre-trained deep learning model DeepLabv3+ outperforms the machine learning algorithms for land use and land cover (LULC) classification task even with a small dataset using transfer learning. The highest pixel accuracy of 87.78% and overall pixel accuracy of 85.65% have been achieved with DeepLabv3+ and Random Forest performs best among the machine learning algorithms with overall pixel accuracy of 77.91% while SVM and KNN trail with an overall accuracy of 77.01% and 76.47% respectively. 
The highest precision of 0.9228 is recorded for the urban class for semantic segmentation task with DeepLabv3+ while machine learning algorithms SVM and RF gave comparable results with a precision of 0.8977 and 0.8958 respectively.
- Research Article
68
- 10.3390/app10155075
- Jul 23, 2020
- Applied Sciences
Machine learning algorithms are crucial for crop identification and mapping. However, many works only focus on the identification results of these algorithms, but pay less attention to their classification performance and mechanism. In this paper, based on Google Earth Engine (GEE), Sentinel-2 10 m resolution images during a specific phenological period of winter wheat were obtained. Then, support vector machine (SVM), random forest (RF), and classification and regression tree (CART) machine learning algorithms were employed to identify and map winter wheat in a large-scale area. The hyperparameters of the three machine learning algorithms were tuned by grid search and the 5-fold cross-validation method. The classification performance of the three machine learning algorithms was compared, and the results demonstrate that SVM achieves the best performance in identifying winter wheat, and its overall accuracy (OA), user’s accuracy (UA), producer’s accuracy (PA), and kappa coefficient (Kappa) are 0.94, 0.95, 0.95, and 0.92, respectively. Moreover, 50 various combinations of training and validation sets were used to analyze the generalization ability of the algorithms, and the results show that the average OA of SVM, RF, and CART are 0.93, 0.92, and 0.88, respectively, thus indicating that SVM and RF are more robust than CART. To further explore the sensitivity of SVM, RF, and CART to variations of the algorithm parameters, namely (C and gamma), (tree and split), and (maxD and minSP), we employed the grid search method to iterate these parameters, respectively, and to analyze the effect of these parameters on the accuracy scores and classification residuals. It was found that with the change of (C and gamma) in (0.01~1000), SVM’s maximum variation of accuracy score is up to 0.63, and the maximum variation of residuals is 76,215 km². We concluded that SVM is sensitive to the parameters (C and gamma) and presents a positive correlation.
When the parameters (tree and split) change between (100~600) and (1~6), respectively, the RF’s maximum variation of accuracy score is 0.08, and the maximum variation of residuals is 1157 km², indicating that RF is low in sensitivity toward the parameters (tree and split). When the parameters (maxD and minSP) are between (10~60), the maximum accuracy change value is 0.06, and the maximum variation of residuals is 6943 km². Therefore, compared to RF, CART is sensitive to the parameters (maxD and minSP) and has poor robustness. In general, under the conditions of the hyperparameters, SVM and RF exhibit optimal classification performance, while CART has relatively inferior performance. Meanwhile, SVM, RF, and CART have different sensitivities toward the algorithm parameters; that is, SVM and CART are more sensitive to the algorithm parameters, while RF has low sensitivity toward changes in the algorithm parameters. The different parameters cause great changes in the accuracy scores and residuals, so it is necessary to determine the algorithm hyperparameters. Generally, default parameters can be used to achieve crop classification, but we recommend the enumeration method, similar to grid search, as a practical way to improve the classification performance of the algorithm if the best classification effect is expected.
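The enumeration the abstract recommends, grid search scored by 5-fold cross-validation, can be sketched in plain Python as below. This is an illustrative sketch, not the authors' GEE workflow: `k_fold_indices`, `grid_search`, and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a stand-in for fitting a classifier on the training fold and scoring it on the validation fold.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (training, validation) folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((training, validation))
        start = stop
    return folds


def grid_search(param_grid, evaluate, n_samples, k=5):
    """Try every parameter combination; return the one with the best mean fold score.

    evaluate(params, train_idx, val_idx) is supplied by the caller and returns
    a score where higher is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_evaluate(params, train_idx, val_idx):
    # Stand-in score that peaks at C=10, gamma=0.1; a real evaluate would
    # train a model on train_idx and measure accuracy on val_idx.
    return -abs(params["C"] - 10) - abs(params["gamma"] - 0.1)


grid = {"C": [0.01, 1, 10, 1000], "gamma": [0.01, 0.1, 1]}
best_params, best_score = grid_search(grid, toy_evaluate, n_samples=50)
```

The cost grows with the product of the grid sizes times the number of folds, which is why the grid is usually kept coarse for parameters a model is insensitive to.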
- Conference Article
3
- 10.5753/ise.2023.235840
- Sep 26, 2023
In today’s fast-paced software industry, understanding and managing Technical Debt (TD) is crucial for software development. TD can compromise the long-term quality of software systems. The occurrence of TD is commonly reported and discussed by practitioners on Question and Answer (Q&A) platforms, such as Stack Overflow (SO). Data from Q&A platforms has been leveraged by the TD research community, most prominently regarding knowledge extraction. However, manual analyses of such data not only require considerable effort but also suffer from biases. Hence, this paper aims to propose an automated approach for identifying and classifying types of TD in SO discussions using machine learning (ML) and natural language processing. We divided our methodology into four main steps: i) data preprocessing, ii) application of natural language processing, iii) application of ML algorithms, and iv) computing the evaluation metrics for the proposed models. Our results indicate that ML algorithms have the potential to be successfully applied to automatically identify and classify TD types in SO discussions. We achieved a recall of 85% for test debt and a precision of 78% for design debt. Furthermore, the results of automated TD identification on SO benefit the software development community by enhancing solution quality, raising awareness of best practices, and facilitating collaboration among developers. This leads to more efficient development and the promotion of consistent standards. We make our entire dataset and pre-trained models available to encourage future research directions.
- Conference Article
- 10.1109/eeccis49483.2020.9263431
- Aug 26, 2020
A collection of constraints on Stack Overflow can be used as material for evaluating software quality, so that developers can improve the software by means of text mining. This study aims to determine the usability aspects of software based on the classification of questions on Stack Overflow, to ease the developers' task of evaluating software quality. The research proceeds in several steps: first, the data are preprocessed; the preprocessed data are then classified to obtain the data that reflect the usability attribute; finally, the inverse document frequency is calculated for the data that include usability attributes, and the terms are sorted to obtain the 20 highest-scoring terms, which are taken as aspects of the usability attribute. The study succeeded in extracting 20 usability aspects from the questions, using a classification process that compared five classification methods: Naive Bayes, Support Vector Machine, Neural Networks, Logistic Regression, and Random Forest. The best accuracy, 70 percent, was obtained with the Naive Bayes method, with a precision of 71 percent and a recall of 67 percent for the usability class. In the non-usability class, precision and recall are 70 percent and 74 percent, respectively.
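The last step of that pipeline, scoring terms by inverse document frequency and keeping the highest-scoring ones, might look like the sketch below. It assumes whitespace tokenization and the common variant idf(t) = log(N / df(t)); the abstract does not say which tokenizer or IDF formula the study used, and `top_idf_terms` is a hypothetical name.

```python
import math
from collections import Counter


def top_idf_terms(documents, k=20):
    """Return the k terms with the highest inverse document frequency.

    Uses idf(t) = log(N / df(t)), where N is the number of documents and
    df(t) is the number of documents containing t. Ties are broken
    alphabetically so the result is deterministic.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # set() so a term counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return sorted(idf.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy documents standing in for questions classified as usability-related.
docs = ["slow menu", "slow login", "slow menu crash"]
top_two = top_idf_terms(docs, k=2)
```

Under this formula a term that occurs in every document scores zero, so the ranking favors terms that distinguish a few questions from the rest.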
- Research Article
95
- 10.1038/s41598-023-40564-0
- Aug 19, 2023
- Scientific Reports
Accurate spatial information on land use and land cover (LULC) plays a crucial role in city planning. A widely used method of obtaining accurate LULC maps is classification of the categories, which is a challenging problem. Attempts have been made considering spectral (Sp), statistical (St), and index-based (Ind) features in developing LULC maps for city planning. However, no work has been reported to automate LULC performance modeling for their robustness with machine learning (ML) algorithms. In this paper, we design seven schemes and automate the LULC performance modeling with six ML algorithms (Random Forest, Support Vector Machine with Linear kernel, Support Vector Machine with Radial basis function kernel, Artificial Neural Network, Naïve Bayes, and Generalised Linear Model) for the city of Melbourne, Australia on Sentinel-2A images. Experimental results show that the Random Forest outperforms the remaining ML algorithms in classification accuracy (0.99) on all schemes. The robustness and statistical analysis of the ML algorithms (for example, Random Forest imparts over 0.99 F1-score for all five categories and p value ≤ 0.05 from Wilcoxon ranked test over accuracy measures) against varying training splits demonstrate the effectiveness of the proposed schemes, thus providing a robust measure of LULC maps in city planning.
- Research Article
13
- 10.1007/s10021-024-00928-7
- Sep 9, 2024
- Ecosystems
Andean highland soils contain significant quantities of soil organic carbon (SOC); however, more efforts still need to be made to understand the processes behind the accumulation and persistence of SOC and its fractions. This study modeled SOC variables (SOC, refractory SOC (RSOC), and the ¹³C isotope composition of SOC (δ¹³C_SOC)) using machine learning (ML) algorithms in the Central Andean Highlands of Peru, where grasslands and wetlands (“bofedales”) dominate the landscape surrounded by Junin National Reserve. A total of 198 soil samples (0.3 m depth) were collected to assess SOC variables. Four ML algorithms, namely random forest (RF), support vector machine (SVM), artificial neural networks (ANNs), and eXtreme gradient boosting (XGB), were used to model SOC variables using remote sensing data, land-use and land-cover (LULC, nine categories), climate, topography, and sampled physical–chemical soil variables. RF was the best algorithm for SOC and δ¹³C_SOC prediction, whereas ANN was the best to model RSOC. “Bofedales” showed 2–3 times greater SOC (11.2 ± 1.60%) and RSOC (1.10 ± 0.23%) and more depleted δ¹³C_SOC (−27.0 ± 0.44‰) than other LULC, which reflects high C persistence, turnover rates, and plant productivity. This highlights the importance of “bofedales” as SOC reservoirs. LULC and vegetation indices close to the near-infrared bands were the most critical environmental predictors to model the C variables SOC and δ¹³C_SOC. In contrast, climatic indices were more important environmental predictors for RSOC. This study’s outcomes suggest the potential of ML methods, with a particular emphasis on RF, for mapping SOC and its fractions in the Andean highlands.
- Research Article
4
- 10.1080/03772063.2023.2192000
- Apr 12, 2023
- IETE Journal of Research
As the world’s population grows, the agricultural sectors are destined to increase crop production and security. Improving crop yield by using advanced technologies gives remarkable growth to the economy of the country. Agriculture provides over 20% of India's GDP. Using machine learning algorithms, the crop yield can be predicted which is useful to the farmers to plan the cultivation beforehand. In this work, various machine learning (ML) algorithms are applied to predict the yield of ‘rice and sorghum (jowar)’ and a novel weighted feature approach with a combination of Support Vector Machine (SVM) and Random Forest (RF) is proposed for two Indian seasons. RF is used to select training data at random, and the learning rate approach from deep learning concepts is implemented to add random weights to each parameter; the SVM model is then trained using the weighted training data. The best weights are again applied for the whole data to implement the SVM and RF algorithms. The weighted feature hybrid model is compared with SVM, RF, Decision tree, Naive Bayes, and k-Nearest Neighbor algorithms. RF-based regression method is also implemented and its ability to predict the crop yield has been discussed based on its performance metrics. The results show that the proposed weighted feature hybrid SVM-RF model gives the best accuracy of 90% when compared with the traditional algorithms. Also, the performances of various ML algorithms for crop yield prediction are analysed and cross-validation of the models is performed and compared, which improved the accuracy by 8-10%.
- Research Article
102
- 10.3390/rs11161927
- Aug 17, 2019
- Remote Sensing
Wetlands are one of the world’s most important ecosystems, playing an important role in regulating climate and protecting the environment. However, human activities have changed the land cover of wetlands, leading to direct destruction of the environment. If wetlands are to be protected, their land cover must be classified and changes to it monitored using remote sensing technology. The random forest (RF) machine learning algorithm, which offers clear advantages (e.g., processing feature data without feature selection and preferable classification results) for high spatial resolution image classification, has been used in many study areas. In this research, to verify the effectiveness of this algorithm for remote sensing image classification of coastal wetlands, images of the Linhong Estuary wetland in Lianyungang at two spatial resolutions (Worldview-2 and Landsat-8) were used for land cover classification using the RF method. To demonstrate the preferable classification accuracy of the RF algorithm, the support vector machine (SVM) and k-nearest neighbor (k-NN) methods were also used to classify the same area of land cover for comparison with the results of RF classification. The study results showed that (1) the overall accuracy of the RF method reached 91.86%, higher than the SVM and k-NN methods by 4.68% and 4.72%, respectively, for Worldview-2 images; (2) at the same time, the classification accuracies of RF, SVM, and k-NN were 86.61%, 79.96%, and 77.23%, respectively, for Landsat-8 images; (3) for some land cover types having only a small number of samples, the RF algorithm also achieved better classification results using Worldview-2 and Landsat-8 images; and (4) the addition of texture features could improve the classification accuracy of the RF method when using Worldview-2 images.
The research indicated that high-resolution remote sensing images are more suitable for small-scale land cover classification and that the RF algorithm can provide better classification accuracy and is more suitable for coastal wetland classification than the SVM and k-NN algorithms are.
- Conference Article
19
- 10.1109/msr.2019.00047
- May 1, 2019
Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers reuse code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine the SOTorrent dataset, which provides connectivity between code snippets in SO posts and software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and what popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least once), raising code maintainability and reliability concerns.
- Conference Article
3
- 10.1109/icsess47205.2019.9040720
- Oct 1, 2019
Context: The developments in requirements engineering (RE) methods over the last decade or two have seen a rise in the use of different machine-learning (ML) algorithms to solve some complex RE problems. One such problem is identifying and classifying software requirements on Stack Overflow (SO). ML-based techniques applied to this problem have shown convincing results, much better than those generated by some traditional natural language processing (NLP) techniques. Nevertheless, a comprehensive and systematic understanding of these ML-based techniques is still lacking. Objective: To identify and classify the type of ML algorithms used for identifying software requirements on SO. Method: This article reports a systematic literature review (SLR) gathering evidence published up to August 2019. Results: This study identified 1073 published papers related to RE and SO. Only 12 primary papers were selected. The data extraction process revealed that 1) Latent Dirichlet Allocation (LDA) topic modeling is the most widely used ML algorithm in the selected studies, and 2) precision and recall are the most commonly used evaluation measures for the performance of these ML algorithms. Conclusion: The SLR finds that while ML algorithms have great potential in the identification of RE on SO, they face some open issues that will ultimately affect their performance and practical application. The SLR calls for collaboration between RE and ML researchers to tackle the open issues facing the development of real-world ML systems.
- Research Article
8
- 10.1016/j.heliyon.2023.e20242
- Sep 1, 2023
- Heliyon
Sonic logs are essential for determining important reservoir properties such as porosity, permeability, lithology, and elastic properties, among others, and yet may be missing in some well logging suites due to high acquisition costs, borehole washout, tool damage, poor tool calibration, or faulty logging instruments. This study aims at predicting the compressional sonic log from commonly acquired logs (gamma ray, resistivity, density, and neutron-porosity) in the Tano basin of Ghana using Support Vector Machines (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) Machine Learning (ML) algorithms and comparing the performances of the algorithms. The algorithms were trained with 70% of the data from two wells and tested using the remaining 30% of the data from the wells after cross-validation. Subsequently, they were applied to the data from a third well to predict the sonic log in the well. The performances of the algorithms were assessed with five statistical tools: coefficient of determination (R²), adjusted R², Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). All three algorithms successfully predicted the compressional sonic log (DT). XGBoost demonstrated the highest prediction accuracy, with R² of 0.9068 and the least errors. RF exhibited the next highest accuracy, with R² being 0.85478, while SVM had R² of 0.66591. Therefore, the ensemble algorithms (XGBoost and RF) proved to be more accurate than the non-ensemble algorithm (SVM) in this study. The outcome of the study will accelerate and enhance the understanding of oil and gas fields with few or no compressional sonic logs. To the best of the authors’ knowledge, this is the first study to have predicted the compressional sonic log from well data (logs) in a Ghanaian sedimentary basin using machine learning algorithms, and only a few such studies have been conducted in the whole West African sub-region.
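The five statistics named above follow from their standard definitions, as in the minimal sketch below. The function name `regression_metrics` and the toy numbers are illustrative only, not values from the study; `n_features` is the number of predictor logs (four in the study: gamma ray, resistivity, density, and neutron-porosity).

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute R2, adjusted R2, MSE, MAE, and RMSE from their standard definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of predictors used.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "r2": r2,
        "adjusted_r2": adjusted_r2,
        "mse": mse,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(mse),
    }


# Toy values, chosen so the results are easy to check by hand.
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5], n_features=1)
```

Adjusted R² requires more samples than predictors plus one; otherwise the denominator is zero or negative and the statistic is not meaningful.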
- Research Article
3
- 10.3389/fpubh.2022.1031147
- Nov 17, 2022
- Frontiers in Public Health
Objective: Tracking global health funding is a crucial but time-consuming and labor-intensive process. This study aimed to develop a framework to automate the tracking of global health spending using natural language processing (NLP) and machine learning (ML) algorithms. We used the global common goods for health (CGH) categories developed by Schäferhoff et al. to design and evaluate ML models. Methods: We used data curated by Schäferhoff et al., which tracked the official development assistance (ODA) disbursements to global CGH for 2013, 2015, and 2017, for training and validating the ML models. To process raw text, we implemented different NLP techniques, such as removing stop words, lemmatization, and creation of synthetic text, to balance the dataset. We used four supervised learning ML algorithms, namely random forest (RF), XGBOOST, support vector machine (SVM), and multinomial naïve Bayes (MNB) (see Glossary), to train and test the pre-coded dataset, and applied the best model to a dataset that had not been manually coded to predict the financing for CGH in 2019. Results: After we trained the machine on the training dataset (n = 10,534), the weighted average F1-scores (a measure of an ML model's performance) on the testing dataset (n = 2,634) ranged from 0.79 to 0.83 among the four models, and the RF model had the best performance (F1-score = 0.83). The predicted total donor support for CGH projects by the RF model was $2.24 billion across 3 years, which was very close to the finding of $2.25 billion derived from coding and classification by humans. By applying the trained RF model to the 2019 dataset, we predicted that the total funding for global CGH was about $2.7 billion for 730 CGH projects. Conclusion: We have demonstrated that NLP and ML can be a feasible and efficient way to classify health projects into different global CGH categories, and thus track health funding for CGH routinely using data from publicly available databases.
- Research Article
3
- 10.3389/fmed.2025.1554579
- May 19, 2025
- Frontiers in Medicine
Background: Gastrointestinal bleeding (GIB) is a common complication following Type A aortic dissection (TAAD) surgery, significantly impacting prognosis and increasing mortality risk. This study developed and validated a predictive model based on machine learning (ML) algorithms to enable early and precise assessment of postoperative GIB risk in TAAD patients. Methods: Medical records of patients who underwent TAAD surgery at Shanxi Bethune Hospital from January 2019 to September 2024 were retrospectively collected. Predictors were screened using LASSO regression, and four ML algorithms, namely Random Forest (RF), K-nearest neighbor (KNN), Support Vector Machines (SVM), and Decision Tree (DT), were employed to construct models for predicting postoperative GIB risk. The dataset was divided into training and validation sets in a 7:3 ratio. Predictive performance was evaluated and compared using Receiver Operating Characteristic (ROC) curves and DeLong tests. Calibration curves and decision curve analysis (DCA) were used to assess model calibration and clinical utility. The SHapley Additive exPlanation (SHAP) algorithm was applied for interpretability analysis. This study adhered to the “Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + Artificial Intelligence (TRIPOD+AI) guidelines.” Results: A total of 525 TAAD patients were included, with 63 (12%) developing GIB. Nine predictors were selected via LASSO regression for model construction. The RF model outperformed the SVM, KNN, and DT models in predicting postoperative GIB, with areas under the ROC curve (AUC) of 0.933, 0.892, 0.902, and 0.768, respectively, showing statistically significant differences (DeLong test, P < 0.05). Calibration curves and DCA further confirmed the RF model’s excellent calibration and clinical utility.
SHAP analysis identified the three most influential clinical features on the RF model’s output: duration of mechanical ventilation (MV), time to aortic occlusion, and red blood cell (RBC) transfusion. Conclusion: The machine learning-based predictive model effectively assesses postoperative GIB risk in TAAD patients, aiding healthcare providers in early identification of risk factors and implementation of targeted preventive strategies.
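The AUC values compared above can be read as the probability that a randomly chosen positive case (here, a patient who developed GIB) receives a higher predicted risk than a randomly chosen negative case. The sketch below computes AUC directly from that pairwise definition; it is an illustration with made-up scores, not the study's evaluation code, and `roc_auc` is a hypothetical name.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are 1 (event) or 0 (no event); scores are predicted risks.
    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and risk scores for illustration.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

The pairwise form is quadratic in the number of cases, which is fine for a cohort of a few hundred patients; rank-based formulas give the same value more efficiently on larger data.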