Reasoning About Human Values in GitHub Issues: What Can a Large Language Model Reveal?

Abstract

The recent U.S. Senate hearing on child-safety failures at major social media platforms, gender bias in Amazon’s recruitment algorithms, and privacy lawsuits against Google over opaque location tracking all illustrate how software can neglect human values, with societal and economic consequences. Addressing such oversights requires understanding how design choices impact values. GitHub Issues provide a key venue where these choices are documented: through feature requests, improvements, and bug reports, they capture both technical complexities and the tradeoffs reflecting stakeholders’ beliefs. They thus offer insights into how values such as privacy, fairness, and inclusivity influence, and are shaped by, software design. Yet inferring value alignments, which are often implied rather than explicit, demands complex reasoning beyond keyword searches and cannot be done manually at scale. In this paper, we use a Large Language Model (LLM) to infer (detect and explain) the alignment of GitHub Issues with human values across three open-source projects, and evaluate the accuracy of the findings through human evaluation.
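The abstract does not disclose the paper's prompting pipeline, but the detect-and-explain step it describes could be sketched roughly as follows. This is a minimal illustration, not the authors' method: the value list, prompt wording, and `parse_alignment` helper are all assumptions, and the actual LLM call is left abstract (a hand-written JSON reply stands in for a real model response).

```python
import json

# Illustrative subset of human values; the paper's full taxonomy may differ.
VALUES = ["privacy", "fairness", "inclusivity"]

def build_prompt(issue_title: str, issue_body: str) -> str:
    """Compose a detection-and-explanation prompt for one GitHub issue."""
    return (
        "For each value in " + ", ".join(VALUES) + ", decide whether the "
        "GitHub issue below aligns with it, and explain why in one sentence.\n"
        'Reply as a JSON list of {"value": ..., "aligned": true/false, '
        '"explanation": ...} objects.\n\n'
        f"Title: {issue_title}\nBody: {issue_body}"
    )

def parse_alignment(llm_reply: str) -> dict:
    """Map the model's JSON reply to {value: (aligned, explanation)}."""
    records = json.loads(llm_reply)
    return {r["value"]: (r["aligned"], r["explanation"]) for r in records}

# A hand-written reply standing in for a real model call.
reply = json.dumps([
    {"value": "privacy", "aligned": True,
     "explanation": "The issue asks to stop logging user locations by default."},
])
result = parse_alignment(reply)
```

In practice the reply would come from an LLM API, and the parsed explanations are what a human evaluator would then judge for accuracy, as the paper's evaluation does.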

Similar Papers
  • Research Article
  • Cited by 27
  • 10.1287/mnsc.2023.03014
Large Language Model in Creative Work: The Role of Collaboration Modality and User Expertise
  • Oct 15, 2024
  • Management Science
  • Zenan Chen + 1 more

Since the launch of ChatGPT in December 2022, large language models (LLMs) have been rapidly adopted by businesses to assist users in a wide range of open-ended tasks, including creative work. Although the versatility of LLMs has unlocked new ways of human-artificial intelligence collaboration, it remains uncertain how LLMs should be used to enhance business outcomes. To examine the effects of human-LLM collaboration on business outcomes, we conducted an experiment in which we tasked expert and nonexpert users with writing an ad copy with and without the assistance of LLMs. We investigate and compare two ways of working with LLMs: (1) using LLMs as “ghostwriters,” which assume the main role in the content generation task, and (2) using LLMs as “sounding boards” that provide feedback on human-created content. We measure the quality of the ads by the number of clicks the created ads generate on major social media platforms. Our results show that different collaboration modalities can result in very different outcomes for different user types. Using LLMs as sounding boards enhances the quality of the resulting ad copies for nonexperts. However, using LLMs as ghostwriters did not provide significant benefits and was, in fact, detrimental for expert users. We rely on textual analyses to understand the mechanisms, and we find that using LLMs as ghostwriters produces an anchoring effect, which leads to lower-quality ads. On the other hand, using LLMs as sounding boards helped nonexperts produce ad content with low semantic divergence from content produced by experts, thereby closing the gap between the two types of users. This paper was accepted by D. J. Wu, information systems. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.03014.

  • Research Article
  • Cited by 1
  • 10.1145/3678170
Can GitHub Issues Help in App Review Classifications?
  • Nov 23, 2024
  • ACM Transactions on Software Engineering and Methodology
  • Yasaman Abedini + 1 more

App reviews reflect various user requirements that can aid in planning maintenance tasks. Recently proposed approaches for automatically classifying user reviews rely on machine learning algorithms. A previous study demonstrated that models trained on existing labeled datasets exhibit poor performance when predicting on new ones; a comprehensive labeled dataset is therefore essential to train a more precise model. In this paper, we propose a novel approach that helps augment labeled datasets by utilizing information extracted from an additional source, GitHub issues, which contain valuable information about user requirements. First, we identify issues concerning review intentions (bug reports, feature requests, and others) by examining the issue labels. Then, we analyze issue bodies and define 19 language patterns for extracting the targeted information. Finally, we augment the manually labeled review dataset with a subset of processed issues through the Within-App, Within-Context, and Between-App analysis methods. We conducted several experiments to evaluate the proposed approach. Our results demonstrate that using labeled issues for data augmentation can improve the F1-score by 6.3 in bug reports and 7.2 in feature requests. Furthermore, we identify an effective range of 0.3 to 0.7 for the auxiliary volume, which provides better performance improvements.

  • Research Article
  • 10.1145/3736408
Exploring Fine-Grained Bug Report Categorization with Large Language Models and Prompt Engineering: An Empirical Study
  • May 20, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Anil Koyuncu

Accurate classification of issues is essential for effective project management and timely responses, as the volume of issue reports continues to grow. Manual classification is labor-intensive and error-prone, necessitating automated solutions. While large language models (LLMs) show promise in automated issue labeling, most research focuses on broad categorization (e.g., bugs, feature requests), with limited attention to fine-grained categorization. Understanding specific bug types is crucial, as different bugs require tailored resolution strategies. This study addresses this gap by evaluating LLMs and prompt engineering strategies for fine-grained bug report categorization. We analyze 221,184 fine-grained bug report category labels generated by selected LLMs using various prompt engineering strategies for 1,024 bug reports. We examine how LLMs and prompt engineering influence output characteristics, control over outputs, and categorization performance. Our findings highlight that LLMs and prompt engineering significantly impact output consistency and classification capability, with some yielding consistent results and others introducing variability. Based on these findings, we analyze the agreements and disagreements between LLM-generated labels and human annotations to assess category correctness. Our results suggest that examining label consistency and discrepancies can serve as a complementary method for validating bug report categories, identifying unclear reports, and detecting misclassifications in human annotations.

  • Research Article
  • 10.1145/3729356
A Knowledge Enhanced Large Language Model for Bug Localization
  • Jun 19, 2025
  • Proceedings of the ACM on Software Engineering
  • Yue Li + 7 more

A significant number of bug reports are generated every day as software systems continue to develop. Large Language Models (LLMs) have been used to correlate bug reports with source code to locate bugs automatically. Existing research has shown that LLMs are effective for bug localization and can increase software development efficiency. However, these studies still have two limitations. First, these models fail to capture context information about bug reports and source code. Second, these models are unable to understand the domain-specific expertise inherent to particular projects, such as version information composed of alphanumeric characters without any semantic meaning. To address these challenges, we propose a Knowledge-Enhanced Pre-Trained model, called KEPT, that uses project documents and historical code for bug localization. Project documents record, revise, and restate project information, and thus provide rich semantic information about those projects. Historical code contains rich code semantics that can enhance the reasoning ability of LLMs. Specifically, we construct knowledge graphs from project documents and source code. Then, we introduce the knowledge graphs to the LLM through soft-position embedding and visible matrices, enhancing its contextual and professional reasoning ability. To validate our model, we conducted a series of experiments on seven open-source software projects with over 6,000 bug reports. Compared with the traditional model (Locus), KEPT performs better by 33.2% to 59.5% in terms of mean reciprocal rank, mean average precision, and Top@N. Compared with the best-performing non-commercial LLM (CodeT5), KEPT achieves an improvement of 36.6% to 63.7%. Compared to the state-of-the-art commercial embedding model from OpenAI, text-embedding-ada-002, KEPT achieves an average improvement of 7.8% to 17.4%. The results indicate that introducing knowledge graphs helps enhance the effectiveness of the LLM in bug localization.

  • Research Article
  • Cited by 67
  • 10.1016/j.scico.2020.102598
Predicting issue types on GitHub
  • Dec 30, 2020
  • Science of Computer Programming
  • Rafael Kallis + 3 more


  • Conference Article
  • Cited by 15
  • 10.1145/3540250.3558934
ITiger: an automatic issue title generation tool
  • Nov 7, 2022
  • Ting Zhang + 5 more

In both commercial and open-source software, bug reports or issues are used to track bugs and feature requests. However, the quality of issues can differ a lot. Prior research has found that bug reports of good quality tend to gain more attention than those of poor quality. As an essential component of an issue, the title is an important aspect of issue quality. Moreover, issues are usually presented in a list view, where only the issue title and some metadata are shown. In this case, a concise and accurate title is crucial for readers to grasp the general concept of the issue, and it facilitates issue triaging. Previous work formulated issue title generation as a one-sentence summarization task and employed a sequence-to-sequence model to solve it. However, that approach requires a large amount of domain-specific training data to attain good performance. Recently, pre-trained models, which learn knowledge from large-scale general corpora, have shown much success in software engineering tasks. In this work, we make the first attempt to fine-tune BART, which has been pre-trained on English corpora, to generate issue titles. We implemented the fine-tuned BART as a web tool named iTiger, which suggests an issue title based on the issue description. iTiger is fine-tuned on 267,094 GitHub issues. We compared iTiger with the state-of-the-art method, iTAPE, on 33,438 issues. Automatic evaluation shows that iTiger outperforms iTAPE by 29.7%, 50.8%, and 34.1% in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1-scores. Manual evaluation also shows that the titles generated by BART are preferred by evaluators over those generated by iTAPE in 72.7% of cases. Besides, the evaluators deem our tool useful and easy to use, and are interested in using it in the future.

  • Research Article
  • Cited by 69
  • 10.1038/s41598-024-55686-2
Bias of AI-generated content: an examination of news produced by large language models
  • Mar 4, 2024
  • Scientific reports
  • Xiao Fang + 5 more

Large language models (LLMs) have the potential to transform our lives and work through the content they generate, known as AI-Generated Content (AIGC). To harness this transformation, we need to understand the limitations of LLMs. Here, we investigate the bias of AIGC produced by seven representative LLMs, including ChatGPT and LLaMA. We collect news articles from The New York Times and Reuters, both known for their dedication to providing unbiased news. We then apply each examined LLM to generate news content using the headlines of these news articles as prompts, and evaluate the gender and racial biases of the AIGC produced by the LLM by comparing the AIGC and the original news articles. We further analyze the gender bias of each LLM under biased prompts by adding gender-biased messages to prompts constructed from these news headlines. Our study reveals that the AIGC produced by each examined LLM demonstrates substantial gender and racial biases. Moreover, the AIGC generated by each LLM exhibits notable discrimination against females and individuals of the Black race. Among the LLMs, the AIGC generated by ChatGPT demonstrates the lowest level of bias, and ChatGPT is the sole model capable of declining content generation when provided with biased prompts.

  • Research Article
  • Cited by 72
  • 10.1109/tse.2018.2864217
Chaff from the Wheat: Characterizing and Determining Valid Bug Reports
  • Sep 7, 2018
  • IEEE Transactions on Software Engineering
  • Yuanrui Fan + 3 more

Developers use bug reports to triage and fix bugs. When triaging a bug report, developers must decide whether it is valid (i.e., describes a real bug). A large number of bug reports are submitted every day, and many of them end up being invalid. Manually determining whether a bug report is valid is a difficult and tedious task. Thus, an approach that can automatically analyze the validity of a bug report can help developers prioritize their triaging tasks and avoid wasting time and effort on invalid reports. In this study, motivated by the above needs, we propose an approach that determines whether a newly submitted bug report is valid. Our approach first extracts 33 features from bug reports, grouped along five dimensions: reporter experience, collaboration network, completeness, readability, and text. Based on these features, we use a random forest classifier to identify valid bug reports. To evaluate the effectiveness of our approach, we experiment on large-scale datasets containing a total of 560,697 bug reports from five open-source projects (Eclipse, Netbeans, Mozilla, Firefox, and Thunderbird). On average, across the five datasets, our approach achieves an F1-score of 0.74 for valid bug reports and 0.67 for invalid ones, and an average AUC of 0.81. In terms of AUC and F1-scores for valid and invalid bug reports, our approach statistically significantly outperforms two baselines using features proposed by Zanetti et al. We also study which features best distinguish valid bug reports from invalid ones, and find that a report's textual features and the reporter's experience are the most important factors.
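The abstract above does not enumerate its 33 features, but the feature-extraction step can be sketched with a couple of toy completeness and readability signals per report, which a classifier such as a random forest would then consume. The field names and heuristics below are illustrative assumptions, not the paper's actual features.

```python
import re

def extract_features(report: dict) -> dict:
    """Toy versions of two of the paper's feature dimensions: completeness
    (does the report mention reproduction steps or a stack trace?) and
    readability (proxied here by mean sentence length), plus one reporter-
    experience signal. The real approach extracts 33 features across five
    dimensions and feeds them to a random forest classifier."""
    text = report["description"]
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = text.split()
    return {
        "has_steps": int("steps to reproduce" in text.lower()),
        "has_stack_trace": int("traceback" in text.lower()),
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "reporter_prior_reports": report.get("reporter_prior_reports", 0),
    }

feats = extract_features({
    "description": "Crash on save. Steps to reproduce: open file, press Ctrl+S.",
    "reporter_prior_reports": 12,
})
```

A table of such feature dictionaries, one per labeled report, is exactly the input shape a random forest implementation (e.g. scikit-learn's `RandomForestClassifier`) expects after vectorization.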

  • Research Article
  • Cited by 43
  • 10.1007/s10664-018-9643-4
Preventing duplicate bug reports by continuously querying bug reports
  • Aug 20, 2018
  • Empirical Software Engineering
  • Abram Hindle + 1 more

Bug deduplication, or duplicate bug report detection, is a hot topic in software engineering information retrieval research, but it is often not deployed. Typically, to de-duplicate bug reports, developers rely on the search capabilities of the bug-tracking software they employ, such as Bugzilla, Jira, or GitHub Issues. These search capabilities range from simple SQL string search to the IR-based word-indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports; some bug trackers have more than 10% of their bug reports marked as duplicates. Perhaps these bug-tracker search engines are not enough? In this paper we propose a method of preventing duplicate bug reports before they start: continuous querying. That is, as the bug reporter types their bug report, the text is used to query the bug database for duplicate or related reports. Continuous querying alerts the reporter to duplicates as they report the bug, rather than requiring them to formulate queries to find the duplicate afterwards. This work thus ushers in a new way of evaluating bug report deduplication techniques, as well as a new kind of bug deduplication task. We show that simple IR measures can address this problem, but also that further research is needed to refine this novel process so that it can be integrated into modern bug report systems.
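The "simple IR measures" the abstract refers to can be sketched as a plain TF-IDF retrieval loop: on each keystroke (or pause), the partially typed text re-queries the existing reports. This is a minimal stdlib sketch under assumed choices (whitespace tokenization, a smoothed IDF), not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for the corpus of already-filed bug reports."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}        # smoothed IDF
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in tokenized]
    return vecs, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def continuous_query(partial_text, vecs, idf, top_k=3):
    """Re-rank existing reports against whatever the reporter has typed so far."""
    q = {t: c * idf.get(t, 1.0)
         for t, c in Counter(partial_text.lower().split()).items()}
    ranked = sorted(range(len(vecs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return ranked[:top_k]

reports = [
    "crash when saving large file",
    "toolbar icons missing after update",
    "application crashes on save",
]
vecs, idf = tfidf_vectors(reports)
hits = continuous_query("app crash on save", vecs, idf)  # indices, most similar first
```

Running this loop on every edit of the report-entry text box is what turns an ordinary duplicate search into the continuous querying the paper evaluates.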

  • Research Article
  • Cited by 42
  • 10.3390/ijerph17114172
The Advertising Policies of Major Social Media Platforms Overlook the Imperative to Restrict the Exposure of Children and Adolescents to the Promotion of Unhealthy Foods and Beverages.
  • Jun 1, 2020
  • International Journal of Environmental Research and Public Health
  • Gary Sacks + 1 more

There have been global calls to action to protect children (aged <18) from exposure to the marketing of unhealthy foods and beverages (‘unhealthy foods’). In this context, the rising popularity of social media, particularly amongst adolescents, represents an important focus area. This study aimed to examine the advertising policies of major global social media platforms related to the advertising of unhealthy foods, and to identify opportunities for social media platforms to take action. We conducted a desk-based review of the advertising policies of the 16 largest social media platforms globally. We examined their publicly available advertising policies related to food and obesity, as well as in relation to other areas impacting public health. The advertising policies for 12 of the selected social media platforms were located. None of these platforms adopted comprehensive restrictions on the advertising of unhealthy foods, with only two platforms having relevant (but very limited) policies in the area. In comparison, 11 of the 12 social media platforms had policies restricting the advertising of alcohol, tobacco, gambling, and/or weight loss. There is, therefore, an opportunity for major social media platforms to voluntarily restrict the exposure of children to the marketing of unhealthy foods, which can contribute to efforts to improve populations’ diets.

  • Research Article
  • Cited by 3
  • 10.1016/j.knosys.2024.112588
KnowBug: Enhancing Large language models with bug report knowledge for deep learning framework bug prediction
  • Oct 10, 2024
  • Knowledge-Based Systems
  • Chenglong Li + 5 more


  • Preprint Article
  • Cited by 4
  • 10.7287/peerj.preprints.2373v1
Stopping duplicate bug reports before they start with Continuous Querying for bug reports
  • Aug 18, 2016
  • Abram Hindle

Bug deduplication is a hot topic in software engineering information retrieval research, but it is often not deployed. Typically, to de-duplicate bug reports, developers rely on the search capabilities of the bug-tracking software they employ, such as Bugzilla, Jira, or GitHub Issues. These search capabilities range from simple SQL string search to the IR-based word-indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports; some bug trackers have more than 10% of their bug reports marked as duplicates. Perhaps these bug-tracker search engines are not enough? In this paper we propose a method of preventing duplicate bug reports before they start: continuous querying. That is, as the bug reporter types their bug report, the text is used to query the bug database for duplicate or related reports. Continuous querying alerts the reporter to duplicates as they report the bug, rather than requiring them to formulate queries to find the duplicate afterwards. This work thus ushers in a new way of evaluating bug report deduplication techniques, as well as a new kind of bug deduplication task. We show that simple IR measures hold some promise for addressing this problem, but also that further research is needed to refine this novel process so that it can be integrated into modern bug report systems.

  • Research Article
  • Cited by 1
  • 10.1007/s10676-024-09818-x
Possibilities and challenges in the moral growth of large language models: a philosophical perspective
  • Dec 20, 2024
  • Ethics and Information Technology
  • Guoyu Wang + 9 more

With the rapid expansion of parameters in large language models (LLMs) and the application of Reinforcement Learning with Human Feedback (RLHF), there has been a noticeable growth in the moral competence of LLMs. However, several questions warrant further exploration: Is it really possible for LLMs to fully align with human values through RLHF? How can the current moral growth be philosophically contextualized? We identify similarities between LLMs’ moral growth and Deweyan ethics in terms of the discourse of human moral development. We then attempt to use Dewey’s theory on an experimental basis to examine and further explain the extent to which the current alignment pathway enables the development of LLMs. A beating experiment serves as the foundational case for analyzing LLMs’ moral competence across various parameters and stages, including basic moral cognition, moral dilemma judgment, and moral behavior. The results demonstrate that the moral competence of the GPT series has seen a significant improvement, and Dewey’s Impulse-Habit-Character theory of moral development can be used to explain this: the moral competence of LLMs has been enhanced through experience-based learning, supported by human feedback. Nevertheless, LLMs’ moral development through RLHF remains constrained and does not reach the character stage described by Dewey, possibly due to their lack of self-consciousness. This fundamental difference between humans and LLMs underscores both the limitations of LLMs’ moral growth and the challenges of applying RLHF for AI alignment. It also emphasizes the need for external societal governance and legal regulation.

  • Research Article
  • Cited by 1
  • 10.1609/aies.v8i1.36578
Ethical Classification of Non-Coding Contributions in Open-Source Projects via Large Language Models
  • Oct 15, 2025
  • Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
  • Sergio Cobos + 1 more

The development of Open-Source Software (OSS) is not only a technical challenge but also a social one, due to the diverse mixture of contributors. Social-coding platforms such as GitHub provide the infrastructure needed to host and develop the code, but they also support the community's collaboration, which is driven by non-coding contributions such as issues (i.e., change proposals or bug reports) or comments on existing contributions. As with any other social endeavor, this development process faces ethical challenges, which may put the project's sustainability at risk. To foster a productive and positive environment, OSS projects are increasingly deploying codes of conduct, which define rules to ensure a respectful and inclusive participatory environment, with the Contributor Covenant being the main model to follow. However, monitoring and enforcing these codes of conduct is a challenging task, due to the limitations of current approaches. In this paper, we propose an approach to classify the ethical quality of non-coding contributions in OSS projects by relying on Large Language Models (LLMs), a promising technology for text classification tasks. We define a set of ethical metrics based on the Contributor Covenant and develop a classification approach to assess ethical behavior in OSS non-coding contributions, using prompt engineering to guide the model's output.

  • Research Article
  • 10.4233/uuid:5890f1cb-2a90-4bfa-83ba-81b602dca0d5
Quality of Just-in-Time Requirements : Just-Enough and Just-in-Time
  • Apr 22, 2016
  • Petra Heck

The goal of this thesis was to obtain a deeper understanding of the notion of quality for Just-in-Time (JIT) requirements. JIT requirements are the opposite of up-front requirements: they are not analyzed or defined until they are needed, meaning that development is allowed to begin with incomplete requirements. We started our analysis by investigating one specific format of JIT requirements: open-source feature requests. We discovered that open-source projects suffer from many duplicate feature requests, and we identified seven categories of duplicates. Analyzing the duplicates also led to recommendations for the manual search and creation of feature requests, and we indicated possible tool support for avoiding duplicates. One possibility for tool support is to visualize so-called feature request networks, for which one needs the links between feature requests. We show that it is possible to detect horizontal traceability links between feature requests using a Vector Space Model with TF-IDF as a weighting scheme, and we determined the optimal preprocessing steps for the feature requests used in our text-based analysis. Using a more advanced technique such as Latent Semantic Analysis takes significantly more processing time without yielding better results in the three open-source projects included in our experiment. We then took a step back to look at quality criteria for JIT requirements in general. We developed a framework for those quality criteria and constructed a specific list of quality criteria for open-source feature requests, using agile user stories to indicate how the framework could be customized for other types of JIT requirements. We conducted interviews with agile practitioners to evaluate our framework. After their positive feedback, we conducted a case study in three open-source projects in which we used the framework to score the quality of feature requests. This case study also led to recommendations for practitioners working with feature requests. We conclude the thesis with a broader perspective on JIT requirements quality by presenting the results of a systematic literature review on quality criteria for agile requirements. This review resulted in a list of 28 quality criteria for JIT requirements, recommendations for practitioners working on quality assessment of agile requirements, and a research agenda on the quality of agile requirements. To conclude, we claim that the quality of Just-in-Time requirements can be characterized as ‘Just-Enough and Just-in-Time Quality’. Our framework can be used to define what Just-Enough and Just-in-Time mean for a specific JIT environment.
