Exploring large language models’ responses to moral reasoning dilemmas

Abstract

This study investigates how various large language models (LLMs) generate responses to moral reasoning dilemmas. It specifically examines LLM-generated responses using the Defining Issues Test (DIT-2), which measures abstract moral reasoning schemas, and the Intermediate Concepts Measure (ICM, Educational Leaders' version), which assesses domain-specific professional moral reasoning. On the DIT-2, Claude showed the strongest prioritization of post-conventional moral reasoning, followed by Gemini Advanced and Gemini. On the ICM Educational Leaders' version, Gemini Advanced earned the highest total ICM score, followed by Claude 3.5 Sonnet and Gemini. The findings indicate that some LLMs can generate responses consistent with sophisticated moral reasoning patterns, producing scores comparable to or exceeding those of graduate-level human participants; however, no direct comparisons with human participants were made in this study. The study provides a methodological framework for guiding larger-scale research into AI-generated and human moral reasoning patterns.

Similar Papers
  • Research Article
  • 10.1609/aies.v8i3.36727
Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory
  • Oct 15, 2025
  • Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
  • Nicole Smith-Vaniz + 4 more

Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about how LLMs respond in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of LLM responses with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in LLM responses, nor has any connected LLM outputs with robust human data. In this paper we directly analyze the distinctions between LLM MFT responses and existing human research, investigating whether commonly available LLM responses demonstrate ideological leanings, whether through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

  • Research Article
  • Cite count: 4
  • 10.1073/pnas.2412015122
Large language models show amplified cognitive biases in moral decision-making
  • Jun 20, 2025
  • Proceedings of the National Academy of Sciences
  • Vanessa Cheung + 2 more

As large language models (LLMs) become more widely used, people increasingly rely on them to make or advise on moral decisions. Some researchers even propose using LLMs as participants in psychology experiments. It is, therefore, important to understand how well LLMs make moral decisions and how they compare to humans. We investigated these questions by asking a range of LLMs to emulate or advise on people's decisions in realistic moral dilemmas. In Study 1, we compared LLM responses to those of a representative U.S. sample (N = 285) for 22 dilemmas, including both collective action problems that pitted self-interest against the greater good, and moral dilemmas that pitted utilitarian cost-benefit reasoning against deontological rules. In collective action problems, LLMs were more altruistic than participants. In moral dilemmas, LLMs exhibited stronger omission bias than participants: They usually endorsed inaction over action. In Study 2 (N = 474, preregistered), we replicated this omission bias and documented an additional bias: Unlike humans, most LLMs were biased toward answering "no" in moral dilemmas, thus flipping their decision/advice depending on how the question is worded. In Study 3 (N = 491, preregistered), we replicated these biases in LLMs using everyday moral dilemmas adapted from forum posts on Reddit. In Study 4, we investigated the sources of these biases by comparing models with and without fine-tuning, showing that they likely arise from fine-tuning models for chatbot applications. Our findings suggest that uncritical reliance on LLMs' moral decisions and advice could amplify human biases and introduce potentially problematic biases.

  • Research Article
  • Cite count: 56
  • 10.1111/j.1365-2929.2006.02391.x
Students' moral reasoning, Machiavellianism and socially desirable responding: implications for teaching ethics and research integrity
  • Feb 17, 2006
  • Medical Education
  • Darko Hren + 5 more

To investigate the relationship between psychological constructs related to professional and research integrity and moral reasoning among medical students. Second-year medical students (n = 208, 85.6% of 243 enrolled students) completed the Defining Issues Test 2 (DIT2), a measure of moral reasoning, along with the Machiavellianism and Paulhus socially desirable responding (SDR) scales. Students had the highest score on the post-conventional schema of moral reasoning (mean +/- standard deviation, 35.2 +/- 11.6 of a possible 95) compared with the personal interest (27.2 +/- 12.3) and maintaining norms schemas (29.2 +/- 11.5; P < 0.001, repeated-measures ANOVA). Female students scored higher than their male colleagues on post-conventional moral reasoning (37.6 +/- 11.0 versus 31.2 +/- 22.4, P < 0.001, independent-sample t-test). Of the 4 Machiavellianism subscales, students scored highest on deceiving, where female students scored higher than their male colleagues (24.5 +/- 4.2 versus 22.9 +/- 5.1 of a possible 30; P = 0.037, independent-sample t-test). Female students also scored higher on the impression management subscale, whereas their male colleagues scored higher on the self-deception subscale of the Paulhus SDR scale. Moral reasoning scores were associated with the cynicism, deceiving and flattering Machiavellianism scores, but not with Paulhus SDR scores. Multiple regression analysis showed the Machiavellianism amorality score as a significant negative predictor (beta = -0.183, P = 0.017) and female sex as a positive predictor (beta = 0.291, P < 0.001) of the post-conventional schema score on the DIT2. The Machiavellianism flattering score was a significant negative predictor of the personal interest schema score (beta = -0.215, P = 0.006). Although moral reasoning is generally seen as independent of variables related to personality, our study indicated that Machiavellianism, especially its amorality and flattering subscales, was associated with moral reasoning. These results have important implications for teaching ethics and the responsible conduct of research in different cultural and socio-economic settings.

  • Research Article
  • Cite count: 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • Cite count: 2
  • 10.1080/23736992.2025.2553146
Insights into Moral Reasoning of AI: A Comparative Study Between Humans and Large Language Models
  • Sep 10, 2025
  • Journal of Media Ethics
  • Srajal Bajpai + 2 more

This study investigates the moral reasoning capabilities of large language models (LLMs), focusing on biases and the extent to which outputs reflect training data patterns rather than genuine reasoning. Using the Moral Competence Test (MCT) and the Moral Foundations Questionnaire (MFQ), we compared responses from human participants and LLM-based chatbots like ChatGPT. MCT results show that humans consistently outperform LLMs, indicating higher moral competence. MFQ responses from LLMs emphasize harm/care and fairness/reciprocity, but under-represent loyalty, authority, and purity. This pattern suggests a data-proportionality effect, where moral emphasis mirrors the prevalence of certain values in training data. Additionally, fine-tuning methods such as reinforcement learning with human feedback may amplify specific moral norms. These imbalances could unintentionally shape users’ moral intuitions and societal norms when LLMs are widely deployed. Our findings underscore the need for continuous auditing and alignment to ensure that LLMs provide ethically balanced and socially responsible guidance in morally sensitive applications.

  • Dissertation
  • 10.22371/07.1991.008
The Moral Reasoning of Nurse Practitioners
  • Oct 22, 2021
  • Diane Viens

The purpose of this phenomenological study was to identify the moral dilemmas experienced by nurse practitioners (NPs) in their clinical practice and to describe the essential features of moral reasoning utilized by the nurse practitioners to resolve the moral dilemma. The participants in the study were ten female volunteers who were currently employed as NPs in a variety of settings. Unstructured interviews were conducted with the participants, and the qualitative data were analyzed using a nine-step process. Five essential features of moral reasoning emerged through the process of data analysis: values, elements in the contextual framework for moral reasoning, influencing factors, recognizing the dilemma, and outcomes. The first essential feature, values, comprised those ideals which motivated the participants in making decisions amid competing choices in any given situation. The next essential feature, elements in the contextual framework for moral reasoning, described the environment in which the NP practiced, including other persons within that setting. These elements also described the nurse practitioner role, referring not only to the activities the NP performed but also to the nurse-patient relationship. Influencing factors were those elements that changed the everyday clinical practice of the NP into a moral dilemma; they affected the setting and the participants within it, and were the factors taken under consideration in the decision-making process. One or more of these influencing factors were catalytic in motivating the practitioner to make a decision about the dilemma. The catalysts emerged because of certain values held in high esteem by the participants. Two patterns of moral reasoning were identified: independent and lateral reasoning. Nurse practitioners who utilized the independent pattern based their decision making on self-chosen values regardless of other influences present in the situation. Lateral reasoning was a mode of reasoning in which the individual chose to defer the decision to others in the environment. The implications for nursing practice, education, and research based on the findings of this study are discussed. Recommendations are proposed, including further research into the essential features of moral reasoning to determine whether the findings of this study can be generalized to other nurses. It is hoped that research studies such as this will advance the knowledge of nursing and other disciplines concerning moral reasoning and ethics.

  • Research Article
  • Cite count: 20
  • 10.1177/105382590202500205
The Influence of Challenge Course Participation on Moral and Ethical Reasoning
  • Jun 1, 2002
  • Journal of Experiential Education
  • Carol A Smith + 2 more

This study investigated the impact of a 15-week outdoor experiential program on the moral reasoning of college students. One hundred and ninety-six university students volunteered to participate in this study, which utilized Rest's (1979) Defining Issues Test (DIT). The DIT investigates how individuals arrive at moral decisions and formulates a "P" (Principled moral reasoning) score for each subject. The groups were found to be homogeneous in moral reasoning at the pretest (outdoor experiential x = 36.07; control x = 33.08; F = 0.05). There was a statistically significant difference in the posttest scores of the outdoor experiential program participants (x = 40.98) relative to the control group (x = 34.14) (F = 3.84), demonstrating that the outdoor experiential program participants differed significantly from the control group at posttest. It is postulated that even though improved moral reasoning was not a stated objective, the outdoor experiential students, through front-loading, reflection, critical thinking, problem solving, and adherence to the full value contract, did enhance their level of moral reasoning. Through the combined modeling of behavior and discussion, changes in behavior can occur. The nature of outdoor experiential programs seems well suited to positively influence moral and ethical reasoning.

  • Book Chapter
  • 10.4324/9780203122594-15
Moral reasoning in tax practice: the development of an assessment instrument
  • Mar 29, 2012
  • Elaine Doyle + 2 more

Ethics is an important issue in tax practice, with ethical dilemmas involving tax issues being identified by members of the American Institute of Certified Public Accountants as posing the most difficult ethical or moral problem for them (Finn et al. 1988: 607-9). Cognitive developmental psychologists believe that before an individual reaches a decision about how and whether to behave ethically in a specific situation, ethical or moral reasoning takes place at a cognitive level. The psychology of moral reasoning aims to understand how people think about moral dilemmas and the processes they use in approaching them. Kohlberg (1973) developed a six-stage model of moral reasoning based on concepts of social cooperation and justice. James Rest (1979a) subsequently developed a test, named the Defining Issues Test (DIT), which is based directly on Kohlberg's model and measures moral reasoning. The DIT is "a broad, general measure of moral reasoning" (Fisher 1997: 143), acceptable in dealing with personal issues in a social context (Fraedrich et al. 1994). However, concern has been expressed that it does not, and cannot, fairly represent the reasoning used in facing ethical dilemmas in a business context (Trevino 1986, 1992; Weber 1990; Elm and Weber 1994; Fraedrich et al. 1994; Welton et al. 1994; Dellaportas et al. 2006). Investigating moral reasoning in a particular context, therefore, requires an instrument that uses dilemmas from that context. This chapter describes the process of developing such an instrument, using a tax context-specific adaptation of Rest's well-known and validated DIT, to examine the ethical reasoning of tax practitioners. The development of this instrument was part of a larger project which combined the newly developed context-specific instrument with the short version of the original DIT (to investigate moral reasoning in the social and tax contexts), and disseminated it to both tax practitioners and non-tax specialists (the control group) in order to examine the moral reasoning of tax practitioners relative to non-specialists in both contexts. The use of a control group addressed one of the gaps in prior DIT research carried out on professionals. A control group was considered critical in this study to allow the results to be interpreted in relation to tax practitioners. For example, if tax practitioners should reason differently in social and tax contexts, this cannot be attributed to their professional status unless we know how those outside the profession behave; the result might just reflect the reasoning norms of society at large. The development of the tax-context instrument itself is the primary focus of this chapter.

  • Research Article
  • Cite count: 5
  • 10.1177/0007650316675611
Does It Matter How One Assesses Moral Reasoning? Differences (Biases) in the Recognition Versus Formulation Tasks
  • Oct 25, 2016
  • Business &amp; Society
  • James Weber

Most business ethics scholars interested in understanding individual moral cognition or reasoning rely on the Defining Issues Test (DIT). They typically report that managers and business students exhibit a relatively high percentage of principled moral reasoning when resolving ethical dilemmas. This article applies neurocognitive processes and Bloom’s Taxonomy of Educational Objectives, and its more recent revision, as theoretical foundations to explore whether differences emerge when using a recognition of learning task, such as the DIT or similar instruments, versus a formulation of knowledge task, such as the Moral Judgment Interview or similar instruments, to assess individual moral reasoning. The data show that significantly different levels of moral reasoning are detected when using a recognition-based versus formulation-based moral reasoning instrument. As expected, the recognition-based approach (using a DIT-like instrument) reports an inflated, higher moral reasoning score for subjects compared with using a formulation-based instrument. Implications of these results for understanding an individual’s moral reasoning are discussed.

  • Research Article
  • Cite count: 1
  • 10.1108/ce-12-2004-0003
Longitudinal Studies of Teacher Education Candidates’ Moral Reasoning and Related Promising Interventions
  • Dec 15, 2004
  • Journal of Research in Character Education
  • Alan J Reiman

Definitions of character are lacking in teacher education; however, many conceptions include moral reasoning as one facet of teacher ethical identity (Berkowitz, 1997). Yet there is little research regarding changes in teacher candidates' moral reasoning during an undergraduate experience, and few interventions have been studied that promote positive changes in teacher candidate moral reasoning as one domain of their professional and ethical identity. These research challenges are addressed in this study. Two undergraduate teacher education longitudinal samples are reported. Connections are made to the research literature on teacher candidates' gains in moral judgment reasoning as measured by the Defining Issues Test (Rest, 1986). In the two longitudinal samples, teacher candidates' moral judgment reasoning (P-score) was investigated over a period of 4 years. Moral judgment reasoning was one of three cognitive-developmental domains investigated in the larger study; however, only moral judgment reasoning is examined in the study reported here. The average gain across the two teacher education cohorts is 12.31, with an average effect size of .62; a gain of this size in moral judgment is large. The treatment (complex new social role-taking with guided inquiry) in the longitudinal studies is based on a promising approach for promoting deliberate psychological growth across multiple cognitive-developmental domains, including moral reasoning. Effect sizes from this promising approach to interventions are compared with longitudinal data effect sizes. Implications are then drawn for teacher education and conceptions of professional identity.

  • Book Chapter
  • Cite count: 1
  • 10.3233/faia250934
Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants
  • Oct 21, 2025
  • Alessio Galatolo + 3 more

The recent rise in popularity of large language models (LLMs) has prompted considerable concern about their moral capabilities. Although much effort has been dedicated to aligning LLMs with human moral values, existing benchmarks and evaluations remain largely superficial, typically measuring alignment based on final ethical verdicts rather than explicit moral reasoning. In response, this paper aims to advance the investigation of LLMs' moral capabilities by examining their capacity to function as Artificial Moral Assistants (AMAs), systems envisioned in the philosophical literature to support human moral deliberation. We assert that qualifying as an AMA requires more than what state-of-the-art alignment techniques aim to achieve: not only must AMAs be able to discern ethically problematic situations, they should also be able to actively reason about them, navigating between conflicting values beyond those embedded in the alignment phase. Building on existing philosophical literature, we begin by designing a new formal framework of the specific kind of behaviour an AMA should exhibit, individuating key qualities such as deductive and abductive moral reasoning. Drawing on this theoretical framework, we develop a benchmark to test these qualities and evaluate popular open LLMs against it. Our results reveal considerable variability across models and highlight persistent shortcomings, particularly regarding abductive moral reasoning. Our work connects theoretical philosophy with practical AI evaluation while also emphasising the need for dedicated strategies to explicitly enhance moral reasoning capabilities in LLMs.

  • Book Chapter
  • 10.3233/faia251250
LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams
  • Oct 21, 2025
  • Aju Ani Justus + 1 more

A critical challenge in modelling Heterogeneous-Agent Teams is training agents to collaborate with teammates whose policies are inaccessible or non-stationary, such as humans. Traditional approaches rely on expensive human-in-the-loop data, which limits scalability. We propose using Large Language Models (LLMs) as policy-agnostic human proxies to generate synthetic data that mimics human decision-making. To evaluate this, we conduct three experiments in a grid-world capture game inspired by Stag Hunt, a game theory paradigm that balances risk and reward. In Experiment 1, we compare decisions from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and Mixtral 8x22B models. LLMs, prompted with game-state observations and reward structures, align more closely with experts than participants, demonstrating consistency in applying underlying decision criteria. Experiment 2 modifies prompts to induce risk-sensitive strategies (e.g. “be risk averse”). LLM outputs mirror human participants’ variability, shifting between risk-averse and risk-seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic grid-world where the LLM agents generate movement actions. LLMs produce trajectories resembling human participants’ paths. While LLMs cannot yet fully replicate human adaptability, their prompt-guided diversity offers a scalable foundation for simulating policy-agnostic teammates.

  • Research Article
  • 10.5014/ajot.2020.74s1-po9019
Comparing Moral Reasoning Across Graduate Occupational and Physical Therapy Students and Practitioners
  • Aug 1, 2020
  • The American Journal of Occupational Therapy
  • Brenda S Howard + 4 more

Date Presented: 03/28/20. Investigators compared moral reasoning between OT and physical therapy students and practitioners using the Defining Issues Test-2 and found a significant difference in consolidated (set) versus transitional (forming) patterns of moral reasoning between practitioners and students. The difference occurred between second-year students and practitioners, suggesting that fieldwork experiences enhanced moral reasoning patterns. Implications include providing ethics education support during fieldwork experiences. Primary Author and Speaker: Brenda Howard. Additional Authors and Speakers: Cheyenne Kern, Olivia Milliner, Lindsey Newhart, Sarah Burke.

  • Research Article
  • Cite count: 12
  • 10.1038/s41598-025-86510-0
AI language model rivals expert ethicist in perceived moral expertise
  • Feb 3, 2025
  • Scientific Reports
  • Danica Dillion + 3 more

People view AI as possessing expertise across various fields, but the perceived quality of AI-generated moral expertise remains uncertain. Recent work suggests that large language models (LLMs) perform well on tasks designed to assess moral alignment, reflecting moral judgments with relatively high accuracy. As LLMs are increasingly employed in decision-making roles, there is a growing expectation for them to offer not just aligned judgments but also demonstrate sound moral reasoning. Here, we advance work on the Moral Turing Test and find that Americans rate ethical advice from GPT-4o as slightly more moral, trustworthy, thoughtful, and correct than that of the popular New York Times advice column, The Ethicist. Participants perceived GPT models as surpassing both a representative sample of Americans and a renowned ethicist in delivering moral justifications and advice, suggesting that people may increasingly view LLM outputs as viable sources of moral expertise. This work suggests that people might see LLMs as valuable complements to human expertise in moral guidance and decision-making. It also underscores the importance of carefully programming ethical guidelines in LLMs, considering their potential to influence users’ moral reasoning.
