Abstract
Background: The potential of large language models (LLMs) for psychological diagnostics requires systematic evaluation. Objective: To investigate the conditions under which LLMs produce reliable and valid psychological assessments, focusing on suicide risk evaluation in clinical data, by comparing LLM-generated ratings with human expert ratings across configurations. Methods: We analyzed 100 youth crisis conversation transcripts rated by four experts using the Nurses' Global Assessment of Suicide Risk (NGASR). Using Mixtral-8x7B-Instruct, we generated ratings across three temperature settings and three prompting styles (zero-shot, few-shot, chain-of-thought). Across configurations, we compared (a) inter-rater reliability of AI-generated NGASR risk categories and sum scores, (b) LLM-to-human agreement on sum scores, risk categories, and individual items, using Krippendorff's α, and (c) classification metrics for risk categories and individual items against human ratings. Results: LLM configuration strongly influenced assessment reliability. Zero-shot prompting at temperature 0 yielded perfect inter-rater reliability (α = 1.00, 95% CI [1.00, 1.00] for high and very high risk), while few-shot prompting showed the best human-AI agreement for very high risk (α = 0.78, 95% CI [0.67, 0.89]) and the strongest classification performance (balanced accuracy 0.54-0.71). Lower temperatures consistently improved reliability and accuracy. However, critical clinical items showed poor validity. Discussion: Our findings establish optimal conditions (zero temperature, task-specific prompting) for LLM-based psychological assessment. However, inconsistent performance on clinical items and only moderate LLM-to-human agreement limit LLMs to initial screening rather than detailed assessment, requiring careful parameter control and validation.
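To make the reported agreement statistics concrete, the following is a minimal, hypothetical Python sketch (not the study's analysis code) of how Krippendorff's α and balanced accuracy could be computed for LLM and expert ratings; the rating values, run names, and use of the third-party `krippendorff` and `scikit-learn` packages are illustrative assumptions.

```python
import numpy as np
import krippendorff
from sklearn.metrics import balanced_accuracy_score

# Hypothetical ordinal risk categories (0 = low ... 3 = very high) for 10 transcripts.
human_ratings = [3, 2, 0, 1, 3, 2, 2, 0, 1, 3]   # consensus expert rating
llm_run_1     = [3, 2, 0, 1, 3, 2, 1, 0, 1, 3]   # LLM rating, run 1 (e.g., temperature 0)
llm_run_2     = [3, 2, 0, 1, 3, 2, 1, 0, 1, 3]   # LLM rating, run 2 (same configuration)

# (a) Inter-rater reliability across repeated LLM runs:
# rows = raters (runs), columns = units (transcripts), ordinal scale.
alpha_llm = krippendorff.alpha(
    reliability_data=np.array([llm_run_1, llm_run_2]),
    level_of_measurement="ordinal",
)

# (b) LLM-to-human agreement for the risk category.
alpha_llm_human = krippendorff.alpha(
    reliability_data=np.array([human_ratings, llm_run_1]),
    level_of_measurement="ordinal",
)

# (c) Classification against human ratings, here for a binarized
# "very high risk" class; class imbalance motivates balanced accuracy.
y_true = [1 if r == 3 else 0 for r in human_ratings]
y_pred = [1 if r == 3 else 0 for r in llm_run_1]
bacc = balanced_accuracy_score(y_true, y_pred)

print(f"alpha (LLM runs): {alpha_llm:.2f}")
print(f"alpha (LLM vs. human): {alpha_llm_human:.2f}")
print(f"balanced accuracy (very high risk): {bacc:.2f}")
```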