Integrating Ensemble Clustering and Text Embeddings for Estimating the Factor Loadings of Self-Report Scales.
Advances in large language models can provide opportunities to evaluate the characteristics of scales prior to data collection. In this study, we explore whether item text can be used to predict a scale's psychometric properties. Specifically, we examine whether clustering consensus (i.e., the frequency with which items are grouped with other items from the same underlying factor across multiple clustering algorithms) and a cosine similarity metric (i.e., the semantic similarity of items to other items from the same factor) can be used to predict exploratory factor analysis (EFA) factor loadings. Across six scales with varying sample sizes and numbers of factors and items, we found that both the cosine similarity and ensemble clustering consensus methods predicted factor loading values. While the methods share some conceptual and empirical overlap, and results vary by scale, the ensemble clustering approach explains incremental variance above and beyond cosine similarity in predicting factor loadings. Using both methods in conjunction can be a useful way to identify problematic items prior to data collection and help researchers develop more optimal scales from the outset, thereby potentially saving time and resources and increasing the likelihood of developing sound measures.
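The two item-level predictors described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embeddings, the factor assignments, the two cluster partitions, and the exact way consensus is counted are all assumptions made for the example, and every factor is assumed to contain at least two items.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def within_factor_similarity(embeddings, factors):
    """Mean cosine similarity of each item to the other items of its factor."""
    scores = {}
    for item, vec in embeddings.items():
        peers = [o for o in embeddings if o != item and factors[o] == factors[item]]
        scores[item] = sum(cosine(vec, embeddings[p]) for p in peers) / len(peers)
    return scores

def clustering_consensus(partitions, factors):
    """Share of (partition, same-factor peer) pairs in which the item
    lands in the same cluster as that peer (one possible operationalization)."""
    scores = {}
    for item in factors:
        peers = [o for o in factors if o != item and factors[o] == factors[item]]
        hits = sum(1 for part in partitions for p in peers if part[item] == part[p])
        scores[item] = hits / (len(partitions) * len(peers))
    return scores

# Hypothetical toy data: "a3" is worded unlike the rest of factor A.
embeddings = {
    "a1": [1.0, 0.0],
    "a2": [1.0, 0.0],
    "a3": [0.0, 1.0],
    "b1": [0.0, 1.0],
    "b2": [0.0, 1.0],
}
factors = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
# One dict per clustering algorithm, mapping item -> cluster id (hypothetical).
partitions = [
    {"a1": 0, "a2": 0, "a3": 1, "b1": 1, "b2": 1},
    {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1},
]

similarity = within_factor_similarity(embeddings, factors)
consensus = clustering_consensus(partitions, factors)
```

In this toy setup the odd item "a3" scores lowest on both measures, which is the kind of signal the abstract suggests could flag a problematic item before data collection.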
- Research Article
- 10.25136/2409-8698.2024.4.70455
- Apr 1, 2024
- Litera
The subject of the study is the analysis and improvement of methods for determining the relevance of project names to the information content of purchases using large language models. The object of the study is a database containing the names of projects and purchases in the electric power industry, collected from open sources. The author examines the use of TF-IDF and cosine similarity metrics for primary data filtering, and describes the integration and evaluation of large language models such as GigaChat, GPT-3.5, and GPT-4 in text data matching tasks. Special attention is paid to methods of refining the similarity of names through reflection introduced into the prompts of large language models, which makes it possible to increase the accuracy of data comparison. The study uses TF-IDF and cosine similarity for primary data analysis, and the GigaChat, GPT-3.5, and GPT-4 large language models for detailed verification of the relevance of project names and purchases, including reflection in model prompts to improve the accuracy of results. The novelty of the research lies in a combined approach to determining the relevance of project names and purchases, which pairs traditional methods of processing text information (TF-IDF, cosine similarity) with the capabilities of large language models. A particular contribution of the author is the proposed methodology for improving the accuracy of data comparison by refining the results of primary selection using GPT-3.5 and GPT-4 with optimized prompts, including reflection. 
The main conclusions of the study are confirmation of the prospects of using the developed approach in the tasks of information support for procurement processes and project implementation, as well as the possibility of using the results obtained for the development of text data mining systems in various sectors of the economy. The study showed that the use of language models makes it possible to improve the value of the F2 measure to 0.65, which indicates a significant improvement in the quality of data comparison compared with basic methods.
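For readers unfamiliar with the F2 measure, the sketch below assumes it refers to the standard F-beta score with beta = 2, which weights recall more heavily than precision. The precision and recall values used are illustrative only and are not taken from the study.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values: the same pair of numbers, swapped between the two roles.
recall_heavy = f_beta(precision=0.5, recall=1.0)     # about 0.833
precision_heavy = f_beta(precision=1.0, recall=0.5)  # about 0.556
```

The asymmetry shows why F2 suits matching tasks where missing a relevant pair costs more than reviewing a false match.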
- Research Article
- 10.28945/5693
- Jan 1, 2026
- Journal of Information Technology Education: Research
Aim/Purpose: The study investigates the factors influencing the acceptance and utilisation of large language models (LLMs) (predictor variables of LLM usage), such as ChatGPT, in Learning design by instructional designers and university-teaching academics from various countries. Background: Large language models (LLMs) have exploded onto the scene, transforming the landscape of learning design. Instructional designers and university teaching academics have been overburdened with content creation for their teaching programmes, and the arrival of LLM models will help in this regard by developing more interactive content that drives student engagement and, in turn, contributes to student success. Since LLMs are a relatively new phenomenon, little is known about the factors influencing their acceptance in learning design; therefore, this research is needed, as learning design principles are the bedrock of student engagement and success. Methodology: A cross-sectional correlational quantitative study was employed. Data was collected using an online questionnaire posted on social media, including LinkedIn, from 203 instructional designers and university teaching academics. Purposive and snowball sampling methods were used to target instructional designers and university teaching academics at colleges and universities worldwide. Participants were asked to share the survey link with fellow instructional designers and university-teaching academics in their communities. The factor structure of the data was determined using exploratory factor analysis. Nonetheless, the factor structure derived from the LLMs did not entirely reflect the original configuration of the Unified Theory of Acceptance and Use of Technology (UTAUT3), as certain predictors appeared to coalesce, indicating LLMs’ unique nature in learning design. Confirmatory factor analysis was used to verify the fit of the data on the measurement model. 
First-order and second-order structural modelling were used to identify the structural relationships among the variables. Contribution: The study determines significant factors for the acceptance of LLMs by instructional designers and academic teaching staff in learning design, enabling possible opportunities for best practices in the field through interventions to optimize LLM usage. The study applies the technology acceptance model to the emerging LLM technology and extends the technology acceptance model by adding the trust construct as a predictor variable. Findings: The structural analysis results indicated that the ingrained LLM practices, LLM peer-driven expectations, innovative propensity towards LLM adoption, reliability and provider trust in LLMs, and ease of use and support influenced perceived LLM benefits and usage, but community standards and infrastructure had no influence. The second-order structural equation modelling indicated that perceived LLM benefits and usage and ingrained LLM habits contributed most to the learning design. Recommendations for Practitioners: Teaching academics and instructional designers must use LLMs in designing content, assessments, and interactive learning activities, and attend LLM training workshops on prompting and best practices in integrating LLMs into learning and teaching to see their benefits; hence, regular use of LLMs will then lead to trust and innovation in LLMs usage, enhancing learning design and improving student learning outcomes. Recommendation for Researchers: Researchers must use mixed methods approaches to have a deeper understanding of the factors influencing LLMs. Since habit and perceived LLM benefits and usage contributed the most variance to learning design, researchers must investigate strategies that optimise these factors in learning design, such as effective intervention strategies that can help form positive LLM habits. 
In addition, the findings provide researchers with a starting point for future research. Further researchers must investigate interventions that optimise the influence of personal innovativeness and trust that contributed the least variance to learning design, hence unlocking the potential of LLMs in learning design through innovation, responsible, and ethical use. Impact on Society: The use of LLMs in learning design has a high possibility of transforming education, specifically the learning design landscape. Using LLMs will free up more time for teaching academics and instructional designers so that they spend more time on higher-order thinking skill demands. Consequently, the students will be exposed to more engaging and interactive content, resulting in improved learning outcomes. Future Research: Future research must include context-derived external variables in technology acceptance models, such as levels of prompting competencies, to provide a deeper understanding of LLMs. In addition, future research must be based on the application and impact of LLMs on student engagement and success, and their attainment of 21st-century skills.
- Research Article
- 10.1108/mlag-01-2025-0001
- Nov 28, 2025
- Machine Learning and Data Science in Geotechnics
Purpose The purpose of this study is to explore the application of automated grading systems in geotechnics using large language models (LLMs) and cosine similarity for enhanced assessment and educational content generation. By training and testing LLMs on synthetic and real student data, the study seeks to develop robust systems for grading technical reports and open-ended questions, aligned with industry standards. Additionally, it aims to enhance student learning through auto-grading, immediate feedback and content generation, while addressing ethical considerations such as data privacy and fairness. Ultimately, the study strives to demonstrate the potential of LLMs to improve consistency, efficiency and educational outcomes. Design/methodology/approach The study employs a mixed-methods approach to develop and validate automated grading systems in geotechnics. Initially, correct answers were generated manually and synthetically using a generative pre-trained transformer model, with synthetic answers compared to correct ones via cosine similarity. Real student answers underwent similar evaluation. A Web-based tool was created to assess responses in real-time, providing dynamic feedback. Additionally, LLMs were fine-tuned on geotechnics textbooks and validated using synthetic and real student data. Anonymized student project reports were graded automatically, showcasing the potential and limitations of LLMs in consistent grading and educational content generation. Ethical considerations were addressed throughout. Findings The study demonstrated the potential of LLMs in geotechnics education by developing ML-driven systems for grading and content generation. The grading system, using cosine similarity and LLMs, provided consistent and objective assessments comparable to human graders. Immediate feedback on open-ended questions enhanced learning outcomes, enabling students to address knowledge gaps effectively. 
Fine-tuning LLMs with geotechnics textbooks and industry standards facilitated the generation of accurate, relevant questions and answers, further improved by retrieval-augmented generation (RAG). Data augmentation techniques enhanced model robustness, while ethical considerations, including data privacy, fairness, and transparency, ensured responsible deployment and fostered trust among stakeholders. Originality/value This study offers originality and value by pioneering the application of LLMs and cosine similarity for automated grading in geotechnics education, a domain with limited exploration in educational technology. By integrating RAG and fine-tuning LLMs with domain-specific textbooks, it bridges the gap between advanced machine learning techniques and practical applications in engineering education. The development of real-time feedback tools and robust grading systems enhances both student learning and instructional efficiency. Furthermore, addressing ethical considerations such as fairness and data privacy sets a precedent for responsible artificial intelligence (AI) deployment, contributing to the broader adoption of AI in academia.
- Research Article
- 10.1001/jamanetworkopen.2026.2750
- Mar 23, 2026
- JAMA Network Open
Large language models (LLMs) are increasingly applied to mental health contexts, yet their capacity to generate responses that align with evidence-based psychotherapy remains uncertain. Motivational interviewing (MI), a structured counseling approach, provides an empirically grounded setting for evaluating alignment between LLM-generated and human therapist responses. The objective was to evaluate how closely an LLM's responses align with therapist responses in MI sessions, using automated similarity metrics. This cross-sectional study used high-fidelity therapist-client transcripts annotated with the Motivational Interviewing Treatment Integrity system. Transcripts were sourced from publicly available counseling videos. For each therapist turn, the GPT-4o LLM generated a response using a standardized, MI-informed prompt based on the preceding conversation context. Analyses were conducted between March and May 2025. Alignment between LLM-generated and therapist responses was assessed using (1) cosine similarity based on sentence embeddings to capture semantic overlap and (2) DeepEval, a contextual deep-learning-based metric assessing coherence and contextual appropriateness. A therapist topic-consistency index quantified within-session thematic coherence and was examined as a moderator of alignment. A total of 3706 therapist turns from 154 MI sessions were evaluated. Mean (SD) DeepEval scores were higher than mean (SD) cosine similarity scores (0.72 [0.31] vs 0.29 [0.20]; P < .001), suggesting limited semantic overlap despite greater contextual appropriateness. Therapist topic consistency significantly moderated similarity: cosine similarity was higher in high-consistency than low-consistency sessions (mean [SD] difference, 0.027 [0.007]; t(3706) = 3.987; P < .001), as was DeepEval score (mean [SD] difference, 0.038 [0.010]; t(3706) = 3.747; P < .001). 
Correlation between metrics was negligible (Spearman ρ, -0.01), indicating that they captured distinct aspects of response alignment. LLM performance declined slightly across longer conversations (mean [SD] slope reduction for cosine similarity, -0.0005 [0.0016], and for DeepEval, -0.0005 [0.0022]), with increased verbosity and signs of reduced contextual grounding. In this cross-sectional study of 154 MI sessions, prompted LLMs showed general alignment with therapist responses in MI-oriented conversations, as judged by automated similarity metrics. However, limitations in long-range coherence, stylistic alignment, and the use of indirect proxies for therapeutic quality highlight the need for improved prompt design, MI-specific evaluation methods, and clinical validation before integration into mental health care.
- Research Article
- Jan 1, 2024
- AMIA ... Annual Symposium proceedings. AMIA Symposium
The rapid development of Large Language Models (LLMs) has opened up new possibilities for their role in supporting research. This study assesses whether LLMs can generate "thoughtful" research plans in the domain of Medical Informatics and whether LLM-generated critiques can improve such plans. Using an LLM pipeline, we prompt four LLMs to generate primary research plans. Subsequently, these plans are mutually critiqued and then the LLMs are prompted to refine their outputs based on these critiques. These original and improved responses are then reviewed by human evaluators for errors, hallucinations, etc. We employ ROUGE scores, cosine similarity, and length differences to quantify similarities across responses. Our findings reveal variations in outputs among four LLMs, the impact of critiques, and differences between primary and secondary outputs. All LLMs produce cogent outputs and critiques, integrating feedback when generating improved outputs. Human evaluators can distinguish between primary and secondary responses in most cases.
- Research Article
- 10.69693/jesa.v1i1.2
- Mar 13, 2024
- Journal of Engineering and Science Application
Students are an essential component of a college. The college opens registration for new students each year, and more than 1,000 prospective students register annually. Because of this, the new student admissions committee is constantly overwhelmed when responding to campus-related questions. As a result, developing a chatbot to assist new students is necessary. Developing a chatbot with a retrieval-model approach requires choosing the best similarity method. This study compares similarity methods for a new student admission chatbot built on the retrieval-based concept. The cosine, Jaccard, Dice, Euclidean, Manhattan, Canberra, and Chebyshev similarity methods are compared. The data, drawn from Universitas Pahlawan Tuanku Tambusai, consist of information for new students and study program accreditation. There are 41 pieces of information, each made up of a label and its information. According to the test results, the Dice and cosine similarity methods are the most effective. On all tested thresholds, Dice and cosine similarity achieved an F1-score above 80%. Recall produces very good results, reaching 100%. Good results are reliably achieved over 75% of the time. This demonstrates that the retrieval-model concept can be applied
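A retrieval-based lookup of the kind compared above might look like the following sketch. It assumes whitespace tokenization and set-based versions of three of the measures (Jaccard, Dice, and cosine on binary token vectors). The knowledge entries, the labels, and the 0.5 threshold are hypothetical and are not the study's data or settings.

```python
import math

def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately simple tokenizer)."""
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a and b else 0.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a and b else 0.0

def cosine(a, b):
    """Cosine similarity of binary bag-of-words vectors, written on sets."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def retrieve(query, knowledge, measure, threshold=0.5):
    """Return (label, score) of the best-matching entry, or (None, score)
    when the best score falls below the threshold."""
    q = tokens(query)
    best_label, best_score = None, 0.0
    for label, text in knowledge.items():
        score = measure(q, tokens(text))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score

# Hypothetical knowledge base: label -> stored question text.
knowledge = {
    "fees": "how much is the tuition fee",
    "deadline": "when is the registration deadline",
}

match_label, match_score = retrieve("when is the deadline", knowledge, dice)
miss_label, miss_score = retrieve("where is parking", knowledge, dice)
```

Swapping `dice` for `jaccard` or `cosine` in the `retrieve` call is enough to compare measures on the same queries, which mirrors the comparison the abstract describes.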
- Research Article
- 10.1609/aaai.v39i12.33334
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Involving collaborative information in Large Language Models (LLMs) is a promising technique for adapting LLMs for recommendation. Existing methods achieve this by concatenating collaborative features with text tokens into a unified sequence input and then fine-tuning to align these features with LLM's input space. Although effective, in this work, we identify two limitations when adapting LLMs to recommendation tasks, which hinder the integration of general knowledge and collaborative information, resulting in sub-optimal recommendation performance. (1) Fine-tuning LLM with recommendation data can undermine its inherent world knowledge and fundamental competencies, which are crucial for interpreting and inferring recommendation text. (2) Incorporating collaborative features into textual prompts disrupts the semantics of the original prompts, preventing LLM from generating appropriate outputs. In this paper, we propose a new paradigm, Collaborative LoRA (CoRA), with a collaborative query generator. Rather than input space alignment, this method aligns collaborative information with LLM's parameter space, representing them as incremental weights to update LLM's output. This way, LLM perceives collaborative information without altering its general knowledge and text inference capabilities. Specifically, we employ a collaborative filtering model to extract user and item embeddings and inject them into a set number of learnable queries. We then convert collaborative queries into collaborative weights with low-rank properties and merge the collaborative weights into LLM's weights, enabling LLM to perceive the collaborative signals and generate personalized recommendations without fine-tuning or extra collaborative tokens in prompts. Extensive experiments confirm that CoRA effectively integrates collaborative information into LLM, enhancing recommendation performance.
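The idea of merging low-rank weights into a frozen weight matrix can be illustrated with the generic low-rank update W' = W + scale * (B A). This sketch shows only that merge step with hand-written toy matrices. It does not reproduce CoRA's collaborative query generator, and the names `merge_low_rank`, `b`, `a`, and `scale` are assumptions for the example.

```python
def matmul(x, y):
    """Plain-Python matrix product of x (m x r) and y (r x n)."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]

def merge_low_rank(weight, b, a, scale=1.0):
    """Return weight + scale * (b @ a) without modifying the original weight."""
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2 x 2 base weight and a rank-1 update (all values hypothetical).
base_weight = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # 2 x 1
a = [[0.5, 0.0]]     # 1 x 2

merged = merge_low_rank(base_weight, b, a)
```

Because the update is added at merge time, the base weights themselves stay untouched, which matches the abstract's point about preserving the model's general knowledge.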
- Research Article
- 10.3390/electronics13071361
- Apr 4, 2024
- Electronics
The purpose of this paper is to explore the implementation of retrieval-augmented generation (RAG) technology with open-source large language models (LLMs). A dedicated web-based application, PaSSER, was developed, integrating RAG with Mistral:7b, Llama2:7b, and Orca2:7b models. Various software instruments were used in the application’s development. PaSSER employs a set of evaluation metrics, including METEOR, ROUGE, BLEU, perplexity, cosine similarity, Pearson correlation, and F1 score, to assess LLMs’ performance, particularly within the smart agriculture domain. The paper presents the results and analyses of two tests. One test assessed the performance of LLMs across different hardware configurations, while the other determined which model delivered the most accurate and contextually relevant responses within RAG. The paper discusses the integration of blockchain with LLMs to manage and store assessment results within a blockchain environment. The tests revealed that GPUs are essential for fast text generation, even for 7b models. Orca2:7b on Mac M1 was the fastest, and Mistral:7b had superior performance on the 446 question–answer dataset. The discussion is on technical and hardware considerations affecting LLMs’ performance. The conclusion outlines future developments in leveraging other LLMs, fine-tuning approaches, and further integration with blockchain and IPFS.
- Research Article
- 10.1145/3765895
- Sep 3, 2025
- ACM Transactions on the Web
There is growing interest in understanding how people interact with large language models (LLMs) and whether such models elicit dependency or even addictive behaviour. Validated tools to assess the extent to which individuals may become dependent on LLMs are scarce and primarily build on classic behavioral addiction symptoms, adapted to the context of LLM use. We view this as a conceptual limitation, as the LLM-human relationship is more nuanced and warrants a fresh and distinct perspective. To address this gap, we developed and validated a new 12-item questionnaire to measure LLM dependency, referred to as LLM-D12. The scale was based on the authors' prior theoretical work, with items developed accordingly and responses collected from 526 participants in the UK. Exploratory and confirmatory factor analyses, performed on separate halves of the total sample using a split-sample approach, supported a two-factor structure: Instrumental Dependency (six items) and Relationship Dependency (six items). Instrumental Dependency reflects the extent to which individuals rely on LLMs to support or collaborate in decision-making and cognitive tasks. Relationship Dependency captures the tendency to perceive LLMs as socially meaningful, sentient, or companion-like entities. The two-factor structure demonstrated excellent internal consistency and clear discriminant validity. External validation confirmed both the conceptual foundation and the distinction between the two subscales. The psychometric properties and structure of our LLM-D12 scale were interpreted in light of the emerging view that dependency on LLMs does not necessarily indicate dysfunction but may still reflect reliance levels that could become problematic in certain contexts.
- Research Article
- 10.1093/eurheartj/ehaf784.4506
- Nov 5, 2025
- European Heart Journal
Introduction: Large language models (LLMs) have the potential to realize accurate risk stratification and disease prediction by integrating multimodal data, such as electronic health records, medical images, and genomic profiles. However, in complex tasks like atherosclerosis risk estimation, the design of the prompt is critical to the performance of LLMs. Several studies have demonstrated that in-context learning, where the prompt contains some cases, improves the performance of LLMs when the contained cases are carefully selected. Because manual selection is usually very costly, automatic case selection is desirable. Methods: This study aims to improve the effectiveness of in-context learning in LLMs for atherosclerosis risk prediction by introducing a strategy for case selection. Our method involves two key components: feature selection and case selection. In the feature selection phase, we compute the mutual information between the clinical features and atherosclerosis risk using the database to extract the essential clinical features. This feature selection removes the unimportant features that disturb the performance of the case selection mechanism. Then, in the case selection phase, we select several cases from the database that are the most similar to the patient being diagnosed. This study compares three metrics to compute the similarity score: (1) Mahalanobis distance (trained via large margin nearest neighbor classification; LMNN [1]), (2) cosine similarity (between the vector representations in KNN-augmented in-context example selection; KATE [2]), and (3) Euclidean distance (between the prompt features). This combined strategy supplies the LLM with informative cases and thereby enhances the diagnostic performance. Results: We used a fine-tuned Llama3-8B model as our target language model and analyzed a dataset of 117,709 cases. Each case contained 96 features that were collected from annual health check-up data. 
From this dataset, 1,000 cases each were randomly selected for the validation and test datasets. The remaining data were used as the database in our system. We considered a binary classification task and assigned the positive label when the Cardio-Ankle Vascular Index (CAVI) was greater than 8.0. The proportion of selected features was adjusted in 10% increments to maximize the F1 score on the validation dataset. The results show that the proposed method, which combines strategic feature selection and case selection, significantly improved the F1 score on the test dataset by around 2% compared with zero-shot prompting and random selection (see the attached table). Among the three similarity metrics, Euclidean distance yielded the highest F1 scores. We also observed that the feature selection phase with the adjusted number of selected features improved the F1 score in almost all settings. Conclusion: Our strategy for feature and case selection improved the performance of in-context learning for LLM-based atherosclerosis risk prediction.
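The case selection step can be sketched as a nearest-neighbor lookup restricted to the selected features. The example below uses Euclidean distance, the metric the abstract reports as performing best. The toy cases, feature values, and the choice of which feature indices are kept are hypothetical, and features are left unscaled for brevity.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_cases(patient, database, feature_idx, k=2):
    """Return the k database cases closest to the patient,
    comparing only the features listed in feature_idx."""
    def project(features):
        return [features[i] for i in feature_idx]

    target = project(patient)
    ranked = sorted(
        database,
        key=lambda case: euclidean(target, project(case["features"])),
    )
    return ranked[:k]

# Hypothetical database; the third feature is treated as uninformative
# and dropped by the (assumed) feature selection step.
database = [
    {"id": "c1", "features": [60.0, 130.0, 5.0], "label": 1},
    {"id": "c2", "features": [35.0, 110.0, 4.0], "label": 0},
    {"id": "c3", "features": [58.0, 128.0, 9.0], "label": 1},
    {"id": "c4", "features": [40.0, 115.0, 5.5], "label": 0},
]
patient = [59.5, 129.5, 1.0]

selected = select_cases(patient, database, feature_idx=[0, 1], k=2)
selected_ids = [case["id"] for case in selected]
```

The selected cases would then be written into the prompt as in-context examples. In practice the features would likely need standardizing first so that no single scale dominates the distance.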
- Research Article
- 10.1016/j.sbspro.2011.04.053
- Jan 1, 2011
- Procedia - Social and Behavioral Sciences
Reliability and validity of the Turkish Version of the UCLA Loneliness Scale (ULS-8) among university students
- Research Article
- 10.24106/kefdergi.1797602
- Oct 11, 2025
- Kastamonu Eğitim Dergisi
Purpose: This study aimed to develop a valid and reliable measurement tool by evaluating the psychometric properties of the Mathematics Learning Disabilities Screening Scale (MLDSS) for elementary and secondary school students. In this context, a measurement tool was developed that is culturally and linguistically appropriate for the Turkish context and overcomes the limitations of translated scales. Design/Methodology/Approach: The study was conducted as a scale development study within a survey research design. The scale's items were developed based on DSM-5 criteria and an extensive literature review. The sample consisted of 644 students, identified by their teachers, from 120 schools across Türkiye's seven geographical regions. The psychometric properties of the scale were evaluated using Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA), along with internal consistency (Cronbach's Alpha) and test-retest reliability analyses. Findings: The EFA results revealed a three-factor structure (Number Sense, Calculation, and Mathematical Reasoning) that explained 68.5% of the total variance. The CFA confirmed this structure, with goodness-of-fit indices indicating an excellent model fit (e.g., CFI = .95, IFI = .95, RMSEA = .069, χ²/df = 2.41). The scale demonstrated high internal consistency (Cronbach's Alpha = .93) and strong test-retest reliability (r = .90). Highlights: The Mathematics Learning Disabilities Screening Scale is a valid and reliable instrument developed for elementary and secondary school students. Its three-factor structure is consistent with modern theories of mathematics learning disabilities, and its strong psychometric properties make it a valuable tool for educators and experts in early identification and intervention planning.
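As a reminder of how the internal consistency statistic reported above is computed, here is a minimal sketch of Cronbach's alpha using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The response matrices are toy data, not the study's.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.
    Sample variances are used for both the items and the total scores."""
    k = len(responses[0])
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: three respondents answering perfectly consistent items.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Toy data: two items that only partly agree.
mixed = [[1, 2], [2, 1], [3, 3]]

alpha_consistent = cronbach_alpha(consistent)  # 1.0
alpha_mixed = cronbach_alpha(mixed)            # about 0.667
```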
- Preprint Article
- 10.1101/2024.10.20.619308
- Oct 23, 2024
- bioRxiv : the preprint server for biology
The integrative analysis of gene sets, networks, and pathways is pivotal for deciphering omics data in translational biomedical research. To significantly increase gene coverage and enhance the utility of pathways, annotated gene lists, and gene signatures from diverse sources, we introduced pathways, annotated gene lists, and gene signatures (PAGs) enriched with metadata to represent biological functions. Furthermore, we established PAG-PAG networks by leveraging gene member similarity and gene regulations. However, in practice, high similarity in functional descriptions or gene membership often leads to redundant PAGs, hindering the interpretation from a fuzzy enriched PAG list. In this study, we developed todenE (topology-based and density-based ensemble) clustering, pioneering in integrating topology-based and density-based clustering methods to detect PAG communities leveraging the PAG network and Large Language Models (LLM). In computational genomics annotation, the genes can be grouped/clustered through the gene relationships and gene functions via guilt by association. Similarly, PAGs can be grouped into higher-level clusters, forming concise functional representations called Super-PAGs. TodenE captures PAG-PAG similarity and encapsulates functional information through LLM, in characterizing network-based functional Super-PAGs. In synthetic data, we introduced a metric called the Disparity Index (DI), measuring the connectivity of gene neighbors to gauge clusterability. We compared multiple clustering algorithms to identify the best method for generating performance-driven clusters. In non-simulated data (Gene Ontology), by leveraging transfer learning and LLM, we formed a language-based similarity embedding. TodenE utilizes this embedding together with the topology-based embedding to generate putative Super-PAGs with superior performance in semantic and gene member inclusiveness.
- Conference Article
- 10.1109/adacis65663.2025.11436703
- Nov 20, 2025
Financial literacy is a well-established area of research. The incorporation of Large Language Models (LLMs) into FinTech solutions has opened up a new avenue of research to determine how LLMs can be used to interact with users and improve financial literacy. Following previous research that focused purely on the GPT models, we have extended this work to investigate how Gemini, Copilot, and DeepSeek respond to basic accounting and finance questions from users, ranging from financially unsophisticated to expert. To investigate this, we use Cosine Similarity and the Flesch Reading Ease Score. The Cosine Similarity results show that the LLMs struggle with distinguishing between users, often defaulting to communicating as an expert. We also conduct a post-hoc analysis where the generated texts are analyzed by an accounting expert. We find that some LLM generated answers are misleading, which could place LLM users with little to no financial literacy at a significant disadvantage, and could lead to them making disastrous financial decisions.
- Research Article
- 10.1007/s00146-025-02487-4
- Aug 17, 2025
- AI & SOCIETY
This study explores how large language models (LLMs) can support deductive and inductive thematic coding in real-life contexts, balancing AI-driven efficiency with essential human oversight. Using three datasets from Tearfund, a UK-based Christian charity, we propose a dual-role human–LLM collaborative framework where the LLM functions as an initial annotator and a validator. In the deductive phase, GPT-4o and GPT-4o-mini were compared against human coders. GPT-4o achieved a substantial agreement in multi-label thematic categorization (κ = 0.61–0.65), while GPT-4o-mini showed a moderate agreement (κ = 0.41–0.58). Both models excelled in sentiment analysis (κ = 0.91–0.95), but struggled with evaluating evidence of impact due to contextual complexity (κ ≤ 0.01). GPT-4o-mini exhibited greater output variability and instability than GPT-4o, but benefited more from few-shot learning to mitigate hallucinations. In the inductive phase, GPT-4o demonstrated a strong semantic alignment with human-generated themes (cosine similarity = 0.76–0.79) though its tendency toward broad themes required human refinement. Despite their potential to streamline thematic analysis, LLMs also pose limitations and implementation challenges, including inconsistencies in excerpt extraction (precision = 0.41, recall = 0.53) and the trade-off between the time saved in coding and the time required for human validation. To facilitate practical implementation, we provide reusable prompt templates for four stages: context, instructions, data processing, and verification. Our findings underline the indispensable role of human expertise—from prompt engineering and managing hallucinations to final verification—to ensure accurate and trustworthy AI-assisted analyses. While LLMs can enhance qualitative analysis, their full potential is only realized under skilled human guidance.
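The agreement figures above are reported as κ values. The abstract does not say which kappa variant was used, so the sketch below assumes Cohen's kappa for the simplest case of two coders assigning one label per excerpt. The labels are toy data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who each assign one label per item.
    Undefined (division by zero) when chance agreement equals 1."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy sentiment codes from a human coder and a model (hypothetical).
human = ["pos", "pos", "neg", "neg"]
model = ["pos", "pos", "neg", "pos"]

kappa = cohen_kappa(human, model)  # 0.5
```

Here the coders agree on 3 of 4 items (0.75), but half of that agreement is expected by chance, so kappa is 0.5.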