Building trustworthy large language model-driven generative recommender system for healthcare decision support: A scoping review of corpus sources, customization techniques, and evaluation frameworks.
- Research Article
- 10.1093/geroni/igaf122.2248
- Dec 1, 2025
- Innovation in Aging
Large Language Model-Driven Generative Recommender Systems (LLM-GRSs) are increasingly transforming healthcare, particularly in question-answering systems. This study systematically reviewed their corpora sources, customization techniques, and evaluation metrics. A search of PubMed/MEDLINE, Embase, Scopus, and Web of Science identified 61 studies (2021–2024) using LLM-GRSs for medical information delivery. Corpus sources were categorized into real-world clinical resources (n = 24), literature materials (n = 34), open-source datasets (n = 33), and web-crawled data (n = 11), with 44 studies integrating multiple sources. Key model customization strategies included pre-training, prompt engineering, retrieval-augmented generation (RAG), fine-tuning, in-context learning, and offline learning. Fourteen studies used a single customization technique, while 41 studies combined these methods during model development. The evaluation metrics were classified into three main domains: 1) process metrics, 2) usability metrics, and 3) outcome metrics. The outcome metrics could also be divided into two categories: model-based outcomes and expert-assessed outcomes. The study identified critical gaps in corpus fairness, contributing to biases from geographic, cultural, and socio-economic factors. The reliance on unverified or unstructured data highlights the need for better integration of evidence-based clinical guidelines. Future research should focus on developing a tiered corpus architecture with vetted sources and dynamic weighting, while ensuring model transparency. Additionally, the lack of standardized evaluation frameworks for domain-specific models calls for comprehensive validation of LLM-GRSs in real-world healthcare settings.
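The counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative and do not come from the paper, and the figures are only those stated in the abstract. It shows that the corpus-source categories overlap (consistent with 44 studies integrating multiple sources) and that the single-technique and combined-technique groups do not add up to all 61 studies, a gap the abstract does not explain.

```python
# Figures as stated in the abstract; names are illustrative assumptions.
TOTAL_STUDIES = 61

corpus_sources = {
    "real-world clinical resources": 24,
    "literature materials": 34,
    "open-source datasets": 33,
    "web-crawled data": 11,
}

single_technique = 14      # studies using one customization technique
combined_techniques = 41   # studies combining several techniques

# Category counts sum to more than the number of studies because
# a study can draw on several corpus sources at once.
source_mentions = sum(corpus_sources.values())

# Studies not covered by either customization group in the abstract.
unaccounted = TOTAL_STUDIES - (single_technique + combined_techniques)

print(f"source-category mentions: {source_mentions} across {TOTAL_STUDIES} studies")
print(f"studies without a stated customization grouping: {unaccounted}")
```

The abstract does not say how the remaining studies were classified, so the sketch only surfaces the difference rather than explaining it.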
- Supplementary Content
69
- 10.15171/ijhpm.2018.43
- May 16, 2018
- International Journal of Health Policy and Management
Background: Patient, public, consumer, and community (P2C2) engagement in organization-, community-, and system-level healthcare decision-making is increasing globally, but its formal evaluation remains challenging. To define a taxonomy of possible P2C2 engagement metrics and compare existing evaluation tools against this taxonomy, we conducted a systematic review. Methods: A broad search strategy was developed for English language publications available from January 1962 through April 2015 in PubMed, Embase, Sociological Abstracts, PsycINFO, EconLit, and the gray literature. A publication was excluded if: (1) the setting was not healthcare delivery (ie, we excluded non-health sectors, such as urban planning; research settings; and public health settings not involving clinical care delivery); (2) the P2C2 engagement was episodic; or (3) the concept of evaluation or possible evaluation metrics were absent. To be included as an evaluation tool, publications had to contain an evaluative instrument that could be employed with minimal modification by a healthcare organization. Results: A total of 199 out of 3953 publications met exclusion and inclusion criteria. These were qualitatively analyzed using inductive content analysis to create a comprehensive taxonomy of 116 possible metrics for evaluating P2C2 engagement. The 44 outcome metrics were grouped into three domains (internal, external, and aggregate outcomes) that included six subdomains: impact on engagement participants, impact on services provided by the healthcare organization, impact on the organization itself, influence on the broader public, influence on population health, and engagement cost-effectiveness. The 72 process metrics formed four domains (direct process metrics; surrogate process metrics; aggregate process metrics; and preconditions for engagement) that comprised sixteen subdomains. We identified 23 potential tools for evaluating P2C2 engagement.
The identified tools were published between 1973 and 2015 and varied in their coverage of the taxonomy, methodology used (qualitative, quantitative, or mixed), and intended evaluators (organizational leaders, P2C2 participants, external evaluators, or some combination). Parts of the metric taxonomy were absent from all tools. Conclusions: By comprehensively mapping potential outcome and process metrics as well as existing P2C2 engagement tools, this review supports high-quality P2C2 engagement globally by informing the selection of existing evaluation tools and identifying gaps where new tools are needed. Systematic Review Registration: PROSPERO registration number CRD42015020317.
- Abstract
- 10.14309/01.ajg.0000778700.88872.e2
- Oct 1, 2021
- American Journal of Gastroenterology
Introduction: Upper GI bleeding (UGIB) is a common indication for inpatient esophagogastroduodenoscopy (EGD). Outcomes afterwards often depend in part on guideline-based post-EGD care. Our aim was to optimize and standardize documentation post-EGD in UGIB to improve clinical care. Methods: National guidelines were used to build optimized etiology- and severity-specific note templates at an academic tertiary referral center. 39 attendings and 15 fellows completed a 10-minute training session in template content & use. We collected pre- and post-intervention data on “minimal-standard” (MS) report documentation including patient disposition, diet, & medications. We also recorded documentation of rebleed precautions and follow-up procedures. Health outcomes measured included guideline-based medication prescriptions, ordering of follow-up EGD if indicated, and clinical cessation of bleeding after discharge. Results: Pre-intervention data demonstrated that 54% and 36% of 108 patients received guideline-based inpatient and outpatient proton pump inhibitor (PPI), respectively. At baseline, 67.6% were referred for standard-of-care repeat EGD and only 36.1% of reports met MS report criteria. After template implementation, of 309 EGDs for UGIB over 6 months, the templates were used in 72% of cases. Workload was reduced by a mean of 33 “clicks,” 356 free text characters and 2 minutes per report (Figure 1, Panel A). There was a significant improvement in documentation of disposition (63.0% to 74.1%, p=0.028), appropriate PPI use (57.1% to 69.8%, p=0.035), rebleed recommendations (22.2% to 44.3%, p<0.001), and MS report completion (27.8% to 40.8%, p=0.016). There was significant improvement in inpatient PPI administered (53.6% to 73.6%, p<0.001), discharge PPI prescription (35.7% to 54.0%, p=0.004), octreotide regimen (79.2% to 93.2%, p=0.048) and follow-up EGD order (67.6% to 86.7%, p<0.001) (Table 1).
Template usage (64%-79%), process (38%-50%) and outcome (67%-95%) metrics remained high over 6 months (Figure 1, Panel B & C). Inpatient PPI compliance (71.7% vs 63.4%; p=0.069) and follow-up EGD orders (85.7% vs 77.3%; p=0.028) were improved with template use. Conclusion: Our project leveraged endoscopy software to standardize efficient provider documentation, resulting in improved clinical care. Our intervention required minimal implementation cost, low burden of maintenance, and sustainability with high utilization rates over 6 months. Similar endoscopy templates can be applied to other health systems and endoscopic procedures to improve the quality of care.
Table 1: Process and outcome metrics pre- vs post-intervention. Statistical analysis using χ² testing.
Figure 1: Panel A: Sample endoscopy documentation template, Upper Gastrointestinal Bleed High Risk Non-Variceal. Panel B: Process metrics pre- vs post-intervention of template usage (circle), minimal-standard non-variceal report (square), minimal-standard report (all etiologies) (diamond), and minimal-standard variceal report (triangle). Panel C: Outcome metrics pre- vs post-intervention of compliance with coordinated repeat EGD (circle), inpatient PPI regimen (square) and discharge PPI regimen (triangle).
- Research Article
4
- 10.1016/j.cmpb.2023.107429
- Apr 18, 2023
- Computer methods and programs in biomedicine
Relating process and outcome metrics for meaningful and interpretable cannulation skill assessment: A machine learning paradigm
- Discussion
79
- 10.1111/acem.12716
- Jul 20, 2015
- Academic Emergency Medicine
What we have learned from a decade of ED crowding research.
- Conference Article
5
- 10.1115/detc2006-99642
- Jan 1, 2006
The overall objective of the study is to gain an insight into design ideation. Towards that goal we are empirically evaluating the effectiveness of design ideation methods. Key components of ideation methods have been identified and effectiveness metrics have been developed. This paper presents results of experiments conducted on six ideation components (Provocative Stimuli, Suspend Judgment, Flexible Representation, Frame of Reference Shifting, Incubation, and Provocative Stimuli). These experiments were conducted simultaneously at the Design (Engineering) and Lab (Cognitive Psychology) levels; a previously developed experimental procedure considered the alignment of experiments at these two levels. The understanding of ideation components was improved (some are stronger, some are easier to manipulate, interactions are complex, etc.). Data collected revealed that some ideation components have similar effects and could be grouped into higher-level (ideation) principles according to their effects. A distinction was made between process and outcome metrics, and it was found that outcome metrics were harder to improve than process metrics. A correlation was also found between quality and quantity; this supports the widely accepted belief that generating more ideas increases the chances of obtaining higher quality ideas. Copyright © 2006 by ASME
- Research Article
- 10.1161/hcq.12.suppl_1.103
- Apr 1, 2019
- Circulation: Cardiovascular Quality and Outcomes
Background: Team communication about hospital quality efforts in acute myocardial infarction and heart failure (AMI-HF) may affect compliance with hospital transitional care metrics. Methods: At 2 years, hospitals (n=35) participating in the Patient Navigator Program completed surveys on 5 types of communication (sharing meeting minutes, regular team meetings or conference calls with team leaders, a shared checklist, and electronic medical record (EMR)-directed communication) supporting program implementation. Results were assessed for association with 3 outcomes (30-day unadjusted AMI-HF readmission and in-hospital risk adjusted AMI death) and 14 processes: left ventricular systolic dysfunction evaluation, prescription of renin-angiotensin system and beta-blocker medications; identifying HF cases pre-discharge; medication reconciliation documentation on admission, discharge and both times [AMI-HF]; planned follow-up in 7 days [HF]; documentation of self-care education and when to call healthcare providers [AMI-HF]; and documentation of medication instructions, timing, and changes [AMI-HF]. In STEMI and NSTEMI, performance composites, overall defect free care and referral to cardiac rehabilitation were assessed. Univariate analyses were completed. Results: There were no differences in process or outcome metrics for sharing meeting minutes, regular team meetings or conference calls with leaders or using a shared checklist. EMR-directed communication was associated with a greater likelihood of discharge medication reconciliation (100% vs 68.4%, p =.027) and prescribed medication documentation (100% vs 66.7%, p =.024).
Sites that used 2-5 vs 0-1 communication types were more likely to identify patients with HF pre-discharge (100% vs 60%, p =.018), perform discharge medication reconciliation (100% vs 66.7%, p =.021), complete education documentation (93.3% vs 58.8%, p =.041) and medication instruction documentation (100% vs 64.7%, p =.019); but they were less likely to improve STEMI performance composite scores (37.5% vs 76.5%, p =.036). Conclusion: Team communication via EMR and using 2+ communication methods promoted some process metric improvements. Some communication methods may have had low use, and the analyses of process and outcome metrics that were unchanged may have been underpowered to detect differences.
- Supplementary Content
2
- 10.1016/j.onehlt.2024.100959
- Jan 2, 2025
- One Health
One Health interventions and challenges under rural African smallholder farmer settings: A scoping review
- Research Article
- 10.59645/tji.v4i1.420
- Dec 18, 2024
- The Journal of Informatics
eHealth systems have exploded in popularity worldwide in recent years, fundamentally altering how health services are delivered. However, there has been a long discussion about what usability metrics should be used to evaluate eHealth systems. This paper assesses the usability metrics most commonly applied in evaluating eHealth systems. A scoping review method was used, in which 15 papers were reviewed after being extracted from 2112 studies from PubMed, Emerald Insight, and SAGE. The search terms were "usability" in combination with "metrics", "evaluation metrics", "factors", "attributes", "framework", "models", "taxonomy", "eHealth", "health", "telehealth", and "mHealth". The study established that usability metrics, including ease of use, task-technology match, navigation, information quality, technical quality, guide and support, consistency, visibility, flexibility, accessibility, and collaboration, are the ones most commonly applied in evaluating eHealth systems’ usability. Although the metric named collaboration had a low frequency, this study recommends that it be used in assessing eHealth systems due to its necessity, since the healthcare process involves multiple healthcare professionals collaborating to deliver patient care. Additionally, the study revealed limited studies on the usability of eHealth systems in developing countries, specifically Africa. Moreover, the few African studies applied only generic usability metrics in evaluating eHealth systems, compared with studies from developed countries. Future studies should consider validating these metrics' applicability in developing countries with limited resources.
- Research Article
18
- 10.1016/j.jacr.2015.06.038
- Sep 11, 2015
- Journal of the American College of Radiology : JACR
Quality Measurements in Radiology: A Systematic Review of the Literature and Survey of Radiology Benefit Management Groups
- Research Article
- 10.1161/circoutcomes.8.suppl_2.248
- May 1, 2015
- Circulation: Cardiovascular Quality and Outcomes
Background: Personalized benchmarking of practice is assuming increasing importance to assure patient safety and quality, cost-effective care. Comparing performance with national and local benchmarks has value for providing feedback and opportunity for continued practice improvement to the practitioner. From a hospital and practice perspective, it can be used to identify outliers and potential concerns (attributable either to the system or to the individual), and potentially to reduce variance, validate care, and support “value purchasing” and managed care participation. Objectives: With the input of physicians, hospital administration, and CQI personnel, a tool was developed to allow assessment of individual and aggregate physician performance. The tool was designed to meet both personal and institutional objectives. It had to be timely, standardized, and accurate, with available national or institutional benchmarks. Both process and outcome indicators were felt to be appropriate for inclusion. Mandated, validated and relevant data collection was given preference for inclusion (ranked in order of preference: NYS DOH angioplasty data base, NCDR data, in-house non reported data) and objective hard outcomes were given preference over more subjectively collected information. A rolling 4 quarters of data was included for statistically infrequent events to minimize outliers due to sample inadequacy. In addition to tracking process and outcome metrics, utilization was also felt to be appropriate for inclusion. The developed tool incorporated data from multiple data sets with different validation turnaround times, which resulted in different reporting time intervals. A total of 38 measures were collected, including mortality, major comorbidity, quality and utilization concerns. Measures deviating from benchmarks or demonstrating substantial variance were targeted for drill down and intervention to improve care.
Results: A physician dashboard was developed and applied to 18 interventionalists, permitting benchmarking of individual and overall lab performance in meeting defined process, outcome and volume metrics. The dashboard facilitates individual and system benchmarking of performance, identifies areas of concerns, provides physician feedback, and promotes process improvement. Results are highlighted in green (goal met) or red (goal not met) for rapid assessment. The developed dashboard will be displayed. Conclusions: A dashboard was created to provide timely, rapid, visible assessment of comparative physician performance. The developed tool has proven useful to both hospital administration and physicians in monitoring and improving performance. The dashboard continues to evolve with continued periodic modifications addressing specific goals and concerns.
- Research Article
11
- 10.1017/dmp.2018.110
- Nov 13, 2018
- Disaster Medicine and Public Health Preparedness
The US Centers for Disease Control and Prevention (CDC)-funded Preparedness and Emergency Response Research Centers (PERRCs) conducted research from 2008 to 2015 aimed to improve the complex public health emergency preparedness and response (PHEPR) system. This paper summarizes PERRC studies that addressed the development and assessment of criteria for evaluating PHEPR and metrics for measuring their efficiency and effectiveness. We reviewed 171 PERRC publications indexed in PubMed between 2009 and 2016. These publications derived from 34 PERRC research projects. We identified publications that addressed the development or assessment of criteria and metrics pertaining to PHEPR systems and describe the evaluation methods used and tools developed, the system domains evaluated, and the metrics developed or assessed. We identified 29 publications from 12 of the 34 PERRC projects that addressed PHEPR system evaluation criteria and metrics. We grouped each study into 1 of 3 system domains, based on the metrics developed or assessed: (1) organizational characteristics (n = 9), (2) emergency response performance (n = 12), and (3) workforce capacity or capability (n = 8). These studies addressed PHEPR system activities including responses to the 2009 H1N1 pandemic and the 2011 tsunami, as well as emergency exercise performance, situational awareness, and workforce willingness to respond. Both PHEPR system process and outcome metrics were developed or assessed by PERRC studies. PERRC researchers developed and evaluated a range of PHEPR system evaluation criteria and metrics that should be considered by system partners interested in assessing the efficiency and effectiveness of their activities. Nonetheless, the monitoring and measurement problem in PHEPR is far from solved. Lack of standard measures that are readily obtained or computed at local levels remains a challenge for the public health preparedness field. (Disaster Med Public Health Preparedness. 2019;13:626-638).
- Preprint Article
- 10.5194/egusphere-egu25-3929
- Mar 14, 2025
Disasters have been growing larger in scale and more complex due to factors such as increased risks as a result of changes in social and economic environments, as well as shifts in infrastructure and living conditions. Under these conditions, it is crucial to minimize and prevent damage caused by disasters. In order to achieve this goal, proactive measures and sustained efforts are required to reduce risks and prevent their recurrence. The Ministry of the Interior and Safety, the general division of disaster management in South Korea, has been investigating the causes of disasters under the 「Framework Act on the Management of Disaster and Safety」. These investigations aim to identify the root causes of disasters and implement effective measures to prevent similar incidents in the future. From February 2014 to June 2024, the ministry identified 301 improvement tasks through these investigations and had relevant organizations carry out the necessary improvements and adjustments. As a result, by June 2024, 254 improvement tasks had been completed, resulting in a high implementation rate of approximately 84.4%. However, disasters such as storm and flood damage, as well as landslides, continue to recur due to underlying factors such as localized heavy rainfall, the failure to designate high-risk vulnerable areas, and insufficient management systems. Therefore, this study aims to develop implementation indicators as part of the post-management process for disaster cause investigations, with the objective of improving the effectiveness of implementing improvement measures. These efforts are intended to prevent the recurrence of similar disaster incidents. For this purpose, the study analyzed domestic and international cases of implementation monitoring in disaster management, developed implementation indicators for improvement tasks identified through disaster cause investigations, and established strategies for their application.
The implementation indicators were designed to enable evaluation at each stage of improvement task execution, incorporating input, process, output, and outcome metrics. They are divided into short-term indicators, which can be implemented immediately, and long-term indicators, which account for the disaster reduction effects achieved through task implementation. Short-term indicators are categorized into two types: evaluation of implementation planning and evaluation of implementation outcomes. Long-term implementation indicators were developed using 24 evaluation metrics to facilitate step-by-step assessments of input, process, output, and outcome stages. These indicators are designed to evaluate not only the implementation monitoring and evaluation process but also the transition from disaster recovery to prevention, thereby strengthening the feedback loop for a sustainable and virtuous cycle. A pilot test was conducted on a case from the joint government disaster cause investigation in South Korea in December 2022 to evaluate the appropriateness and applicability of the established implementation indicators. Applying the implementation indicators for improvement tasks developed in this study to evaluate the execution of improvement tasks is expected to contribute to establishing a virtuous cycle that transitions to proactive disaster prevention through post-disaster management.
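The implementation rate quoted in this abstract follows directly from the two task counts it reports. A minimal Python sketch of that arithmetic is shown here; the variable names are illustrative and the calculation is only an assumption about how the figure was derived, since the abstract does not give the formula.

```python
# Task counts as stated in the abstract (February 2014 to June 2024).
identified_tasks = 301   # improvement tasks identified by the investigations
completed_tasks = 254    # tasks completed by June 2024

# Implementation rate as a simple completed / identified percentage.
implementation_rate = completed_tasks / identified_tasks * 100

print(f"implementation rate: {implementation_rate:.1f}%")
```

A simple ratio like this corresponds to an output-stage indicator; it says nothing about the outcome stage, which is the gap the study's long-term indicators are meant to address.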
- Research Article
- 10.1016/j.amjcard.2025.12.022
- Jan 1, 2026
- The American journal of cardiology
Impact of Comprehensive STEMI Protocol on Process Metrics and Clinical Outcomes in STEMI Patients With Nonsystem Delay.
- Abstract
2
- 10.1016/j.hrtlng.2014.06.021
- Jul 1, 2014
- Heart & Lung - The Journal of Acute and Critical Care
Safe Passage: a Nurse-Led Multidisciplinary Team Approach to Improving Transitions and Reducing Readmissions for Heart Failure Patients