LegisSearch: navigating legislation with graphs and large language models
Abstract Navigating and retrieving relevant excerpts of legislation is challenging, requiring time and effort, especially to fine-tune appropriate input search queries. Furthermore, the continuously growing, heterogeneous body of laws, combined with the deep interconnection among normative acts, adds a layer of complexity: some relevant rules may be hidden in articles that are reachable only through chains of citations and references from those directly matching the input query. Traditional search systems, based on keywords or on more sophisticated approaches such as BM25 or TF-IDF, do not support such flexible exploration and are ineffective at handling contextual information. To address these challenges, recent research proposed using graph data models for legislative knowledge management, introducing a straightforward approach to handling network complexity. These works adopted the Property Graph data structure, demonstrating how it provides semantics and navigation power, supports advanced querying tools for legislative acts, and implemented it on the Italian legislation. In this paper, we build on recent results on legislative knowledge management with graphs by proposing LegisSearch, an effective navigation system that combines the graph data model with pre-trained Large Language Models and universal text embeddings to let users conduct powerful searches within a legislative system. We implement LegisSearch on the Italian graph of national laws and test its performance across multiple domains by comparing its search results with those provided, in specific thematic areas, by Italian ministries on their official websites, demonstrating its superior retrieval performance over traditional search systems and measuring the contribution of each component.
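The combination of embedding-based ranking and citation-graph expansion that this abstract describes can be sketched as follows. The article identifiers, the hand-made 3-dimensional vectors standing in for universal text embeddings, and the citation edges are all invented toy data; `search` is a deliberately minimal stand-in for the actual LegisSearch pipeline, not its implementation:

```python
import math

# Toy graph of legislative articles: each article has an (invented)
# embedding vector, and citation edges link articles that reference
# one another.
articles = {
    "art1": [0.9, 0.1, 0.0],
    "art2": [0.1, 0.9, 0.0],
    "art3": [0.5, 0.5, 0.1],
}
citations = {"art1": ["art3"], "art2": [], "art3": ["art2"]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_emb, top_k=1, hops=1):
    """Rank articles by embedding similarity to the query, then pull in
    articles reachable through citation edges from the top hits."""
    ranked = sorted(articles, key=lambda a: cosine(query_emb, articles[a]),
                    reverse=True)
    hits = ranked[:top_k]
    frontier = list(hits)
    for _ in range(hops):
        frontier = [c for a in frontier for c in citations.get(a, [])]
        hits.extend(c for c in frontier if c not in hits)
    return hits

print(search([1.0, 0.0, 0.0]))  # top match plus its cited neighbour
```

The graph expansion step is what a keyword or BM25 system lacks: `art3` is surfaced not because it matches the query but because the best-matching article cites it.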
- Research Article
- 10.1016/j.jbi.2023.104486
- Sep 16, 2023
- Journal of Biomedical Informatics
A self-supervised language model selection strategy for biomedical question answering
- Preprint Article
- 10.2196/preprints.68320
- Nov 3, 2024
BACKGROUND Medical question answering (QA) is essential for various medical applications. While small-scale pre-trained language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs) to improve performance, as KBs contain vast amounts of factual knowledge. Large language models (LLMs) contain a vast amount of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. KBs and LLMs can provide external knowledge to enhance small-scale models in medical QA. OBJECTIVE KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. The retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs using the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA. METHODS In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge.
Specifically, KBs are used to extract medical concepts from the input text, while LLMs are tasked with generating statements based on these extracted concepts. In addition, we introduce two strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by these three strategies is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics for performance comparison. Fine-tuned PLMs without knowledge insertion serve as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese). RESULTS RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively. CONCLUSIONS Experimental results demonstrate that the RAG fine-tuning method can improve model performance in medical QA. RAG-mQA achieves greater improvements compared to other knowledge-enhanced methods. CLINICALTRIAL This study does not involve trial registration.
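The knowledge-insertion step described in the METHODS section can be illustrated with a minimal sketch. The knowledge base contents, the `generate` stub (which plays the role of the LLM), and the input template are all invented for this example and are not the paper's actual pipeline:

```python
# Toy "knowledge base" mapping medical concepts to facts (invented data).
KB = {"aspirin": "aspirin is an antiplatelet drug",
      "fever": "fever is an elevated body temperature"}

def extract_concepts(question):
    """KB-based concept extraction: simple substring lookup here."""
    return [c for c in KB if c in question.lower()]

def generate(concepts):
    # A real system would prompt an LLM with the extracted concepts;
    # here we just join the KB facts into one pseudo-generated statement.
    return ". ".join(KB[c] for c in concepts)

def build_input(question):
    """Prepend the constructed knowledge to the question, producing the
    augmented input that a small-scale PLM would be fine-tuned on."""
    knowledge = generate(extract_concepts(question))
    return f"Knowledge: {knowledge} Question: {question}"

print(build_input("Can aspirin reduce fever?"))
```

The same template admits all three construction strategies: swap `generate` for sequential concept concatenation (KB-based) or for a direct LLM call on the raw question (LLM-based).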
- Research Article
- 10.1371/journal.pdig.0000568
- Aug 21, 2024
- PLOS digital health
Large language models (LLMs) have made a significant impact on the field of general artificial intelligence. General-purpose LLMs exhibit strong logic and reasoning skills and general world knowledge but can sometimes generate misleading results when prompted on specific subject areas. LLMs trained with domain-specific knowledge can reduce the generation of misleading information (i.e. hallucinations) and enhance the precision of LLMs in specialized contexts. Training new LLMs on specific corpora, however, can be resource intensive. Here we explored the use of a retrieval-augmented generation (RAG) model which we tested on literature specific to a biomedical research area. OpenAI's GPT-3.5, GPT-4, Microsoft's Prometheus, and a custom RAG model were used to answer 19 questions pertaining to diffuse large B-cell lymphoma (DLBCL) disease biology and treatment. Eight independent reviewers assessed LLM responses based on accuracy, relevance, and readability, rating responses on a 3-point scale for each category. These scores were then used to compare LLM performance. The performance of the LLMs varied across scoring categories. On accuracy and relevance, the RAG model outperformed other models with higher scores on average and the most top scores across questions. GPT-4 was more comparable to the RAG model on relevance than on accuracy. By the same measures, GPT-4 and GPT-3.5 had the highest scores for readability of answers when compared to the other LLMs. GPT-4 and GPT-3.5 also had more answers with hallucinations than the other LLMs, due to non-existent references and inaccurate responses to clinical questions. Our findings suggest that an oncology research-focused RAG model may outperform general-purpose LLMs in accuracy and relevance when answering subject-related questions. This framework can be tailored to Q&A in other subject areas.
Further research will help understand the impact of LLM architectures, RAG methodologies, and prompting techniques in answering questions across different subject areas.
- Book Chapter
- 10.1007/978-981-19-4453-6_2
- Jan 1, 2022
The remarkable progress in Natural Language Processing (NLP) brought about by deep learning, particularly with the recent advent of large pre-trained neural language models, is brought into scrutiny as several studies began to discuss and report potential biases in NLP applications. Bias in NLP is found to originate from latent historical biases encoded by humans into textual data, which get perpetuated or even amplified by NLP algorithms. We present a survey to comprehend bias in large pre-trained language models, analyze the stages at which they occur in these models, and various ways in which these biases could be quantified and mitigated. Considering the wide applicability of textual affective computing-based downstream tasks in real-world systems such as business, healthcare, education, etc., we place special emphasis on investigating bias in the context of affect (emotion), i.e., Affective Bias, in large pre-trained language models. We present a summary of various bias evaluation corpora that aid future research and discuss challenges in the research on bias in pre-trained language models. We believe that our attempt to draw a comprehensive view of bias in pre-trained language models, and especially the exploration of affective bias, will be highly beneficial to researchers interested in this evolving field.
- Research Article
- 10.1103/physrevphyseducres.21.010153
- May 23, 2025
- Physical Review Physics Education Research
[This paper is part of the Focused Collection in Artificial Intelligence Tools in Physics Teaching and Physics Education Research.] We present a study in which a version of a common conservation of mechanical energy introductory physics problem, an object released on an inclined plane, is given to OpenAI’s GPT-4 large language model (LLM). We investigate how different permutations of object, action verb, and property of the incline impact the responses of the LLM. The problem setup and prompting were left purposefully minimal, requiring the LLM to state multiple assumptions to justify the final answer. We specifically studied how different keywords lead the LLM to analyze the system as rolling versus sliding and how this may be different from physics experts and novice learners. We found that domain-specific terminology may impact the LLM differently from students. Even for correct answers, it generally did not state the assumptions required to arrive at that solution, falling short of what would be expected from an expert instructor. When conflicting information was provided, the LLM generally did not indicate that was the case in its responses. Both issues are weaknesses that could be remedied by additional prompting; however, they remain shortcomings in the context of physics teaching. While specific to introductory physics, this study provides insight into how LLMs respond to variations of a problem within a specific topic area and how their strengths and weaknesses may differ from those of humans. Understanding these differences, and tracking them as LLMs change in their capabilities, is crucial for assessing the impact of artificial intelligence on education. Published by the American Physical Society 2025
- Conference Article
- 10.24963/ijcai.2024/1045
- Aug 1, 2024
While traditional search systems have mostly relied, satisfactorily, on lexical sparse retrievers such as BM25, recent advances in neural models and current-day large language models (LLMs) hold promise for practical search applications as well. In this work, we discuss a collaboration between IBM and the National Library of Australia to upgrade an existing search application (referred to as NLA) over terabytes of Australian Web Archive data, serving thousands of daily users. We posit, and demonstrate both empirically and through qualitative user studies, that LLMs and neural models can indeed provide good gains when combined effectively with traditional search. We believe this demonstration shows the unique challenges associated with real-world practical deployments and offers valuable insights into how to effectively upgrade legacy search applications in the era of LLMs.
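One common way to combine a lexical retriever with a neural one, as the abstract advocates, is late fusion of the two score lists. The documents, scores, and the 0.6 weight below are invented for illustration; real systems tune the blend and may use other fusion schemes (e.g. reciprocal rank fusion):

```python
def fuse(bm25_scores, neural_scores, alpha=0.5):
    """Min-max normalise each score list, then blend them linearly.
    alpha weights the neural (embedding) score against BM25."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    b, n = norm(bm25_scores), norm(neural_scores)
    return [alpha * ni + (1 - alpha) * bi for bi, ni in zip(b, n)]

docs = ["d1", "d2", "d3"]
bm25 = [2.0, 8.0, 5.0]      # lexical scores (illustrative)
neural = [0.9, 0.2, 0.7]    # embedding similarities (illustrative)
fused = fuse(bm25, neural, alpha=0.6)
print(sorted(zip(fused, docs), reverse=True))
```

Note how `d3`, which is best under neither signal alone, rises to the top once both are considered; this complementarity is the usual argument for hybrid retrieval.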
- Research Article
- 10.1186/s13326-025-00331-8
- May 23, 2025
- Journal of Biomedical Semantics
Background: Vaccines are crucial for preventing infectious diseases; however, they may also be associated with adverse events (AEs). Conventional analysis of vaccine AEs relies on manual review and assignment of AEs to terms in a terminology or ontology, which is a time-consuming process and constrained in scope. This study explores the potential of using Large Language Models (LLMs) and LLM text embeddings for efficient and comprehensive vaccine AE analysis. Results: We used the Llama-3 LLM to extract AE information from FDA-approved vaccine package inserts for 111 licensed vaccines, including 15 influenza vaccines. Text embeddings were then generated for each vaccine’s AEs using the nomic-embed-text and mxbai-embed-large models. Llama-3 achieved over 80% accuracy in extracting AE text from vaccine package inserts. To further evaluate the performance of text embedding, the vaccines were clustered using two clustering methods: (1) LLM text embedding-based clustering and (2) ontology-based semantic similarity analysis. The ontology-based method mapped AEs to the Human Phenotype Ontology (HPO) and the Ontology of Adverse Events (OAE), with semantic similarity analyzed using Lin’s method. Compared to the semantic similarity analysis, the LLM approach was able to capture more differential AE profiles. Furthermore, LLM-derived text embeddings were used to develop a Lasso logistic regression model to predict whether a vaccine is “Live” or “Non-Live”. The term “Non-Live” refers to all vaccines that do not contain live organisms, including inactivated and mRNA vaccines. A comparative analysis showed that, despite similar clustering patterns, the nomic-embed-text model outperformed the other. It achieved 80.00% sensitivity, 83.06% specificity, and 81.89% accuracy in 10-fold cross-validation.
Many AE patterns, with examples demonstrated, were identified from our analysis of AE LLM embeddings. Conclusion: This study demonstrates the effectiveness of LLMs for automated AE extraction and analysis; LLM text embeddings capture latent information about AEs, enabling more comprehensive knowledge discovery. Our findings suggest that LLMs hold substantial potential for improving vaccine safety and public health research.
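The embedding-based "Live" vs "Non-Live" prediction step can be illustrated with a toy classifier. The paper used a Lasso logistic regression on real LLM embeddings; the sketch below deliberately substitutes a simpler nearest-centroid rule, and every vector and label is invented:

```python
import math

# Hand-made 2-d vectors standing in for each vaccine's AE text embedding,
# paired with its (invented) class label.
train = {
    "live_a": ([0.8, 0.2], "Live"),
    "live_b": ([0.7, 0.3], "Live"),
    "nonlive_a": ([0.1, 0.9], "Non-Live"),
    "nonlive_b": ([0.2, 0.8], "Non-Live"),
}

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Build one centroid per class from the training embeddings.
by_label = {}
for emb, label in train.values():
    by_label.setdefault(label, []).append(emb)
centroids = {lab: mean_vector(vs) for lab, vs in by_label.items()}

def classify(emb):
    """Assign the class whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda lab: cosine(emb, centroids[lab]))

print(classify([0.75, 0.25]))  # closest to the "Live" centroid
```

A Lasso model additionally performs feature selection over embedding dimensions, which matters when the vectors have hundreds of components rather than two.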
- Book Chapter
- 10.1007/978-3-031-11644-5_13
- Jan 1, 2022
We investigate the utility of large pretrained language models (PLMs) for automatic educational assessment question generation. While PLMs have shown increasing promise in a wide range of natural language applications, including question generation, they can generate unreliable and undesirable content. For high-stakes applications such as educational assessments, it is not only critical to ensure that the generated content is of high quality but also that it relates to the specific content being assessed. In this paper, we investigate the impact of various PLM prompting strategies on the quality of generated questions. We design a series of generation scenarios to evaluate various generation strategies and evaluate generated questions via automatic metrics and manual examination. With empirical evaluation, we identify the prompting strategy that is most likely to lead to high-quality generated questions. Finally, we demonstrate the promising educational utility of questions generated using our concluded best generation strategy by presenting them together with human-authored questions to a subject matter expert, who, despite their expertise, could not effectively distinguish between generated and human-authored questions.
- Research Article
- 10.3233/shti240887
- Sep 24, 2024
- Studies in health technology and informatics
Social media offers a rich source of real-time health data, including potential vaccine reactions. However, extracting meaningful insights is challenging due to the noisy nature of social media content. This paper explores using large language models (LLMs) and prompt engineering to detect personal mentions of vaccine reactions. Different prompting strategies were evaluated on two LLM models (GPT-3.5 and GPT-4) using Reddit data focused on shingles (zoster) vaccines. Zero-shot and few-shot learning approaches with both standard and chain-of-thought prompts were compared. The findings demonstrate that GPT-based models with carefully crafted chain-of-thought prompts could identify the relevant social media posts. Few-shot learning helped GPT-4 models identify more of the marginal cases, although less precisely. Comparing LLM-based classification against lightweight supervised pre-trained language models (PLMs) showed that PLMs outperform LLMs. However, a potential benefit emerged in using LLMs to help identify records for training PLMs, especially to eliminate false negatives, and LLMs could serve as classifiers when insufficient data exists to train a PLM.
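The zero-shot versus few-shot chain-of-thought prompting variants compared in this study can be sketched as plain prompt construction. The task instruction, the example posts, and the reasoning strings below are invented placeholders; a real study would craft and tune them carefully:

```python
# Invented task instruction and few-shot examples for illustration only.
TASK = ("Does this post describe a personal reaction to the shingles "
        "vaccine? Answer yes or no.")

EXAMPLES = [
    ("Got my zoster shot, arm was sore for two days.",
     "The author reports their own sore arm after the vaccine -> yes"),
    ("CDC recommends the shingles vaccine for adults over 50.",
     "This is general guidance, not a personal experience -> no"),
]

def zero_shot(post):
    """Standard zero-shot prompt: instruction plus the post, no examples."""
    return f"{TASK}\n\nPost: {post}\nAnswer:"

def few_shot_cot(post):
    """Few-shot chain-of-thought prompt: worked examples with reasoning,
    then a cue asking the model to reason step by step."""
    shots = "\n\n".join(f"Post: {p}\nReasoning: {r}" for p, r in EXAMPLES)
    return (f"{TASK}\n\n{shots}\n\nPost: {post}\n"
            "Reasoning: let's think step by step.")

print(zero_shot("My arm hurt after the shot."))
```

The only difference between the conditions is the prompt string sent to the model, which is what makes such comparisons cheap to run across GPT-3.5 and GPT-4.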
- Research Article
- 10.1016/j.nlp.2024.100062
- Mar 5, 2024
- Natural Language Processing Journal
Understanding latent affective bias in large pre-trained neural language models
- Research Article
- 10.1145/3711857
- Feb 26, 2025
- ACM Transactions on Information Systems
Incorporating explicit personas into dialogue models is critical for generating responses that fulfill specific user needs and preferences, creating a more personalized and engaging interaction. Early works on persona-based dialogue generation directly concatenate the persona descriptions and dialogue history into relatively small pre-trained language models (PLMs) for response generation, which leads to uninformative and inferior results due to the sparse persona information and the limited model generation capabilities. Recently, large language models (LLMs) have shown surprising capabilities in language generation. Prompting LLMs with persona descriptions for role-playing dialogue generation has also achieved promising results. However, deploying LLMs is challenging for practical applications due to their large scale, spurring efforts to distill the generation capabilities into more concise and compact models through teacher-student learning. In this article, we propose an efficient compact Knowledge-grounded Persona-based Dialogue model enhanced by LLM Distillation (KPDD). Specifically, first, we propose to enrich the annotated persona descriptions by integrating external knowledge graphs (KGs) with a mixed encoding network, coupled with a mixture-of-experts (MoE) module, for both informative and diverse response generation. The mixed encoding network contains multiple layers of modality interaction operations, enabling information from each modality to propagate to the other. Second, to fully exploit the generation capabilities of LLMs, we turn to the distillation technique to improve the generation capabilities of our model, facilitated by a natural language inference (NLI)-based filtering mechanism to extract high-quality information from LLMs.
In addition, we employ a curriculum learning strategy to train our model on the high-quality filtered distilled data and progressively on the relatively noisy original data, enhancing its adaptability and performance. Extensive experiments show that KPDD outperforms state-of-the-art baselines in terms of both automatic and human evaluation.
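The NLI-based filtering step can be illustrated as follows: distilled (persona, response) pairs survive only when a judge model scores the response as sufficiently entailed by the persona. The `nli_entailment` stub below fakes that judgment with word overlap so the example runs; the threshold and data are invented, and a real system would call an actual NLI model:

```python
def nli_entailment(premise, hypothesis):
    # Placeholder for a real NLI model's entailment probability: here a
    # crude word-overlap score, used only to make the sketch runnable.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

def filter_distilled(pairs, threshold=0.3):
    """Keep only distilled pairs whose response is judged consistent
    with (entailed by) the persona description."""
    return [(persona, resp) for persona, resp in pairs
            if nli_entailment(persona, resp) >= threshold]

# Invented distilled data: one on-persona response, one off-persona.
distilled = [
    ("I love hiking in the mountains", "I go hiking every weekend"),
    ("I love hiking in the mountains", "My favourite food is pizza"),
]
print(filter_distilled(distilled))  # keeps only the on-persona response
```

Curriculum learning then orders training so the model sees this filtered, cleaner subset first and the noisier original data later.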
- Research Article
- 10.1038/s41598-024-80571-3
- Nov 24, 2024
- Scientific Reports
In human speakers’ daily conversations, what we do not say matters. We not only compute the literal semantics but also go beyond and draw inferences from what we could have said but chose not to. How well is this pragmatic reasoning process represented in pre-trained large language models (LLMs)? In this study, we attempt to address this question through the lens of manner implicature, a pragmatic inference triggered by a violation of Grice’s manner maxim. Manner implicature is a central member of the class of context-sensitive phenomena. The current work investigates to what extent pre-trained LLMs are able to identify and tease apart different shades of meaning in manner implicature. We constructed three metrics to explain LLMs’ behavior: LLMs-surprisals, embedding vectors’ similarities, and natural language prompting. Results showed no striking evidence that LLMs have explainable representations of meaning. First, the LLMs-surprisal findings suggest that some LLMs showed above-chance accuracy in capturing different dimensions of meaning, and they were able to differentiate neutral relations from entailment or implications, but they did not show consistent and robust sensitivities to more nuanced comparisons, such as entailment versus implications and equivalence versus entailment. Second, the similarity findings suggest that the perceived advantage of contextual over static embeddings was minimal, and contextual LLMs did not notably outperform static GloVe embeddings. LLMs and GloVe showed no significant difference, though distinctions between entailment and implication were slightly more observable in LLMs. Third, the prompting findings offered no further supportive evidence of LLMs’ competence in fully representing different shades of meaning. Overall, our study suggests that current dominant pre-training paradigms do not seem to lead to significant competence in manner implicature within our models.
Our investigation sheds light on the design of datasets and benchmark metrics driven by formal and distributional linguistic theories.
- Preprint Article
- 10.48550/arxiv.2305.18324
- May 22, 2023
- arXiv (Cornell University)
A common way to use large pre-trained language models for downstream tasks is to fine-tune them using additional layers. This may not work well if the downstream domain is specialized whereas the large language model has been pre-trained on a generic corpus. In this paper, we discuss the use of regular expression patterns employed as features for domain knowledge during fine-tuning, in addition to domain-specific text. Our experiments on real-scenario production data show that this method of fine-tuning improves downstream text classification tasks compared to fine-tuning only on domain-specific text. We also show that the use of an attention network for fine-tuning improves results compared to simple linear layers.
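One simple way to expose regex matches as features alongside the text, in the spirit of what this abstract describes, is to append named tags for every pattern that fires. The patterns, tag names, and `[FEATURES]` marker below are invented for illustration, not the paper's actual feature scheme:

```python
import re

# Invented domain patterns: each fires a named tag when it matches.
PATTERNS = {
    "TICKET_ID": re.compile(r"\bINC\d{6}\b"),
    "IP_ADDR": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "ERROR_CODE": re.compile(r"\bE\d{3}\b"),
}

def add_regex_features(text):
    """Append the tags of all matched patterns, so the fine-tuned model
    sees both the raw words and the domain signals the regexes encode."""
    tags = [name for name, pat in PATTERNS.items() if pat.search(text)]
    return text + " [FEATURES] " + " ".join(sorted(tags))

print(add_regex_features("INC123456: host 10.0.0.5 returned E503"))
```

The augmented strings then go through ordinary tokenization and fine-tuning; the tags act as high-precision domain hints that a generically pre-trained model would otherwise have to learn from scratch.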
- Conference Article
- 10.1145/3510003.3510203
- May 21, 2022
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
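A first, syntax-level instance of the post-processing this abstract describes is filtering model suggestions so only parseable code survives. Real systems such as the one in the paper go much further (semantic checks against user intent, test execution, learning from feedback); the candidate snippets below are invented:

```python
import ast

def syntactically_valid(snippet):
    """Keep a code suggestion only if it parses as Python."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# Invented candidate completions from a code-generating model.
candidates = [
    "df = df.dropna()",           # valid
    "df = df.dropna(",            # truncated generation
    "for row in df: print(row)",  # valid
]
kept = [c for c in candidates if syntactically_valid(c)]
print(kept)
```

Because the language model offers no guarantees about its output, even this cheap structural gate removes a class of unusable suggestions before any semantic analysis runs.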
- Book Chapter
- 10.3233/faia240787
- Oct 16, 2024
In the current digital era, the rapid spread of misinformation on online platforms presents significant challenges to societal well-being, public trust, and democratic processes, influencing critical decision making and public opinion. To address these challenges, there is a growing need for automated fake news detection mechanisms. Pre-trained large language models (LLMs) have demonstrated exceptional capabilities across various natural language processing (NLP) tasks, prompting exploration into their potential for verifying news claims. Instead of employing LLMs in a non-agentic way, where LLMs generate responses based on direct prompts in a single shot, our work introduces FactAgent, an agentic approach to utilizing LLMs for fake news detection. FactAgent enables LLMs to emulate human expert behavior in verifying news claims without any model training, following a structured workflow. This workflow breaks down the complex task of news veracity checking into multiple sub-steps, where LLMs complete simple tasks using their internal knowledge or external tools. At the final step of the workflow, LLMs integrate all findings throughout the workflow to determine the news claim’s veracity. Compared to manual human verification, FactAgent offers enhanced efficiency. Experimental studies demonstrate the effectiveness of FactAgent in verifying claims without the need for any training process. Moreover, FactAgent provides transparent explanations at each step of the workflow and during final decision-making, offering insights into the reasoning process of fake news detection for end users. FactAgent is highly adaptable, allowing for straightforward updates to the tools that LLMs can leverage within the workflow, as well as updates to the workflow itself using domain knowledge. This adaptability enables FactAgent’s application to news verification across various domains.