Reliable and Content-specific Support for Keyword Selection through AI and Statistics

Tom Strube, Tom Nowak, Mariia Pokotylo, Bernd Kuhlenkötter

doi:10.1515/cdbme-2024-2154

Abstract

Due to the recent popularity and availability of Large Language Models (LLMs), creators of educational materials can extract keywords for use in personalised learning recommendations more efficiently than ever before. However, because LLMs are probabilistic, automating the otherwise labour-intensive keyword extraction carries the risk of biased and non-explainable results. In this research, we present an original framework that enhances keyword selection based on content title and description through a novel, reliability-sensitive keyword selection algorithm. For this, we collected 38 potential keywords (together with their definitions) for five topics on dementia care from previous studies, together with two contents per topic. To assess the new method’s support in extracting keywords, we then prompted 5 human experts and 3 LLMs (using Retrieval Augmented Generation (RAG) for the keyword definitions) to select keywords to include and exclude for each content. Using Krippendorff’s α metric, we were then able to adapt to the agreement present and to reliably select keyword sets for inclusion and exclusion for each content individually. Lastly, we compared these LLM-based keyword sets with those selected by humans to assess the impact of the adaptive keyword selection algorithm. Overall, the results suggest that LLMs generally struggle with the task (66% of extraction attempts either contained hallucinated keywords or did not return any keywords), and topic-wise internal agreement is low (α = 0.59 (0.42) for model 3 (using RAG) on average; α = 0.68 for human raters). As a result, the reliable keyword selection yielded a median set of 6|27 keywords for inclusion|exclusion per topic, with many of those keywords being within the benchmark keyword sets selected by human raters. To conclude, this approach proves effective in adapting to different levels of agreement in extracting keywords.
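The agreement measure named in the abstract can be illustrated with a minimal sketch of Krippendorff's α for nominal labels, computed from a coincidence matrix. This is the standard textbook formulation, not the authors' implementation. The function name `krippendorff_alpha_nominal`, the "include"/"exclude" labels, and the toy ratings are assumptions made for illustration only; they are not data from the study.

```python
import math
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of rated items; each item is a list holding one
    label per rater, with None marking a missing rating.
    (Hypothetical helper, not taken from the paper.)
    """
    # Coincidence matrix: every ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings present.
    coincidences = Counter()
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit rated fewer than twice carries no pairing information
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return math.nan

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return math.nan  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected


# Toy example: two raters judging four keywords for one content item.
toy_ratings = [
    ["include", "include"],
    ["exclude", "exclude"],
    ["include", "exclude"],
    ["exclude", "exclude"],
]
print(round(krippendorff_alpha_nominal(toy_ratings), 3))  # 0.533
```

In this toy case one disagreement among four keywords gives α ≈ 0.53. A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.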
