Abstract

Data augmentation for training supervised models has achieved great results in different areas. With the popularity of Large Language Models (LLMs), a research area has emerged focused on applying LLMs to text data augmentation. This approach is particularly beneficial for low-resource tasks, where labeled data is scarce. Dataset search is an information retrieval task that aims to retrieve relevant datasets based on user queries. However, due to the lack of labeled data tailored explicitly to this task, developing accurate retrieval models is challenging. In this paper, we leverage LLMs to create training examples for retrieval models in the dataset search task. Specifically, we propose a new pipeline that generates synthetic queries from dataset descriptions using LLMs. The resulting query-description pairs, which we treat as soft matches for our task, are used to fine-tune dense retrieval models for re-ranking. We evaluated our pipeline using fine-tuned embedding models for semantic search over dataset search benchmarks (NTCIR and ACORDAR). We fine-tuned these models for the dataset search task using the synthetic data generated by our solution and compared their performance with the original models. The results show that the models tuned on the synthetic data achieve statistically significant improvements over the baselines at different normalized discounted cumulative gain (NDCG) cut-off levels.
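
The sketch below illustrates the two stages the abstract describes: generating a synthetic query for each dataset description with an LLM, and fine-tuning an embedding model on the resulting query-description pairs. It is a minimal, hedged example only; the base model (`all-MiniLM-L6-v2`), the prompt wording, the `generate_synthetic_query` stub, and the in-batch-negatives loss are assumptions for illustration, not the paper's exact choices.

```python
# Minimal sketch of the pipeline: (1) synthesize a query per dataset
# description, (2) fine-tune a dense retrieval model on the pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader


def generate_synthetic_query(description: str) -> str:
    """Placeholder for the LLM call that turns a dataset description into a
    plausible user query (e.g. prompting the model with 'Write a short
    search query a user might issue to find this dataset: <description>').
    The stub below is only so the sketch runs end to end."""
    return "dataset about " + " ".join(description.split()[:6])


# Toy descriptions standing in for a dataset search corpus.
descriptions = [
    "Hourly air quality measurements collected from urban sensor stations.",
    "Annual crop yield statistics for European countries, 2000-2020.",
]

# Build (synthetic query, description) training pairs.
train_examples = [
    InputExample(texts=[generate_synthetic_query(d), d]) for d in descriptions
]

# Fine-tune an off-the-shelf embedding model. MultipleNegativesRankingLoss
# treats each query's own description as the positive and the other
# descriptions in the batch as negatives (an assumed, commonly used setup).
model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)

# The tuned model can then embed real user queries and dataset descriptions
# to re-rank candidates on benchmarks such as NTCIR or ACORDAR.
```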
