Aspects of creating a corporate question-and-answer system using generative pre-trained language models

Aleksei Golikov, Maksim Romanovskii, Dmitrii Akimov, Sergei Trashchenkov

doi:10.25136/2409-8698.2023.12.69353

Abstract

The article describes several ways to use generative pre-trained language models to build a corporate question-and-answer system. A significant limitation of current generative pre-trained language models is the cap on the number of input tokens, which prevents them from working "out of the box" with a large number of documents or with a single large document. To overcome this limitation, the paper considers indexing the documents and then running a search query and generating a response, using two of the most popular open-source solutions at the moment: the Haystack and LlamaIndex frameworks. A comparative analysis was used to evaluate the effectiveness of generative pre-trained language models in corporate question-and-answer systems built with these two frameworks, and the results were evaluated using the EM (exact match) metric. The analysis shows that the Haystack framework with the best settings gives more accurate answers than the LlamaIndex framework when building a corporate question-and-answer system, but requires more tokens on average. The main conclusions of the research are:

1. Hierarchical indexing is currently extremely expensive in terms of the number of tokens used (about 160,000 tokens for hierarchical indexing versus 30,000 tokens on average for sequential indexing), since the response is generated by sequentially processing parent and child nodes.
2. Processing information with the Haystack framework gives somewhat more accurate answers than the LlamaIndex framework (0.7 vs. 0.67 with the best settings).
3. With the Haystack framework, answer accuracy is less sensitive to the number of tokens in a chunk.
4. On average, the Haystack framework is more expensive in terms of the number of tokens (about 4 times) than the LlamaIndex framework.
5. The "create and refine" and "tree summarize" response generation modes of the LlamaIndex framework give approximately the same answer accuracy, but the "tree summarize" mode requires more tokens.
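
The pipeline summarized in the abstract (split documents into chunks that fit the input-token limit, retrieve the chunks relevant to a question, then score the generated answers with EM) can be illustrated with a minimal standard-library sketch. It does not use the Haystack or LlamaIndex APIs. Everything in it is an assumption made for illustration: whitespace-separated words stand in for model tokens, word overlap stands in for the frameworks' retrievers, the answer normalization (lowercasing, dropping punctuation) is one common convention that the abstract does not specify, and all function names are hypothetical.

```python
import re
import string


def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words.

    Whitespace splitting is a stand-in for a real model tokenizer (assumption).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


def _word_set(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(chunks: list[str], query: str, top_k: int = 1) -> list[str]:
    """Return the top_k chunks sharing the most distinct words with the query.

    Toy scoring only; real frameworks use sparse or dense retrievers.
    """
    query_words = _word_set(query)
    ranked = sorted(chunks,
                    key=lambda chunk: len(query_words & _word_set(chunk)),
                    reverse=True)
    return ranked[:top_k]


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed convention)."""
    no_punct = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical three-sentence "corporate document", six words per sentence.
    doc = ("Vacation requests go to HR staff. "
           "Expense reports are due by Friday. "
           "Badges are issued at the desk.")
    chunks = chunk_by_tokens(doc, max_tokens=6)
    print(retrieve(chunks, "When are expense reports due?"))
    print(exact_match(["Friday", "HR"], ["friday", "the front desk"]))
```

In a real system the retrieved chunks would be placed in the model prompt, so both the chunk size and the number of retrieved chunks drive the token cost that the conclusions compare.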

Journal: Litera
Publication Date: Dec 1, 2023
License type: cc-by-nc
