Abstract

Text generation with language models has progressed significantly in natural language processing with the advent of Transformer-based models such as GPT (Generative Pre-trained Transformer). In this study, we evaluate text quality using the BLEU (Bilingual Evaluation Understudy) score for two prominent GPT engines: Davinci-003 and Davinci. As input data, we collected questions and answers related to Python from internet sources. The BLEU score comparison revealed that Davinci-003 achieved a higher score of 0.035, while Davinci attained a score of 0.021. Response times also differed: Davinci demonstrated an average response time of 4.20 seconds, while Davinci-003 exhibited a slightly longer average of 6.59 seconds. The decision of whether to use Davinci-003 or Davinci for chatbot development should therefore be made based on the specific project requirements. If text quality is paramount, Davinci-003 emerges as the superior choice due to its higher BLEU score; if faster response times are of greater importance, Davinci may be the more suitable option. Ultimately, the selection should align with the needs and objectives of the chatbot development project.
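As an illustration of the evaluation described above, the sketch below computes a corpus-level BLEU score with NLTK. The paper does not specify its exact implementation, tokenization, or data, so the library choice, the smoothing method, and the sample reference/candidate answers are all assumptions.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical data: reference answers collected from internet sources
# and candidate answers produced by a GPT engine, tokenized by whitespace.
references = [
    [["you", "can", "use", "len", "to", "get", "the", "length", "of", "a", "list"]],
    [["a", "dict", "stores", "key", "value", "pairs"]],
]
candidates = [
    ["use", "len", "to", "get", "the", "length", "of", "a", "list"],
    ["a", "dictionary", "stores", "key", "value", "pairs"],
]

# Smoothing prevents a zero score when some higher-order n-grams never
# match, which is common for short answers.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, candidates, smoothing_function=smooth)
print(f"corpus BLEU: {score:.3f}")

In a comparison like the one reported, the same reference answers would be scored once against Davinci-003 outputs and once against Davinci outputs, and the two corpus-level scores compared.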
