Abstract

Dialogue systems are embedded in smartphones and Artificial Intelligence (AI) speakers and are widely used through text and speech. One of the challenges in building a human-like dialogue system is the lack of a standard automatic evaluation metric. Existing metrics such as BLEU, METEOR, and ROUGE have been proposed to evaluate dialogue systems, but they are biased and correlate poorly with human judgements of response quality. RUBER, in contrast, not only trains a model of the relatedness between the reply generated by the dialogue system and the given query, but also measures the similarity between the generated reply and the ground truth; it shows higher correlation with human judgements than BLEU and ROUGE. Building on RUBER, we replace its static word embeddings with BERT contextualised word embeddings to obtain a better evaluation metric. Experiments show that our BERT-based metrics correlate more strongly with human judgement than RUBER: the BERT feature-based metric scores 0.31 and 0.26 points higher, and the BERT fine-tuned metric scores 0.39 and 0.36 points higher, than RUBER in Pearson and Spearman correlation with human judgement, respectively.

Keywords: Evaluation metric, Dialogue system, Contextualised embedding, BERT
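As a rough illustration of the referenced part of such a metric (scoring a generated reply against the ground-truth reply with contextualised embeddings), the sketch below computes a cosine similarity between mean-pooled BERT sentence vectors. The model name, mean pooling, and cosine similarity are assumptions for illustration only, not the paper's exact configuration, which also includes a trained unreferenced (query–reply relatedness) component.

```python
# Hypothetical sketch of a BERT-based referenced score: cosine similarity
# between mean-pooled BERT embeddings of the ground-truth and generated replies.
# Model choice and pooling strategy are assumptions, not the paper's setup.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last-layer token embeddings into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def referenced_score(ground_truth: str, generated_reply: str) -> float:
    """Referenced metric: similarity between the ground truth and the generated reply."""
    ref_vec = embed(ground_truth)
    gen_vec = embed(generated_reply)
    return torch.nn.functional.cosine_similarity(ref_vec, gen_vec, dim=0).item()

# Example usage with made-up replies.
print(referenced_score("I love hiking on weekends.", "Hiking at the weekend is great."))
```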
