Abstract

An instruction attack is a malicious attempt to manipulate a chatbot by providing misleading or harmful prompts to achieve unintended outcomes. Detecting instruction attacks is crucial to protecting the integrity and safety of chatbot interactions. In this study, we focus on identifying different types of instruction attacks, which include Goal Hijacking, Prompt Leaking, Reverse Exposure, Role Play Instruction, and Unsafe Instruction Topic. Given the widening threat scope and the lack of research to date in a Thai-language context, we aim to develop an effective defence system. We propose an approach that combines XLM-RoBERTa, a state-of-the-art multilingual language model, with a Bidirectional Gated Recurrent Unit (Bi-GRU). Through rigorous experimentation and comprehensive evaluation, our method achieves an accuracy of 96.52%, a precision of 96.50%, and a recall and F1-score of 96.41%. This research contributes to creating a safer and more trustworthy environment for chatbot-mediated interactions in the Thai language context.
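
The abstract does not give implementation details, but the named architecture can be sketched as an XLM-RoBERTa encoder feeding token embeddings into a Bi-GRU whose final hidden states drive a classification head. The following is a minimal illustrative sketch, not the authors' implementation: the `xlm-roberta-base` checkpoint, the GRU hidden size of 256, and the six output classes (the five attack types plus an assumed benign class) are all assumptions, not details from the paper.

```python
# Illustrative sketch of an XLM-RoBERTa + Bi-GRU classifier.
# Assumptions (not stated in the abstract): xlm-roberta-base encoder,
# a single Bi-GRU layer with hidden size 256, and six classes
# (five attack types plus a benign class).
import torch
import torch.nn as nn
from transformers import XLMRobertaModel, XLMRobertaTokenizer

class XLMRBiGRUClassifier(nn.Module):
    def __init__(self, num_classes: int = 6, gru_hidden: int = 256):
        super().__init__()
        # Pretrained multilingual encoder; outputs 768-dim token embeddings.
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        # Bidirectional GRU over the encoder's token sequence.
        self.gru = nn.GRU(
            input_size=self.encoder.config.hidden_size,
            hidden_size=gru_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # Classifier over the concatenated forward/backward final states.
        self.fc = nn.Linear(2 * gru_hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, 768) contextual token embeddings.
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # h_n: (2, batch, gru_hidden) -- final forward and backward states.
        _, h_n = self.gru(hidden)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.fc(pooled)  # raw logits, one per class

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRBiGRUClassifier()
batch = tokenizer(["สวัสดีครับ"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```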
