With the development of Deep Learning, Natural Language Processing (NLP) applications have reached or even exceeded human-level capabilities in certain tasks. Although NLP applications have shown strong performance, they can still contain bugs, like traditional software, which may lead to serious consequences. Inspired by Lego blocks and syntactic structure analysis, we propose an assembling test generation method for NLP applications and models and implement it in NLPLego. The key idea of NLPLego is to assemble a sentence skeleton and adjuncts in sequence, mimicking the stacking of Lego blocks, to generate multiple grammatically and semantically correct sentences from one seed sentence. The sentences generated by NLPLego have derivation relations and different degrees of variation. These characteristics make it well-suited for integration with metamorphic testing theory, addressing the absence of test oracles in NLP application testing. To validate NLPLego, we conduct experiments on three commonly used NLP tasks (i.e., machine reading comprehension, sentiment analysis, and semantic similarity measurement), focusing on the efficiency of test generation and the quality and effectiveness of the generated tests. We select five advanced NLP models and one popular industrial NLP software system as the tested subjects. Given seed tests from SQuAD 2.0, SST, and QQP, NLPLego successfully detects 1,732, 3,140, and 261,879 incorrect behaviors with around 93.1% precision in the three tasks, respectively. The experimental results show that NLPLego can efficiently generate high-quality tests for multiple NLP tasks and detect erroneous behaviors effectively. In the case study, we analyze the testing results provided by NLPLego to obtain intuitive representations of the different NLP capabilities of the tested subjects. The case study confirms that NLPLego can give developers clear directions for improving NLP models or applications, laying the foundation for enhancing performance.
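The sketch below is a minimal, hypothetical illustration of the assemble-and-check idea described above, not the authors' implementation: it assumes a seed sentence has already been split into a skeleton and adjuncts, stacks the adjuncts one block at a time to produce derived sentences, and uses label consistency across derivations as a simple metamorphic relation standing in for NLPLego's actual oracles.

```python
# Hypothetical sketch of the assemble-and-check idea (not the authors' code).
# A seed sentence is assumed to be pre-split into a skeleton and adjuncts;
# adjuncts are appended one at a time to produce derived test sentences, and
# label consistency across derivations serves as a simple metamorphic oracle.

from typing import Callable, List


def assemble_variants(skeleton: str, adjuncts: List[str]) -> List[str]:
    """Stack adjuncts onto the skeleton one block at a time."""
    variants, current = [skeleton], skeleton
    for adjunct in adjuncts:
        current = f"{current} {adjunct}"
        variants.append(current)
    return variants


def find_inconsistencies(predict: Callable[[str], str],
                         variants: List[str]) -> List[str]:
    """Return derived sentences whose predicted label differs from the seed's."""
    expected = predict(variants[0])
    return [v for v in variants[1:] if predict(v) != expected]


if __name__ == "__main__":
    # Toy sentiment "model" standing in for a real NLP subject under test.
    toy_model = lambda s: "positive" if "love" in s.lower() else "negative"
    seed_skeleton = "I love this movie"
    seed_adjuncts = ["because of the acting", "despite its length"]
    sentences = assemble_variants(seed_skeleton, seed_adjuncts)
    print("Generated tests:", sentences)
    print("Suspicious behaviors:", find_inconsistencies(toy_model, sentences))
```

In this toy setup, any derived sentence that flips the toy model's label would be flagged as a suspicious behavior; the paper's method applies the same derivation-based consistency idea with task-specific metamorphic relations.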