Abstract

Numerous academic research projects and industrial tasks in software engineering require individual requirements as input. Unfortunately, we have observed that several requirements may be packed into one paragraph of a specification document without explicit boundaries. To understand how prevalent this problem is, we performed a preliminary study of the open requirement documents widely used in the academic community over the last 10 years and found that 26% of them exhibit this phenomenon. Several text segmentation approaches have been reported; however, they tend to identify topically coherent units, which may contain more than one requirement. Moreover, they do not take the composition of the semantic units of requirements into consideration. Here we report a two‐phase learning‐based approach named DRIP to segment individual requirements from paragraphs. Specifically, we first propose a Requirement Segmentation Siamese framework, which models the similarity of sentences and their conjunction relations and detects the initial boundaries between individual requirements. We then optimize the boundaries heuristically based on semantic completeness validation of the segments. Experiments with 1132 paragraphs and 6826 sentences show that DRIP outperforms popular unsupervised and supervised text segmentation algorithms both across different documents (with accuracy gains of 57.65%–187.53%) and across paragraphs of different complexity (with average accuracy gains of 54.46%–158.68%). We also show the importance of each component of DRIP to the segmentation.
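To make the two-phase idea concrete, the following is a minimal sketch and not DRIP itself. It assumes a token-overlap (Jaccard) score as a stand-in for the learned Siamese similarity, ignores conjunction relations, and uses a hypothetical completeness rule (a segment counts as complete if it contains a modal verb such as "shall" or "must"). All function names, the threshold, and the modal list are illustrative assumptions.

```python
import re

# Hypothetical completeness cue; the actual validation in DRIP is not specified here.
MODALS = {"shall", "must", "should", "will"}


def tokens(sentence):
    """Return the set of lowercased words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Token-overlap similarity, standing in for a learned Siamese score."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def initial_segments(sentences, threshold=0.2):
    """Phase 1: open a new segment wherever adjacent sentences are dissimilar."""
    segments = []
    for s in sentences:
        if segments and jaccard(segments[-1][-1], s) >= threshold:
            segments[-1].append(s)
        else:
            segments.append([s])
    return segments


def is_complete(segment):
    """Toy completeness check: some sentence in the segment has a modal verb."""
    return any(tokens(s) & MODALS for s in segment)


def refine(segments):
    """Phase 2: merge each incomplete segment into the segment before it."""
    refined = []
    for seg in segments:
        if refined and not is_complete(seg):
            refined[-1] = refined[-1] + seg
        else:
            refined.append(list(seg))
    return refined


def segment_paragraph(sentences, threshold=0.2):
    """Run both phases and return a list of segments (lists of sentences)."""
    return refine(initial_segments(sentences, threshold))
```

For a paragraph such as "The system shall log every login attempt. Each log entry includes a timestamp. The user must be able to reset the password.", phase 1 of this sketch splits all three sentences apart, and phase 2 merges the second sentence back into the first because it lacks a modal verb, yielding two requirements.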