Abstract

Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by the fact that duplicated information is frequently near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches for dealing with duplicates in software documentation. However, most of them rely on software clone detection techniques, which makes efficient near duplicate detection difficult: algorithms designed for source code ignore document structure and produce many false positives. In this paper, we present an algorithm for detecting near duplicates in software documentation using a natural language processing technique, the N-gram model. The algorithm has a considerable limitation: it only detects single sentences as near duplicates. However, it is very simple and may be easily improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows 39% false positives among the automatically detected duplicates. The algorithm demonstrates reasonable performance: documents of 0.8-3 Mb are processed in 5-22 min.

Highlights

  • Software projects produce a lot of textual information, and analysis of this data is a truly significant task for practice [1]

  • In our previous studies [11],[12],[13] we have presented a near duplicate detection approach which is based on software clone detection

  • In this paper we suggest a near duplicate detection algorithm based on the N-gram model [1]


Summary

Introduction

Software projects produce a lot of textual information, and analysis of this data is a truly significant task for practice [1]. There are a number of approaches that use software clone detection techniques in software documentation research [4],[5],[6]. These approaches operate only with exact duplicates. In our previous studies [11],[12],[13] we presented a near duplicate detection approach based on software clone detection. We adapted the clone detection tool Clone Miner [14] to detect exact duplicates in documents; near duplicates were then extracted as combinations of exact duplicates. This approach produces a lot of false positives because it cannot control exact duplicate detection and therefore works with poor-quality “bricks” when combining them into near duplicates. The algorithm was evaluated on documentation of five industrial projects.
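To make the idea of sentence-level near duplicate detection with N-grams more concrete, here is a minimal sketch. It is not the algorithm from the paper: the paper uses NLTK, while this sketch uses only the Python standard library, and the sentence splitter, the use of word trigrams, the Jaccard similarity measure, the 0.5 threshold, and all function names are assumptions chosen for illustration.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive sentence splitter (assumption: '.', '!' or '?' ends a sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_ngrams(sentence, n=3):
    """Return the set of lowercased word n-grams of a sentence."""
    words = re.findall(r"\w+", sentence.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_sentences(text, n=3, threshold=0.5):
    """Return (sentence_a, sentence_b, score) for every sentence pair whose
    n-gram sets overlap at least `threshold` (hypothetical cut-off)."""
    sentences = split_sentences(text)
    grams = [word_ngrams(s, n) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        score = jaccard(grams[i], grams[j])
        if score >= threshold:
            pairs.append((sentences[i], sentences[j], score))
    return pairs


if __name__ == "__main__":
    doc = (
        "The function returns the size of the buffer in bytes. "
        "Call init before use. "
        "The function returns the size of the buffer in kilobytes."
    )
    for a, b, score in near_duplicate_sentences(doc):
        print(f"{score:.2f}: {a!r} ~ {b!r}")
```

Comparing every pair of sentences is quadratic in the number of sentences, which is acceptable for a sketch but would likely need indexing of N-grams for documents of the size mentioned in the abstract.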

Related Work
Background
The Algorithm
Evaluation
Conclusion
