Discovering Near Duplicate Text in Software Documentation

L D Kanteev, Dmitrij Koznov, Dmitry Luciv, Yu O Kostyukov, M N Smirnov

doi:10.15514/ispras-2017-29(4)-21

Abstract

Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by the fact that duplicated information is frequently a near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches to dealing with duplicates in software documentation, but most of them rely on software clone detection techniques, which makes efficient near-duplicate detection difficult: source code algorithms ignore document structure and produce many false positives. In this paper, we present an algorithm for detecting near duplicates in software documentation using a natural language processing technique known as the N-gram model. The algorithm has a considerable limitation: it detects only single sentences as near duplicates. However, it is very simple and can easily be improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows that 39% of the automatically detected duplicates are false positives. The algorithm demonstrates reasonable performance: documents of 0.8-3 Mb are processed in 5-22 min.
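The idea of comparing single sentences by their N-grams can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the value of N, the similarity measure, or the threshold, so the word trigrams, the Jaccard score, the `threshold` parameter, and all function names below are assumptions chosen for illustration. The sketch uses only the Python standard library rather than NLTK, with a naive regular-expression sentence splitter.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_ngrams(sentence, n=3):
    """Return the set of word n-grams of a sentence (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", sentence.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_sentences(text, n=3, threshold=0.5):
    """Yield (i, j, score) for sentence pairs whose n-gram sets overlap enough.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    sentences = split_sentences(text)
    grams = [word_ngrams(s, n) for s in sentences]
    # Compare every pair of sentences; this is quadratic in the sentence count.
    for i, j in combinations(range(len(sentences)), 2):
        score = jaccard(grams[i], grams[j])
        if score >= threshold:
            yield i, j, score
```

Calling `near_duplicate_sentences(text, n=3, threshold=0.3)` on a document yields index pairs of sentences that share at least 30% of their trigrams. Lowering the threshold finds more candidate pairs at the cost of more false positives, which is the trade-off that the manual analysis in the abstract measures.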

Highlights

Software projects produce a lot of textual information, and analyzing this data is a significant practical task [1].
In our previous studies [11],[12],[13], we presented a near-duplicate detection approach based on software clone detection.
In this paper, we propose a near-duplicate detection algorithm based on the N-gram model [1].

Summary

Introduction

Software projects produce a lot of textual information, and analyzing this data is a significant practical task [1]. There are a number of approaches that apply software clone detection in software documentation research [4],[5],[6]. These approaches operate only with exact duplicates. In our previous studies [11],[12],[13], we presented a near-duplicate detection approach based on software clone detection. We adapted the clone detection tool Clone Miner [14] to detect exact duplicates in documents; near duplicates were then extracted as combinations of exact duplicates. This approach produces many false positives because it cannot control exact-duplicate detection and therefore works with low-quality "bricks" when assembling near duplicates. The algorithm was evaluated on the documentation of five industrial projects.

Related Work

Background

The Algorithm

Evaluation

Conclusion

Journal: Proceedings of the Institute for System Programming of the RAS
Publication Date: Jan 1, 2017
License type: cc-by
