Abstract

Detecting similarity between two source code bases, or within a single code base, has many applications, including plagiarism detection and identifying reused code that is a candidate for refactoring. In this paper, several state-of-the-art techniques are investigated on source code strings: Levenshtein distance, cosine similarity, Hamming distance, ASCII-based hashing, and Rabin–Karp rolling hashing. The paper extends previously published research work. Experiments show that Rabin–Karp hashing performs better than the other techniques in terms of running time, accuracy, and the types of clones detected. All of the techniques face the problem that similarity search time increases linearly with database size, but Rabin–Karp hashing handled this growth efficiently. Moreover, the Rabin–Karp rolling hash method reported the fewest false positives, and it can also search for multiple patterns at a time.
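To illustrate the rolling-hash idea and the multiple-pattern property mentioned above, here is a minimal Python sketch. It is not the authors' implementation: the function name rabin_karp_search, the base and mod values, and the restriction to patterns of equal length are all assumptions made for this example.

```python
def rabin_karp_search(text, patterns, base=256, mod=1_000_000_007):
    """Return {pattern: [start indices]} for equal-length patterns in text.

    Illustrative sketch only; base, mod and the equal-length restriction
    are assumptions, not details taken from the paper.
    """
    if not patterns:
        return {}
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch assumes all patterns have the same length")
    matches = {p: [] for p in patterns}
    if m == 0 or m > len(text):
        return matches

    def full_hash(s):
        # Polynomial hash of s, computed from scratch.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Group patterns by hash so one window hash is checked against all of them.
    by_hash = {}
    for p in matches:
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = full_hash(text[:m])
    for i in range(len(text) - m + 1):
        for p in by_hash.get(h, ()):
            if text[i:i + m] == p:  # verify to rule out hash collisions
                matches[p].append(i)
        if i + m < len(text):
            # Roll the window: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

Each window hash is updated in constant time from the previous one, and the explicit string comparison after a hash hit is what keeps hash collisions from being reported as matches.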
