Abstract

Detecting duplicate bug reports can reduce the time spent handling field crashes. This is especially important for software companies with a large client base, where multiple customers can submit bug reports caused by the same fault. Several techniques exist for detecting duplicate bug reports; many of them apply classification techniques to information extracted from stack traces, classifying each report using the functions invoked in the stack trace associated with it. The problem is that typical bug repositories may have stack traces that contain tens of thousands of functions, which leads to the curse of dimensionality. In this paper, we propose a feature extraction technique that reduces the feature size yet retains the information most critical for classification. The proposed approach starts by abstracting stack traces of function calls into sequences of package names, replacing each function with the package in which it is defined. We then segment these traces into multiple N-grams of variable length and map them to fixed-size sparse feature vectors, which are used to measure the distance between the stack trace of an incoming bug report and the stack traces of a historical set of bug reports. A linear combination of stack trace similarity and non-textual fields, such as component and severity, is then used to measure the distance between a bug report and a historical set of bug reports. We show the effectiveness of our approach by applying it to the Eclipse bug repository, which contains tens of thousands of bug reports. Our approach outperforms the approach that uses distinct function names, while significantly reducing the processing time.
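
The pipeline described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. It assumes that frames are written as `package.Class.method`, that N-grams of length 1 to 3 are used, that the fixed-size vector is obtained by hashing each N-gram into one of 2^16 buckets, that trace similarity is cosine similarity, and that the weights 0.8, 0.1, and 0.1 combine the fields. All function names and parameter values are hypothetical.

```python
import hashlib
import math
from collections import Counter


def to_package(frame):
    """Abstract a frame to its package, assuming 'package.Class.method' form."""
    parts = frame.split(".")
    return ".".join(parts[:-2]) if len(parts) > 2 else frame


def ngrams(seq, max_n):
    """Yield every contiguous N-gram of length 1..max_n as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            yield tuple(seq[i:i + n])


def bucket(gram, dim):
    """Map an N-gram to a stable index in [0, dim) (hashing is an assumption)."""
    digest = hashlib.md5("|".join(gram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dim


def featurize(frames, max_n=3, dim=2 ** 16):
    """Build a sparse count vector (bucket index -> count) for one stack trace."""
    packages = [to_package(f) for f in frames]
    return Counter(bucket(g, dim) for g in ngrams(packages, max_n))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def report_similarity(a, b, w_trace=0.8, w_component=0.1, w_severity=0.1):
    """Linear combination of trace similarity and categorical field matches.

    The weights are illustrative placeholders, not values from the paper.
    """
    trace_sim = cosine(featurize(a["trace"]), featurize(b["trace"]))
    return (w_trace * trace_sim
            + w_component * float(a["component"] == b["component"])
            + w_severity * float(a["severity"] == b["severity"]))
```

An incoming report would be scored against each historical report with `report_similarity`, and the highest-scoring candidates would be proposed as potential duplicates. Because the vector size is fixed by `dim` rather than by the number of distinct functions, the feature space does not grow with the repository.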
