Abstract

Text similarity measures play an important role in many text mining applications. Although there is an extensive literature on measuring the similarity between long texts, far less work addresses the similarity between short texts, and most of that work adapts long-text similarity methods. The description of a trouble ticket, however, is itself a short text. As a result, ticket mining applications such as ticket classification, ticket clustering, and ticket resolution recommendation often perform poorly because of the particular characteristics of tickets: unstructured, short free text with a large vocabulary, large volume, words not found in an English dictionary, and so on. The ability to accurately measure the similarity between two tickets is therefore critical to the performance of ticket mining. To address this performance issue, this paper proposes a multi-view similarity measure framework that readily integrates several kinds of existing similarity measures, including surface-matching-based measures, semantic similarity measures, and syntax-based measures. Further, to make full use of the strengths of the different similarity measures, our framework adopts four different policies to combine them. In particular, we consider a machine-learning-based policy that can integrate various similarity measures in a more general way, which makes our framework flexible and extensible. To demonstrate the effectiveness of the measures generated by our framework, we empirically validate them on a publicly available short-text data set and apply them to a real-world ticket data set from a large enterprise IT infrastructure. The analysis of the results yields several findings that can help further improve performance.
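
As a rough illustration of the multi-view idea, the sketch below scores a pair of short texts with two surface-matching views and then merges the per-view scores under a chosen policy. Everything in it is an assumption made for illustration: the abstract does not name the four combination policies or the individual measures, so the views (`jaccard`, `char_bigram_dice`), the policy names (`max`, `min`, `mean`, `weighted`), and the function `combine` are hypothetical. The fixed weights stand in for what a machine-learning-based policy might instead learn from labeled ticket pairs.

```python
from typing import Callable, Dict, Optional, Set


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap (an assumed surface-matching view)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams (a second assumed view)."""
    def bigrams(s: str) -> Set[str]:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


# Registry of views; semantic or syntax-based measures could be added here.
VIEWS: Dict[str, Callable[[str, str], float]] = {
    "jaccard": jaccard,
    "char_bigram_dice": char_bigram_dice,
}


def view_scores(a: str, b: str) -> Dict[str, float]:
    """Score one text pair under every registered view."""
    return {name: measure(a, b) for name, measure in VIEWS.items()}


def combine(scores: Dict[str, float], policy: str = "mean",
            weights: Optional[Dict[str, float]] = None) -> float:
    """Merge per-view scores under a hypothetical combination policy."""
    values = list(scores.values())
    if policy == "max":
        return max(values)
    if policy == "min":
        return min(values)
    if policy == "mean":
        return sum(values) / len(values)
    if policy == "weighted":
        # Fixed weights here; a learned policy would fit them from data.
        if weights is None:
            raise ValueError("the 'weighted' policy needs a weights mapping")
        total = sum(weights[name] for name in scores)
        return sum(weights[name] * s for name, s in scores.items()) / total
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    pair = ("disk full on db server", "db server disk is full")
    scores = view_scores(*pair)
    for policy in ("max", "min", "mean"):
        print(policy, round(combine(scores, policy), 3))
```

Keeping each measure behind the same two-argument signature is what would let further views be added without touching the combination step, which is one plausible reading of the flexibility the abstract claims.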
