Many crime reports are available online in various blogs and Newswire. Though manual annotation of these massive reports is quite tedious for crime data analysis, it gives an overall crime scenario of all over the world. This motivates us to propose a framework for crime data analysis based on the online reports. Initially, the method extracts the crime reports and identifies named entities. The intermediate sequence of context words between every consecutive pair of named entities is termed as a crime vector that provides relationships between the entities. The feature vectors for each entity pair are generated from these crime vectors using the Word2Vec model. The paper considers three different types of named entity pairs to facilitate the major crime data analysis task, and for each type, similarity between every pair of entities is measured using respective feature vectors. For each type of named entity pair, a separate weighted graph is generated with entity pairs as vertices and similarity score between them as the weight of the corresponding edge. Then, Infomap, a graph-based clustering algorithm, is applied to obtain optimal set of clusters of entity pairs and a representative entity pair of each cluster. Each cluster is labelled by the relationship, represented by the crime vector, of its representative entity pair. In reality, all the entity pairs in a cluster may not reflect contextual similarity with their representative entity pair. So the clusters are further partitioned into subclusters based on WordNet-based path similarity measure which makes the entity pairs in each subcluster more contextually similar compared to their original cluster. These subclusters provide us various statistical crime information over the time period. The method is experimented only using the crime reports related to crime against women in India. The experimental results demonstrate the effectiveness and superiority of the method compared to others for analysing the crime data.
Read full abstract