Building k-nn graphs from large text data

Thibault Debatty, Pietro Michiardi, Olivier Thonnard, Wim Mees

doi:10.1109/bigdata.2014.7004276

Abstract

In this paper we present our new design of NNCTPH, a scalable algorithm to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and then runs NN-Descent, a versatile graph-building algorithm, inside each bucket. We use datasets consisting of the subject lines of spam emails to experimentally test the influence of the algorithm's parameters on the number of computed similarities, on processing time, and on the quality of the final graph. We also compare the algorithm with a sequential and a MapReduce implementation of NN-Descent. For our datasets, the algorithm proved to be up to ten times faster than NN-Descent while producing a graph of the same quality. Moreover, the speedup increased with the size of the dataset, making NNCTPH a sensible choice for very large text datasets.
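
The two-stage idea in the abstract (bin the texts into buckets, then build the neighbour lists inside each bucket only) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_bucket_key` is a hypothetical stand-in that buckets by a lowercase prefix rather than by a modified Context Triggered Piecewise Hash, an exhaustive pairwise comparison replaces NN-Descent inside each bucket, and the similarity measure (`difflib.SequenceMatcher`) and the sample subject lines are assumptions chosen only to keep the example self-contained.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def toy_bucket_key(text, prefix_len=3):
    # Hypothetical stand-in for the hashing step: NOT the paper's modified CTPH.
    # It simply uses the first few lowercase characters as the bucket key.
    return text.lower().strip()[:prefix_len]


def similarity(a, b):
    # Assumed string similarity in [0, 1]; the paper's measure may differ.
    return SequenceMatcher(None, a, b).ratio()


def build_knn_graph(texts, k=2, prefix_len=3):
    # Stage 1: bin the texts into buckets by their key.
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        buckets[toy_bucket_key(text, prefix_len)].append(idx)

    # Stage 2: build neighbour lists inside each bucket only.
    # Brute force is used here in place of NN-Descent.
    graph = {}
    for members in buckets.values():
        for i in members:
            scored = [
                (similarity(texts[i], texts[j]), j) for j in members if j != i
            ]
            # Highest similarity first; ties broken by index for determinism.
            scored.sort(key=lambda pair: (-pair[0], pair[1]))
            graph[i] = [j for _, j in scored[:k]]
    return graph


# Made-up subject lines, for illustration only.
subjects = [
    "cheap watches for you",
    "cheap watches for sale",
    "cheap watches online",
    "win a free cruise",
    "win a free cruise now",
]
graph = build_knn_graph(subjects, k=2)
for node, neighbours in sorted(graph.items()):
    print(node, neighbours)
```

Because similarities are computed only between items that share a bucket, the number of comparisons drops from all pairs in the dataset to all pairs within each bucket. The price is that true neighbours landing in different buckets are missed, which is why the resulting graph is approximate.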

Publication Date: Oct 1, 2014
Citations: 15
License type: other-oa

Similar Papers

A scalable and dynamic self-organizing map for clustering large volumes of text data
Sumith Matharage ... Damminda Alahakoon
01 Aug 2013

Incremental technique with set of frequent word item sets for mining large Indonesian text data
Dian Sa'Adillah Maylawati ... Ali Rahman
01 Aug 2017

Sparse Kernel Clustering of Massive High-Dimensional Data sets with Large Number of Clusters
Radha Chitta ... Anil K Jain
18 Oct 2015

Solving Social Media Text Classification Problems Using Code Fragment-Based XCSR
Muhammad Hassan Arif ... Muhammad Iqbal
01 Nov 2017
