TF-SIDF: Term frequency, sketched inverse document frequency

Gladys Castillo, Manuel Baena-Garcia, Jose M. Carmona-Cejudo, Rafael Morales-Bueno

doi:10.1109/isda.2011.6121796

Abstract

Exact calculation of the TF-IDF weighting function over massive streams of documents imposes challenging memory requirements. In this work, we propose TF-SIDF, a novel solution for extracting relevant words from streams of documents with a high number of terms. TF-SIDF relies on the Count-Min Sketch data structure, which makes it possible to estimate the counts of all the terms in the stream. Results of experiments conducted with two datasets show that this sketch-based algorithm achieves good approximations of the TF-IDF weighting values (as a rule, the top terms with the highest TF-IDF values remain the same), while yielding substantial savings in memory usage. We also observe that performance is highly correlated with the sketch size, and that wider sketch configurations are preferable for a given sketch size.
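
To make the idea concrete, the following is a minimal sketch of how a Count-Min Sketch could stand in for the exact document-frequency table when computing TF-IDF-style weights. It is not the authors' implementation: the class and function names (`CountMinSketch`, `tf_sidf`), the salted BLAKE2b hashing, the `log(N / df)` form of the IDF, and the toy documents are all assumptions chosen for illustration.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    """depth x width table of counters; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, term):
        # One salted hash per row (an illustrative choice, not the paper's).
        digest = hashlib.blake2b(
            term.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, term, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, term)] += count

    def estimate(self, term):
        # Collisions can only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(row, term)]
            for row in range(self.depth)
        )


def tf_sidf(term_counts, sketch, n_docs):
    """Weight each term of one document by TF * log(N / estimated DF)."""
    total = sum(term_counts.values())
    scores = {}
    for term, count in term_counts.items():
        # Clamp to [1, n_docs]: the sketch may overestimate the true DF.
        df = min(n_docs, max(1, sketch.estimate(term)))
        scores[term] = (count / total) * math.log(n_docs / df)
    return scores


# Hypothetical toy stream: each document is a list of terms.
docs = [
    ["sketch", "stream", "count"],
    ["stream", "memory"],
    ["stream", "sketch"],
]

sketch = CountMinSketch(width=1024, depth=4)
for doc in docs:
    for term in set(doc):  # document frequency: one increment per document
        sketch.add(term)

scores = tf_sidf(Counter(docs[0]), sketch, len(docs))
```

Memory use here is fixed at `width * depth` counters regardless of vocabulary size, which is the source of the savings described above. For a fixed total number of counters, a larger `width` lowers the chance that two terms collide within a row, while a larger `depth` adds more independent rows to take the minimum over.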

Full Text