Random indexing of multidimensional data

Fredrik Sandin, Magnus Sahlgren, Blerim Emruli

doi:10.1007/s10115-016-1012-2

Abstract

Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary RI and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of RI is feasible, including comparisons with ordinary RI and principal component analysis. The RI method is well suited for online processing of data streams because relationship weights can be updated incrementally in a fixed-size distributed representation, and inner products can be approximated on the fly at low computational cost. An open source implementation of generalised RI is provided.

Highlights

There is a rapid increase in the annual amount of data that is produced in almost all domains of science, industry, economy, medicine and even everyday life.
Random indexing is a form of random projection with low computational complexity, thanks to the high sparsity of the index vectors and the straightforward distributed coding of information.
Ordinary random indexing (RI) is used in numerous natural language processing applications, where it is useful to approximate data in a compressed representation that can be updated incrementally and online at low computational cost.
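
The highlights above can be illustrated with a minimal sketch of ordinary (one-way) RI. It is not the authors' implementation: the class name RandomIndex, its methods, and the default values dim=1000 and nonzeros=8 are assumptions made for illustration. Each context gets a sparse ternary index vector, each item accumulates weighted index vectors in a fixed-length state vector, and a weight is approximated by projecting the state vector back onto an index vector.

```python
import math
import random


class RandomIndex:
    """Hypothetical one-way RI sketch: one fixed-length state vector per item."""

    def __init__(self, dim=1000, nonzeros=8, seed=0):
        self.dim = dim              # length of every state vector (assumed value)
        self.nonzeros = nonzeros    # nonzero entries per index vector (assumed value)
        self.rng = random.Random(seed)
        self.index = {}             # context -> sparse index vector [(position, sign), ...]
        self.state = {}             # item -> dense state vector of length dim

    def index_vector(self, context):
        """Return the sparse ternary index vector of a context, creating it on first use."""
        if context not in self.index:
            positions = self.rng.sample(range(self.dim), self.nonzeros)
            # Alternate signs so that half of the nonzero entries are +1 and half are -1.
            self.index[context] = [(p, 1 if k % 2 == 0 else -1)
                                   for k, p in enumerate(positions)]
        return self.index[context]

    def update(self, item, context, weight=1.0):
        """Incrementally add a weight; only `nonzeros` state entries are touched."""
        state = self.state.setdefault(item, [0.0] * self.dim)
        for pos, sign in self.index_vector(context):
            state[pos] += weight * sign

    def decode(self, item, context):
        """Approximate the accumulated weight by projecting onto the index vector."""
        state = self.state.get(item)
        if state is None:
            return 0.0
        total = sum(sign * state[pos] for pos, sign in self.index_vector(context))
        return total / self.nonzeros

    def cosine(self, a, b):
        """Cosine similarity of two state vectors; both items must have been updated."""
        u, v = self.state[a], self.state[b]
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0


ri = RandomIndex()
ri.update("cat", "purrs", 2.0)
ri.update("cat", "purrs", 1.0)
print(ri.decode("cat", "purrs"))   # exactly 3.0 here, because only one context was encoded
```

With many contexts encoded in the same state vector, the decoded value is only approximate, because index vectors of different contexts occasionally share nonzero positions. The update cost depends on the number of nonzero entries, not on the number of contexts seen so far.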

Summary

Introduction

There is a rapid increase in the annual amount of data that is produced in almost all domains of science, industry, economy, medicine and even everyday life. The number of relationship weights that need to be stored and updated in such applications can be astronomical, and the analysis prohibitive given the large size of the data representation. This is the motivation for random indexing (RI) [31], which is a random-projection method that solves such problems by incrementally generating distributional representations that approximate similarities in sets of co-occurrence weights. LSA [14] and HAL [36] are two other prominent examples of vector-space models [52] used for semantic analysis of text. In these methods, a co-occurrence matrix is explicitly constructed, and singular value decomposition (SVD) is used to identify the semantic relationships between terms (see [8] for recent examples). We conclude that the possibility to incrementally encode and analyse general co-occurrence relationships at low computational cost using a distributed representation of approximately fixed size makes generalised RI interesting for online processing of data streams.
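
As a rough illustration of what a multidimensional generalisation might look like, the sketch below extends the idea to two-way data. It is an assumption-laden sketch, not the paper's encoding and decoding algorithms, which are not reproduced on this page: the class TwoWayRandomIndex, its method names, and the choice of adding a weight times the outer product of two sparse index vectors to a fixed-size state array are all hypothetical choices made for illustration.

```python
import random


class TwoWayRandomIndex:
    """Hypothetical two-way RI sketch with a single fixed-size state array."""

    def __init__(self, rows=200, cols=200, nonzeros=4, seed=0):
        self.rows, self.cols = rows, cols   # state array shape (assumed values)
        self.nonzeros = nonzeros            # nonzero entries per index vector
        self.rng = random.Random(seed)
        self.row_index = {}                 # first-mode key -> sparse index vector
        self.col_index = {}                 # second-mode key -> sparse index vector
        self.state = [[0.0] * cols for _ in range(rows)]

    def _sparse(self, table, key, dim):
        """Look up or create a sparse ternary index vector [(position, sign), ...]."""
        if key not in table:
            positions = self.rng.sample(range(dim), self.nonzeros)
            table[key] = [(p, 1 if k % 2 == 0 else -1)
                          for k, p in enumerate(positions)]
        return table[key]

    def update(self, i, j, weight=1.0):
        """Add weight * outer(r_i, r_j) to the state; touches nonzeros**2 entries."""
        ri = self._sparse(self.row_index, i, self.rows)
        rj = self._sparse(self.col_index, j, self.cols)
        for p, sp in ri:
            for q, sq in rj:
                self.state[p][q] += weight * sp * sq

    def decode(self, i, j):
        """Approximate the (i, j) weight by projecting the state onto outer(r_i, r_j).

        Note: decoding an unseen key creates its index vector as a side effect.
        """
        ri = self._sparse(self.row_index, i, self.rows)
        rj = self._sparse(self.col_index, j, self.cols)
        total = sum(sp * sq * self.state[p][q] for p, sp in ri for q, sq in rj)
        return total / (self.nonzeros ** 2)


model = TwoWayRandomIndex()
model.update("term", "document", 2.5)
print(model.decode("term", "document"))   # exactly 2.5 here, because only one pair was encoded
```

In this sketch the state array keeps its size no matter how many distinct keys are encoded, which is the property that makes such a representation attractive for streams. The price is that decoded weights become approximate once many pairs share the array.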

Method

Random indexing

Encoding algorithm

Decoding algorithm


Generalised vector semantic analysis

Simulation experiments

Verification and comparison with PCA

Decoding error and comparison with ordinary RI

Effect of dimension reduction

Effect of sparseness of the index vectors

Natural language processing example

Findings

Conclusions

Journal: Knowledge and Information Systems
Publication Date: Dec 7, 2016
Citations: 11
License type: open-access

Similar Papers

Mapping of Fe mineral potential by spatially weighted principal component analysis in the eastern Tianshan mineral district, China
Jie Zhao ... Frits Agterberg
Journal of Geochemical Exploration | VOL. 164
10 Nov 2015

Linear dimension reduction of sequences of medical images: III. Factor analysis in signal space
Flemming Hermansen ... Adriaan A Lammertsma
Physics in Medicine & Biology | VOL. 41
01 Aug 1996

Multi-way principal components-and PLS-analysis
Svante Wold ... Paul Geladi
Journal of Chemometrics | VOL. 1
01 Jan 1987

A Sparse Common Spatial Pattern Algorithm for Brain-Computer Interface
Li-Chen Shi ... Rui-Hua Sun
01 Jan 2010
