Abstract

Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary RI and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of RI is feasible, including comparisons with ordinary RI and principal component analysis. The RI method is well suited for online processing of data streams because relationship weights can be updated incrementally in a fixed-size distributed representation, and inner products can be approximated on the fly at low computational cost. An open source implementation of generalised RI is provided.

Highlights

  • There is a rapid increase in the annual amount of data that is produced in almost all domains of science, industry, economy, medicine and even everyday life

  • Random indexing is a form of random projection with low computational complexity, thanks to the high sparsity of the index vectors and the straightforward distributed coding of information

  • Ordinary random indexing (RI) is used in numerous natural language processing applications, where it is useful to approximate data in a compressed representation that can be updated incrementally and online at low computational cost


Summary

Introduction

There is a rapid increase in the annual amount of data that is produced in almost all domains of science, industry, economy, medicine and even everyday life. The number of relationship weights that need to be stored and updated in applications that process such data can be astronomical, and the analysis can be prohibitive given the large size of the data representation. This is the motivation for random indexing (RI) [31], a random-projection method that addresses such problems by incrementally generating distributional representations that approximate similarities in sets of co-occurrence weights. LSA [14] and HAL [36] are two other prominent examples of vector-space models [52] used for semantic analysis of text. In these methods, a co-occurrence matrix is explicitly constructed, and singular value decomposition (SVD) is used to identify the semantic relationships between terms (see [8] for recent examples). We conclude that the possibility to incrementally encode and analyse general co-occurrence relationships at low computational cost, using a distributed representation of approximately fixed size, makes generalised RI interesting for online processing of data streams.
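To make the incremental encoding idea concrete, the following is a minimal sketch of ordinary (one-dimensional) RI, not the authors' implementation. It assumes sparse ternary index vectors with an even number of nonzero entries (half +1, half -1), additive updates of a fixed-size state vector per item, and decoding by an inner product normalised by the number of nonzeros. The names `make_index_vector`, `RandomIndexer`, `update` and `decode`, and the values `dim=1000` and `nonzeros=8`, are hypothetical choices made for illustration.

```python
import numpy as np


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: nonzeros/2 entries are +1, nonzeros/2 are -1.

    Assumes `nonzeros` is even and much smaller than `dim`.
    """
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=nonzeros, replace=False)
    vec[positions[: nonzeros // 2]] = 1.0
    vec[positions[nonzeros // 2:]] = -1.0
    return vec


class RandomIndexer:
    """Illustrative ordinary RI: one fixed-size state vector per item."""

    def __init__(self, dim=1000, nonzeros=8, seed=0):
        self.dim = dim
        self.nonzeros = nonzeros
        self.rng = np.random.default_rng(seed)
        self.index_vectors = {}  # context -> sparse ternary index vector
        self.states = {}         # item -> accumulated state vector

    def _index_vector(self, context):
        # Index vectors are generated lazily, the first time a context is seen.
        if context not in self.index_vectors:
            self.index_vectors[context] = make_index_vector(
                self.dim, self.nonzeros, self.rng
            )
        return self.index_vectors[context]

    def update(self, item, context, weight=1.0):
        """Incrementally add a co-occurrence weight for (item, context)."""
        state = self.states.setdefault(item, np.zeros(self.dim))
        state += weight * self._index_vector(context)  # in-place update

    def decode(self, item, context):
        """Approximate the accumulated weight via a normalised inner product."""
        return float(self.states[item] @ self._index_vector(context)) / self.nonzeros


ri = RandomIndexer()
for item, context in [("cat", "purr"), ("cat", "purr"), ("cat", "milk"), ("dog", "bark")]:
    ri.update(item, context)
print(ri.decode("cat", "purr"), ri.decode("cat", "milk"))
```

When an item has been updated with a single context only, decoding recovers the accumulated weight exactly. With many contexts, the randomly placed nonzeros of different index vectors occasionally overlap, so the decoded value is an approximation. The storage per item stays fixed at `dim` numbers regardless of how many contexts are encoded, which is what makes the representation attractive for streams.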

Method
Random indexing
Encoding algorithm
Decoding algorithm
Generalised vector semantic analysis
Simulation experiments
Verification and comparison with PCA
Decoding error and comparison with ordinary RI
Effect of dimension reduction
Effect of sparseness of the index vectors
Natural language processing example
Findings
Conclusions
