Abstract

Latent Semantic Analysis (LSA) is a method that automatically indexes and retrieves information from a set of objects by reducing the term-by-document matrix with the Singular Value Decomposition (SVD) technique. However, LSA has a high computational cost when analyzing large amounts of information. The goals of this work are (i) to improve the execution time of the semantic space construction, dimensionality reduction, and information retrieval stages of LSA by using heterogeneous systems and (ii) to evaluate the accuracy and recall of the information retrieval stage. We present a heterogeneous Latent Semantic Analysis (hLSA) system developed on a General-Purpose computing on Graphics Processing Units (GPGPU) architecture, which solves large numeric problems faster through thousands of concurrent threads running on the CUDA cores of GPUs, and on a multi-CPU architecture, which solves large text-processing problems faster through a multiprocessing environment. We execute the hLSA system with documents from the PubMed Central (PMC) database. The results of the experiments show that, for large matrices with one hundred and fifty thousand million values, the hLSA system is around eight times faster than the standard LSA version, with an accuracy of 88% and a recall of 100%.
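
The core numerical bottleneck targeted by the hLSA system is the SVD of a large term-by-document matrix. The following minimal sketch, which assumes CuPy as the GPU back end and NumPy as the CPU fallback, illustrates the kind of heterogeneous dispatch the abstract describes; it is not the authors' implementation.

```python
# Illustrative sketch: offload the SVD of a term-by-document matrix to the
# GPU when CuPy/CUDA is available, falling back to NumPy on the CPU.
# This is not the hLSA implementation described in the paper.
import numpy as np

try:
    import cupy as xp          # GPU path: thousands of concurrent CUDA threads
    on_gpu = True
except ImportError:
    xp = np                    # CPU fallback
    on_gpu = False

def truncated_svd(matrix: np.ndarray, k: int):
    """Return the rank-k factors (U_k, S_k, Vt_k) of the input matrix."""
    a = xp.asarray(matrix, dtype=xp.float32)
    u, s, vt = xp.linalg.svd(a, full_matrices=False)
    u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]
    if on_gpu:
        # Copy the reduced factors back to host memory.
        u_k, s_k, vt_k = (xp.asnumpy(x) for x in (u_k, s_k, vt_k))
    return u_k, s_k, vt_k

# Example: a small random "term-by-document" matrix reduced to k = 50 dimensions.
A = np.random.rand(2_000, 500).astype(np.float32)
U, S, Vt = truncated_svd(A, k=50)
print(U.shape, S.shape, Vt.shape)   # (2000, 50) (50,) (50, 500)
```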

Highlights

  • Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing a term-by-document matrix using term weighting schemes such as Log Entropy or Term Frequency-Inverse Document Frequency (TF-IDF) and using the Singular Value Decomposition (SVD) technique

  • We compare the documents retrieved by the heterogeneous Latent Semantic Analysis (hLSA) system for a text query related to each use case against the relevant documents defined by the experts

  • As shown in the results, the best similarities in the experiments with the hLSA system are found with k = 50 and accuracy = 0.88 for the bipolar disorders use case, with k = 25 and accuracy = 0.56 for the lupus disease use case, and with k = 25 and accuracy = 0.98 for the topiramate weight-loss use case

Introduction

Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing a term-by-document matrix using term weighting schemes such as Log Entropy or Term Frequency-Inverse Document Frequency (TF-IDF) and using the Singular Value Decomposition (SVD) technique. LSA addressed one of the main problems of information retrieval techniques, that is, handling polysemous words, by assuming there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice [1]. LSA takes a considerable amount of time to index and to compute the semantic space when it is applied to large-scale datasets [9,10,11]. If M[w,d] denotes the number of times (frequency) that a word w appears in document d and N is the total number of documents in the dataset,
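
As an illustration of the pipeline sketched above, the snippet below builds a TF-IDF-weighted matrix from a hypothetical toy corpus, reduces it to k latent dimensions with a truncated SVD, and ranks documents against a query by cosine similarity. It uses scikit-learn as an assumed off-the-shelf stand-in for the paper's own stages; the corpus, query, and k = 2 are illustrative only.

```python
# Illustrative LSA retrieval pipeline (not the hLSA system itself):
# TF-IDF weighting -> truncated SVD -> cosine similarity against a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [                       # hypothetical toy corpus
    "bipolar disorder treatment with lithium",
    "topiramate and weight loss in clinical trials",
    "systemic lupus erythematosus diagnosis",
    "lithium carbonate for mood stabilization",
]

# Term weighting: rows are documents, columns are terms (the transpose of the
# term-by-document matrix M[w, d]; the orientation does not change the ranking).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Dimensionality reduction to k latent dimensions (k = 2 here; the paper
# reports k = 25 and k = 50 for its use cases).
svd = TruncatedSVD(n_components=2, random_state=0)
X_k = svd.fit_transform(X)

# Retrieval: project the query into the latent space and rank by cosine similarity.
query = "lithium for bipolar disorder"
q_k = svd.transform(vectorizer.transform([query]))
scores = cosine_similarity(q_k, X_k).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```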
