Abstract
The technique of latent semantic indexing (LSI) has wide applicability in information retrieval and data mining tasks. To date, however, most applications of LSI have addressed relatively small collections of data. This has been due partly to hardware and software limitations and partly to overly pessimistic estimates of the processing requirements of the singular value decomposition (SVD) process. In recent years, advances in hardware capabilities and software implementations have enabled much larger LSI applications. Moreover, experience with large LSI indexes has shown that the SVD is not the limitation on scalability that it was long thought to be. This paper describes techniques applicable to creating large-scale (multi-million document) LSI indexes. Detailed data regarding the LSI index creation process is presented for collections of up to 100 million documents. Four key factors are shown to contribute to the scalability of LSI. First, in most situations, the time required for calculation of the singular value decomposition (SVD) of the term-document matrix is not the dominant factor determining the overall time required to build an LSI index. Second, the time required to calculate the SVD in LSI is linear in the number of objects indexed. Third, incremental index creation greatly facilitates use of LSI in dynamic environments. Fourth, distributed query processing can be employed to support large numbers of users. It is shown that LSI is well-suited for implementation in modern distributed computing environments. This paper provides the first measurements of the execution time for large-scale LSI build processes in a cloud environment.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.