Cross-lingual latent semantic analysis

William Cox, Brandon Pincombe

doi:10.21914/anziamj.v48i0.98

Abstract

Cross-lingual information retrieval is a difficult task, typically involving query translation into multiple languages followed by monolingual retrieval in each language. Latent Semantic Analysis allows cross-lingual retrieval without translating queries by working from an existing corpus of translations. Collecting such a corpus thus obviates the need to construct complicated translation tools, making the technique particularly applicable to querying less commercially appealing languages. First, we extend work on retrieval from an English-French corpus split into training and test sets to examine the effects of training on a completely different corpus. Success is measured by the proportion of direct translations correctly considered most similar by Latent Semantic Analysis. Secondly, an English-only similarity task from the literature is extended to train on a different corpus from the one being tested on. Here the degradation in performance is measured by examining the variation in the correlations between the inter-document similarity judgements calculated by Latent Semantic Analysis and an experimentally derived baseline of human judgements of inter-document similarity. Higher-order indexing schemes that discard uncommon terms, sparse matrix representations and the removal of factors with very low eigenvalues are used to enhance efficiency. Performance degradation from exogenous training is shown in both cases. The best results occur using stopping, log-entropy weighting and over 500 factors.
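
The retrieval scheme summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the function names, the use of NumPy's dense SVD and the choice of k = 3 factors are all assumptions made for illustration. The sketch trains on dual-language documents (each English sentence joined to its French translation), applies log-entropy weighting, keeps the leading factors of the singular value decomposition, folds monolingual test documents into the factor space and compares them by cosine similarity.

```python
import numpy as np

def build_vocab(docs):
    # Map every distinct whitespace-separated term to a row index.
    vocab = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(vocab)}

def count_matrix(docs, vocab):
    # Term-by-document matrix of raw counts; unknown terms are ignored.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            if w in vocab:
                A[vocab[w], j] += 1
    return A

def entropy_weights(A):
    # Global weight g_i = 1 + sum_j p_ij * log(p_ij) / log(n_docs).
    n_docs = A.shape[1]
    row_sums = A.sum(axis=1, keepdims=True)
    p = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return 1.0 + plogp.sum(axis=1) / np.log(n_docs)

def train_lsa(docs, k):
    # Log-entropy weighting followed by a rank-k truncated SVD.
    vocab = build_vocab(docs)
    A = count_matrix(docs, vocab)
    g = entropy_weights(A)
    W = np.log1p(A) * g[:, None]
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return vocab, g, U[:, :k], s[:k]

def fold_in(doc, vocab, g, U_k, s_k):
    # Project an unseen (monolingual) document into the factor space.
    a = count_matrix([doc], vocab)[:, 0]
    return ((np.log1p(a) * g) @ U_k) / s_k

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical dual-language training corpus (English + French per document).
train = [
    "the cat sleeps le chat dort",
    "the dog eats le chien mange",
    "the cat eats le chat mange",
    "the dog sleeps le chien dort",
]
vocab, g, U_k, s_k = train_lsa(train, k=3)

en = fold_in("the cat sleeps", vocab, g, U_k, s_k)
fr_match = fold_in("le chat dort", vocab, g, U_k, s_k)
fr_other = fold_in("le chien mange", vocab, g, U_k, s_k)

sim_match = cosine(en, fr_match)   # English text vs its translation
sim_other = cosine(en, fr_other)   # English text vs an unrelated French text
print(sim_match, sim_other)
```

In this toy setting the English document scores higher against its own translation than against the unrelated French document, which is the per-document criterion behind the proportion-of-correct-translations measure. Terms spread evenly over every training document ("the", "le") receive zero entropy weight, so the weighting here acts much like a stop list. A realistic experiment would use sparse matrices and several hundred factors.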
