Abstract
Today, major commercial search engines are operating in a multinational fashion to provide web search services for millions of users who compose search queries by different languages. Hence, the search engine query log, which serves as the backbone of many search engine applications, records millions of users’ search history in a wide spectrum of human languages and demonstrates a strong multilingual phenomenon. However, with its salience, the multilingual nature of a search engine query log is usually ignored by existing works, which usually consider query log entries of different languages as being orthogonal and independent. This kind of oversimplified assumption heavily distorts the underlying structure of web search data. In this article, we pioneer in recognition of the multilingual nature of a query log and make the first attempt to cross the language barrier in query logs. We propose a novel model named Cross-Lingual Query Log Topic Model (CL-QLTM) to analyze query logs from a cross-lingual perspective and derive the latent topics of web search data. The CL-QLTM comprehensively integrates web search data in different languages by collectively utilizing cross-lingual dictionaries, as well as the co-occurrence relations in the query log. In order to relieve the efficiency bottleneck of applying the CL-QLTM on voluminous query logs, we propose an efficient parameter inference algorithm based on the MapReduce computing paradigm. Both qualitative and quantitative experimental results show that the CL-QLTM is able to effectively derive cross-lingual topics from multilingual query logs and spawn a wide spectrum of new search engine applications.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.