Abstract

Query translation mining is a key technique in cross-language information retrieval and machine translation knowledge acquisition. To improve performance, queries are classified as transliterated or non-transliterated words by a transliterated word identification model and are then channeled to different mining processes. This paper is a pilot study of query classification for better translation mining performance, based on supervised classification and linguistic heuristics. Person name identification reaches a precision of over 97%. Transliterated word translation mining shows satisfactory performance.

Highlights

  • For both cross-language information retrieval and machine translation knowledge acquisition, translation mining for out-of-vocabulary words is an important module that helps translate named entities, organization and location names, book and movie titles, technical terms, and newly coined words that are not in the dictionary. The web is a rich resource for translation mining based on co-occurrence statistics

  • Query translation mining is a key technique in cross-language information retrieval and machine translation knowledge acquisition

  • Result analysis shows that the transliteration character feature is effective for identifying transliterated words and can serve as a basis for that task


Summary

Introduction

For both cross-language information retrieval and machine translation knowledge acquisition, translation mining for out-of-vocabulary words is an important module that helps translate named entities, organization and location names, book and movie titles, technical terms, and newly coined words that are not in the dictionary. Besides co-occurrence statistics, recent research also uses natural language processing techniques such as word alignment. In current research on query translation mining, however, all query terms go through the same mining process, which ignores the difference between transliterated and non-transliterated words. Modeling both kinds of words together leads to a compromised solution. This paper proposes a method for deciding whether a query word is a transliterated word, using unigram-based transliteration statistics plus some heuristic rules. The paper describes the unigram-based transliteration identification model, which is built through a supervised learning process. The last section concludes the paper and describes future work in the field.
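To make the idea of unigram-based transliteration statistics concrete, here is a minimal sketch of one way such a classifier could look. It is not the paper's model: the smoothed log-likelihood ratio, the function names (`train_unigram_model`, `translit_score`, `is_transliterated`), the threshold, and the tiny training lists are all assumptions chosen for illustration, and the heuristic rules mentioned above are not modeled.

```python
import math
from collections import Counter


def train_unigram_model(translit_words, other_words):
    """Count character occurrences in each labeled class (supervised step)."""
    translit_counts = Counter(ch for w in translit_words for ch in w)
    other_counts = Counter(ch for w in other_words for ch in w)
    vocab = set(translit_counts) | set(other_counts)
    return translit_counts, other_counts, vocab


def translit_score(word, model):
    """Average per-character log-likelihood ratio with add-one smoothing.

    Positive values mean the characters look more like those seen in
    transliterated words than in non-transliterated ones.
    """
    if not word:
        raise ValueError("word must be non-empty")
    translit_counts, other_counts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen characters
    t_total = sum(translit_counts.values())
    o_total = sum(other_counts.values())
    score = 0.0
    for ch in word:
        p_t = (translit_counts[ch] + 1) / (t_total + v)
        p_o = (other_counts[ch] + 1) / (o_total + v)
        score += math.log(p_t / p_o)
    return score / len(word)


def is_transliterated(word, model, threshold=0.0):
    """Hypothetical decision rule: threshold on the average score."""
    return translit_score(word, model) > threshold


# Toy training data (illustrative only, not from the paper).
model = train_unigram_model(
    translit_words=["克林顿", "奥巴马", "斯大林", "马克思"],  # person names
    other_words=["机器翻译", "信息检索", "电影名称", "图书馆"],  # ordinary terms
)

print(is_transliterated("马克", model))  # characters seen in transliterated names
print(is_transliterated("翻译", model))  # characters seen in ordinary terms
```

In a real setting the threshold would be tuned on held-out labeled queries, and the training lists would be far larger than this toy sample.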

A Unigram-Based Transliteration Identification Model
Transliteration Features of Chinese Characters
Models and the Algorithm for Transliteration Word Identification
Query Translation Mining from Search Engine Snippets
Transliteration Model
The Experiment Setup and Result Analysis
Experiment on Transliterated Word Identification
Experiment on Translation Mining from Search Engine Snippets
Findings
Conclusions
