Abstract

Bilingual web pages are widely used to mine translations of unknown terms. This study addresses three steps of that task: obtaining relevant web pages, extracting translations with correct lexical boundaries, and ranking the translation candidates. Co-occurrence information was used to obtain subject terms, and the source query was then expanded with the translations of those subject terms to collect more useful bilingual search engine snippets. Valid candidates were then extracted from the small, noisy bilingual corpora using an improved frequency change measurement that incorporates adjacent information. To select an appropriate translation, the method combines surface patterns, frequency–distance, and phonetic features. The experimental results showed that the proposed method performed well at mining translations of unknown terms.

Highlights

  • The rapid development of Web 2.0 and the constant expansion of the network have led to a great increase in the amount of informational resources in multiple languages on the Internet

  • Cross-language information retrieval (CLIR) enables people to retrieve documents written in multiple languages through a single query

  • This study proposed a web-based approach for mining the translations of unknown terms, extending our research team’s earlier work


Summary

Introduction

The rapid development of Web 2.0 and the constant expansion of the network have led to a great increase in the amount of informational resources in multiple languages on the Internet. Cross-language information retrieval (CLIR) enables people to retrieve documents written in multiple languages through a single query. Despite rapid advances in this field, CLIR still has a major bottleneck when a query contains unknown terms that must be translated, which is known as the out-of-vocabulary (OOV) problem. If unknown terms are translated incorrectly, the performance of CLIR and of other systems that depend on the translation is greatly reduced. This research proposes a new web-based approach for mining translations of unknown terms. It uses co-occurrence information to expand the query with cross-lingual terms, so that the bilingual web snippets it retrieves are more relevant.
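The query expansion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `subject_terms` and `expand_query`, the whitespace tokenization, the snippet-level co-occurrence count, and the small bilingual dictionary are all assumptions made for illustration, and the English/Chinese pair is an arbitrary choice.

```python
from collections import Counter


def subject_terms(unknown_term, snippets, stopwords, top_k=2):
    """Return the top_k words that share the most snippets with unknown_term.

    Hypothetical helper: co-occurrence is counted once per snippet, and
    ties are broken alphabetically so the result is deterministic.
    """
    target = unknown_term.lower()
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        if target not in tokens:
            continue  # only snippets that mention the unknown term count
        for word in set(tokens):
            if word != target and word not in stopwords:
                counts[word] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def expand_query(unknown_term, subjects, bilingual_dict):
    """Append known translations of the subject terms to the source query.

    bilingual_dict is an assumed toy lexicon; subject terms without an
    entry are skipped.
    """
    translations = [bilingual_dict[s] for s in subjects if s in bilingual_dict]
    return " ".join([unknown_term] + translations)


if __name__ == "__main__":
    snippets = [
        "ebola virus outbreak in africa",
        "the ebola virus spreads",
        "ebola outbreak news",
        "football news today",
    ]
    subjects = subject_terms("ebola", snippets, stopwords={"the", "in"})
    toy_dict = {"virus": "病毒", "outbreak": "疫情"}
    print(expand_query("ebola", subjects, toy_dict))
```

Submitting the expanded query (source term plus target-language subject terms) to a search engine is what would bias the returned snippets toward pages that mix both languages. The retrieval step itself is left out of this sketch.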

Related Works
Architecture
Collection of Bilingual Snippets
Extraction of Candidate Terms
Procedure FCMAI
Selection of Translations
Frequency–Distance Model
Match Modeling of Surface Patterns
The Transliteration Model
Combination of Features
Experimental Evaluation
Method
Findings
Conclusions
