Combining Different Seed Dictionaries to Extract Lexicon from Comparable Corpus

Ebrahim Ansari

doi:10.17485/ijst/2014/v7i9.19

Abstract

In recent years, many methods for extracting new bilingual lexicons from non-parallel (comparable) corpora have been proposed. Nearly all of them use a small existing dictionary or another resource to build an initial list, called the seed dictionary. In this paper we discuss using different types of dictionaries, and their combinations, as the initial list for producing a bilingual Persian-Italian lexicon from a comparable corpus. Our experiments apply state-of-the-art techniques to four different seed dictionaries: an existing dictionary and three dictionaries created with a pivot-based scheme, each using a different pivot language. We used English, Arabic and French as the pivot languages to extract these three pivot-based dictionaries. An interesting challenge in our approach is finding a way to combine different dictionaries so that they produce a better and more accurate lexicon. To combine the seed dictionaries, we propose two novel combination models and examine their effect on comparable corpora collected from news agencies. The experimental results obtained with our implementation show the effectiveness of the proposed combinations.

Highlights

In the last decade, several methods have been proposed for acquiring bilingual lexicons from non-parallel corpora.
A comparable corpus consists of sets of documents in several languages that deal with a given topic or domain, where the documents have been composed independently of each other in different languages.
Comparable corpora are much easier to build from commonly available documents, such as news article pairs describing the same event in different languages.

Summary

Introduction and Related Works

Several methods have been proposed for acquiring bilingual lexicons from non-parallel (comparable) corpora, and interest in this task is growing. These methods are based on the assumption that there is a correlation between co-occurrence patterns in different languages[1]. Their starting point is a list of bilingual expressions that is used to build the context vectors of all words in both languages. This starting list, or initial dictionary, is called the seed dictionary[2] and is usually provided by an external bilingual dictionary[3,4,5,6]. Some recent methods use small parallel corpora to create their seed list[7], and some use no dictionary in the starting phase[8]. In this work, besides an existing dictionary, three other seed dictionaries are extracted with a pivot-based method, using English, French and Arabic individually as the pivot language.
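The seed-dictionary idea described above can be illustrated with a minimal sketch of the standard context-vector approach. This is not the paper's implementation: the function names, the window size, the cosine similarity measure and the toy romanized Persian and Italian tokens are all assumptions chosen for illustration, and the two combination models are not shown because this summary does not describe them.

```python
from collections import Counter
import math


def context_vector(target, sentences, seed_words, window=2):
    """Count seed words that co-occur with `target` within +/- `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in seed_words:
                    counts[tokens[j]] += 1
    return counts


def translate_vector(vec, seed):
    """Map source-language dimensions to target-language ones via the seed dictionary."""
    out = Counter()
    for word, count in vec.items():
        for translation in seed.get(word, ()):
            out[translation] += count
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(src_word, src_sents, tgt_sents, seed, candidates, window=2):
    """Rank target-language candidates by similarity to the translated source vector."""
    src_vec = translate_vector(context_vector(src_word, src_sents, set(seed), window), seed)
    tgt_seed_words = {t for translations in seed.values() for t in translations}
    scored = [
        (cosine(src_vec, context_vector(c, tgt_sents, tgt_seed_words, window)), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy data (hypothetical): a three-entry seed dictionary and tiny "corpora".
seed = {"khandan": ["leggere"], "nevisandeh": ["scrittore"], "nushidan": ["bere"]}
src_sents = [["nevisandeh", "ketab", "khandan"], ["ketab", "khandan"]]
tgt_sents = [["scrittore", "libro", "leggere"], ["bere", "acqua"], ["leggere", "libro"]]

ranking = rank_candidates("ketab", src_sents, tgt_sents, seed, ["libro", "acqua"])
print(ranking)  # "libro" ranks above "acqua" on this toy data
```

In this sketch the seed dictionary plays two roles: it fixes the dimensions of the context vectors and it translates the source vector into the target language's dimensions. A larger or more accurate seed dictionary therefore changes which co-occurrences are counted at all, which is one way to see why the choice and combination of seed dictionaries can matter.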

Using Pivot Languages to Create Bilingual Lexicon

Using Comparable Corpora

Our Approach

Building Seed Dictionaries

Existed Dictionary–DicEx

Using Seed Dictionaries to Extract Lexicon from Comparable Corpora

The Core System

Using Simple Combination

Using Independent Word Combination

Preparing the Inputs

Seed Dictionaries

Comparable Corpora

Experimental Results

Using Independent Dictionaries

Using Composite Dictionaries

Conclusion and Future Works

Journal: Indian Journal of Science and Technology	Publication Date: Sep 20, 2014
Citations: 6	License type: cc-by
