Wikipedia-based cross-language text classification

Marcos Antonio Mouriño García, Roberto Pérez Rodríguez, Luis Anido Rifón

doi:10.1016/j.ins.2017.04.024

https://doi.org/10.1016/j.ins.2017.04.024

Journal: Information Sciences
Publication Date: Apr 13, 2017
Citations: 10

Affiliation: Universidade de Vigo

Abstract

This paper presents the application of a Wikipedia-based bag of concepts (WikiBoC) document representation to cross-language text classification (CLTC). Its main objective is to alleviate the major drawbacks of state-of-the-art CLTC approaches, which are typically based on machine translation (MT) of documents represented as bags of words (BoW). We propose a technique called cross-language concept matching (CLCM) that converts concept-based representations of documents from one language to another using Wikipedia correspondences between concepts in different languages, and thus does not rely on automated full-text translation. We describe two proposals. The first uses the WikiBoC representation in conjunction with the CLCM technique (WikiBoC-CLCM) to classify documents written in a language L1 with an SVM algorithm trained on documents written in another language L2. The second is a hybrid model for representing documents that combines WikiBoC-CLCM with the classic BoW-MT approach. To evaluate the two proposals, we conducted several experiments with three cross-lingual corpora: the JRC-Acquis corpus and two purpose-built corpora composed of Wikipedia articles. The first proposal outperforms state-of-the-art approaches when training sequences are short, achieving performance increases of up to 233.33%. The second proposal outperforms state-of-the-art approaches over the whole range of training sequences, achieving performance increases of up to 23.78%. The results show the benefits of the WikiBoC-CLCM approach: concepts extracted from documents add useful information to the classifier, thus improving its performance.
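To make the CLCM idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a bag of concepts is a mapping from Wikipedia concept titles to weights, and that cross-language correspondences are available as a plain dictionary. The names `ES_TO_EN`, `clcm`, and `hybrid_features`, as well as the example concepts, are hypothetical. The abstract does not say how unmatched concepts are handled or how the hybrid model combines the two representations; dropping unmatched concepts and merging both feature sets under distinct prefixes are assumptions made here for illustration.

```python
from collections import Counter

# Hypothetical cross-language correspondences (Spanish -> English concept titles).
# A real system would derive these from Wikipedia; the entries are illustrative only.
ES_TO_EN = {
    "Aprendizaje automático": "Machine learning",
    "Traducción automática": "Machine translation",
    "Máquina de vectores de soporte": "Support vector machine",
}


def clcm(bag_of_concepts, links):
    """Map a bag of concepts (concept -> weight) into the target language.

    Assumption: concepts without a correspondence are dropped, and weights
    of source concepts that map to the same target concept are summed.
    """
    mapped = Counter()
    for concept, weight in bag_of_concepts.items():
        target = links.get(concept)
        if target is not None:
            mapped[target] += weight
    return dict(mapped)


def hybrid_features(translated_bow, mapped_boc):
    """Merge BoW-MT and WikiBoC-CLCM features into one feature dictionary.

    Assumption: the two feature sets are simply placed side by side, with
    prefixes ("w:" for words, "c:" for concepts) to keep them distinct.
    """
    features = {f"w:{term}": value for term, value in translated_bow.items()}
    features.update({f"c:{concept}": value for concept, value in mapped_boc.items()})
    return features


# A Spanish document represented as a bag of concepts.
doc_es = {
    "Aprendizaje automático": 2,
    "Traducción automática": 1,
    "Concepto sin enlace": 3,  # no correspondence, so it is dropped
}
doc_en = clcm(doc_es, ES_TO_EN)
```

The resulting dictionaries could then be vectorized and passed to an SVM trained on documents in the target language; no full-text translation is needed for the concept features themselves.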

Similar Papers

기계번역을 이용한 교차언어 문서 범주화의 분류 성능 분석 [English: Classification performance analysis of cross-language text categorization using machine translation]
Yong-Gu Lee
Journal of the Korean Society for Library and Information Science | VOL. 43
30 Mar 2009

Para além da questão: (não) ensinar gramática? [English: Beyond the question: (not) to teach grammar?]
Edair Maria Görski, et al.
Working Papers em Linguística | VOL. 18
13 Jan 2018

Active Learning for Cross Language Text Categorization
Yue Liu ... Weitao Zhou
01 Jan 2012

Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification
Aditya Mogadala ... Achim Rettinger
01 Jan 2015
