Byte-Level Machine Reading Across Morphologically Varied Languages

Tom Kenter, Daniel Hewlett, Llion Jones

doi:10.1609/aaai.v32i1.12050

Abstract

The machine reading task, where a computer reads a document and answers questions about it, is important in artificial intelligence research. Recently, many models have been proposed to address it. Word-level models, which have words as units of input and output, have proven to yield state-of-the-art results when evaluated on English datasets. However, morphologically richer languages have many more unique words than English, due to highly productive prefix and suffix mechanisms. This may hamper word-level models, since they may require vocabularies too large for efficient computation. Multiple alternative input granularities have been proposed to avoid large input vocabularies, such as morphemes, character n-grams, and bytes. Bytes are advantageous as they provide a universal encoding format across languages and allow for a small vocabulary size, which, moreover, is identical for every input language. In this work, we investigate whether bytes are suitable as input units across morphologically varied languages. To test this, we introduce two large-scale machine reading datasets in morphologically rich languages, Turkish and Russian. We implement four byte-level models, representing the major types of machine reading models, and introduce a new seq2seq variant, called encoder-transformer-decoder. We show that, for all languages considered, there are byte-level models that outperform the current state-of-the-art word-level baseline. Moreover, the newly introduced encoder-transformer-decoder performs best on the morphologically most involved dataset, Turkish. The large-scale Turkish and Russian machine reading datasets are released to the public.
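The point about bytes giving a small vocabulary that is identical for every language can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`to_byte_ids`, `from_byte_ids`) and the sample phrases are illustrative assumptions, and UTF-8 is assumed as the byte encoding.

```python
from typing import List

# With byte-level input, the vocabulary is the set of possible byte values,
# so it has 256 entries regardless of the input language.
BYTE_VOCAB_SIZE = 256


def to_byte_ids(text: str) -> List[int]:
    """Map text to its UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: List[int]) -> str:
    """Invert to_byte_ids by decoding the byte values as UTF-8."""
    return bytes(ids).decode("utf-8")


# Illustrative phrases (assumed for this sketch), each meaning roughly
# "from our houses". Turkish expresses it as one word built from suffixes.
samples = {
    "English": "from our houses",
    "Turkish": "evlerimizden",
    "Russian": "из наших домов",
}

for language, text in samples.items():
    ids = to_byte_ids(text)
    # Every id fits the same 256-entry vocabulary, and the mapping is lossless.
    assert all(0 <= i < BYTE_VOCAB_SIZE for i in ids)
    assert from_byte_ids(ids) == text
    print(f"{language}: {len(text.split())} word(s), "
          f"{len(text)} characters, {len(ids)} bytes")
```

The trade-off visible here is sequence length: the vocabulary stays fixed, but non-ASCII scripts such as Cyrillic take more than one byte per character in UTF-8, so a byte-level model reads longer input sequences than a word-level one.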

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Apr 26, 2018
Citations: 12

Similar Papers

Making preferences more active
Yorick Wilks
Artificial Intelligence | VOL. 11
01 Dec 1978

Making Preferences More Active
Yorick Wilks
16 Jun 2003

Making Preferences More Active
Yorick Wilks
Associative Networks
01 Jan 1979

Experiments on Character and Word Level Features for Text Classification Using Deep Neural Network
Muhammad Gumilang ... Ayu Purwarianti
01 Oct 2018
