Abstract

Deep Neural Networks (DNNs) outperform previous approaches in many fields, such as natural language processing. Neural Machine Translation (NMT) likewise outperforms Statistical Machine Translation (SMT), which relies on complex features and rules. However, NMT requires a large corpus and a long training time. To reduce computational cost, recent research has replaced low-frequency words with symbols. However, these symbols make sentences ambiguous and degrade translation accuracy. To solve this problem, sub-word units such as Byte Pair Encoding (BPE) and the Wordpiece Model (WPM), which create vocabularies of a prespecified size, have been proposed. Nevertheless, these tokenization methods break words apart and treat the resulting fragments as symbols. Such symbols are well suited to neural networks, and NMT performance has improved as a result. This suggests that linguistic correctness is not necessarily important in NMT. If that is the case, we ask to what extent linguistic correctness contributes to NMT accuracy. In this research, we experiment with incorporating linguistic information into sub-word units. Experimentally, we demonstrate that morphemes, as a form of linguistic information, are a helpful factor for sub-word units.
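To make the BPE idea concrete, below is a minimal sketch of the merge loop in Python, closely following the well-known toy example from Sennrich et al. (2016), the paper that introduced BPE for NMT. The corpus, the end-of-word marker `</w>`, and `num_merges` are illustrative; in practice the number of merges is set so that the final vocabulary reaches the prespecified size mentioned in the abstract.

```python
import re
import collections

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    # Match the bigram only at symbol boundaries (not inside other symbols).
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated character sequence with an
# end-of-word marker </w>; counts are word frequencies (illustrative values).
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # illustrative; normally derived from the target vocabulary size
for _ in range(num_merges):
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # greedily merge the most frequent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

Each iteration merges the most frequent adjacent pair (e.g. `('e', 's')`, then `('es', 't')`), so frequent words survive intact while rare words are split into smaller, reusable sub-word units. Note that these merges are driven purely by frequency, with no regard for morpheme boundaries, which is exactly the gap this paper's morpheme-informed sub-word units aim to address.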
