Abstract
Short texts (STs) appear in a variety of scenarios, including queries, dialogs, and entity names. Most existing studies in neural machine translation (NMT) focus on open problems concerning long sentences rather than short ones. The underlying intuition is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this paper, we first dispel this assumption through preliminary experiments, showing that the conventional state-of-the-art NMT approach, i.e. TRANSFORMER, still suffers severely from over-translation and mistranslation errors on STs. After empirically investigating the underlying causes, we summarize two challenges in NMT for STs, each associated with one of the error types above: 1) the imbalanced length distribution in the training set worsens the model's inference calibration on STs, leading to more over-translation cases on STs; 2) the lack of contextual information gives NMT higher data uncertainty on short sentences, so the NMT model produces a considerable number of mistranslation errors. Some existing approaches, such as balancing the data distribution for training (e.g. data upsampling) and supplementing contextual information (e.g. introducing translation memory), can alleviate these translation issues in NMT for STs. We encourage researchers to investigate other challenges in NMT for STs, thereby reducing ST translation errors and enhancing translation quality.