Automated Text Simplification as a Preprocessing Step for Machine Translation into an Under-resourced Language
In this work, we investigate the possibility of using a fully automatic text simplification system on the English source in machine translation (MT) to improve its translation into an under-resourced language. We use a state-of-the-art automatic text simplification (ATS) system to lexically and syntactically simplify source sentences, which are then translated with two state-of-the-art English-to-Serbian MT systems: a phrase-based MT (PBMT) system and a neural MT (NMT) system. We explore three scenarios for using ATS in MT: (1) using the raw output of the ATS; (2) automatically filtering out the sentences with low grammaticality and meaning preservation scores; and (3) performing a minimal manual correction of the ATS output. Our results show an improvement in the fluency of the translation regardless of the chosen scenario, and differences in the success of the three scenarios depending on the MT approach used (PBMT or NMT) with regard to improving translation fluency and post-editing effort.
- Research Article
48
- 10.1007/s10590-018-9214-x
- Feb 10, 2018
- Machine Translation
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short (phrase-level) and long distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long distance agreement phenomena, for which neural MT was found to be especially effective.
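One way to make the significance claim above concrete is a Pearson chi-squared test on a 2x2 table of error versus non-error tokens for two systems. The sketch below is illustrative only: the choice of test, the function names and all counts are assumptions rather than details taken from the paper. The toy counts are chosen so that the relative reduction matches the 54% figure reported in the abstract.

```python
import math


def chi_square_2x2(errors_a, tokens_a, errors_b, tokens_b):
    """Pearson chi-squared test (1 degree of freedom) on error vs. non-error tokens.

    Assumes every expected cell count is non-zero, i.e. at least one error
    and at least one error-free token exist across the two systems.
    """
    table = [
        [errors_a, tokens_a - errors_a],
        [errors_b, tokens_b - errors_b],
    ]
    total = tokens_a + tokens_b
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-squared tail equals a two-sided normal tail.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def relative_reduction(errors_worst, errors_best):
    """Share of the worst system's errors that the best system avoids."""
    return (errors_worst - errors_best) / errors_worst


if __name__ == "__main__":
    # Hypothetical counts for one error type over 10,000 annotated tokens each.
    stat, p = chi_square_2x2(500, 10_000, 230, 10_000)
    print(f"chi2={stat:.2f}, p={p:.3g}, "
          f"reduction={relative_reduction(500, 230):.0%}")
```

Running one such test per error type and per system pair gives a fine-grained picture of where two systems really differ. With many error types, a multiple-comparison correction would normally be applied as well.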
- Research Article
12
- 10.1007/s10590-021-09266-0
- Jun 1, 2021
- Machine Translation
Examining the general impact of Controlled Language (CL) rules in the context of Machine Translation (MT) has been an area of research for many years. The present study focuses on the following question: how do CL rules impact MT output individually? By analysing a German corpus-based test suite of technical texts that have been translated into English by different MT systems, this study endeavours to answer this question at different levels: the general impact of CL rules (rule- and system-independent), their impact at rule level (system-independent) as well as at rule and system level. The results of five MT systems are analysed and contrasted: a rule-based system, a statistical system, two differently constructed hybrid systems, and a neural system. For this, a mixed-methods triangulation approach that includes error annotation, human evaluation, and automatic evaluation was applied. The data was analysed both qualitatively and quantitatively in terms of CL influence on the following parameters: number and type of MT errors, style and content quality, and scores of two automatic evaluation metrics. In line with many studies, the results show a general positive impact of the applied CL rules on the MT output. However, at rule level, only four rules proved to have positive effects on the aforementioned parameters; three rules had negative effects on the parameters; and two rules did not show any significant impact. At rule and system level, the rules affected the MT systems differently, as expected. Rules that had a positive impact on earlier MT approaches did not show the same impact on the neural MT approach. Furthermore, neural MT delivered distinctly better results than earlier MT approaches, namely the highest error-free, style and content quality rates both before and after applying the rules, which indicates that neural MT offers a promising solution that no longer requires CL rules for improving the MT output.
- Research Article
20
- 10.1007/s10590-018-9219-5
- Apr 24, 2018
- Machine Translation
This work presents an extensive comparison of language-related problems for neural machine translation (NMT) and phrase-based machine translation (PBMT) for German-to-English, English-to-German and English-to-Serbian. The explored issues are related both to the characteristics of the languages as well as to the (machine) translation process and, although related, go beyond typical translation error classes. It is shown that the main advantage of the NMT approach consists of better generating verb forms, avoiding verb omissions, as well as better handling of English noun collocations and negation. It is also shown that the main obstacles for the NMT system are prepositions, translation of English (source) ambiguous words and generating English (target) continuous and perfect tenses. In addition, preliminary experiments show that a number of issues are complementary, i.e., not occurring in the same segments and/or in the same form. This means that a combination or hybridisation of the NMT and PBMT approaches is a promising direction for improving both types of systems.
- Preprint Article
1
- 10.20944/preprints202502.0656.v1
- Feb 10, 2025
- Preprints.org
This study presents a hybrid artificial intelligence model designed to enhance translation quality for low-resource languages, specifically targeting the Hakka language. The proposed model integrates phrase-based machine translation (PBMT) and neural machine translation (NMT) within a recursive learning framework. The methodology consists of three key stages: (1) initial translation using PBMT, where Hakka corpus data is structured into a parallel dataset, (2) NMT training with Transformers, leveraging the generated parallel corpus to train deep learning models, and (3) recursive translation refinement, where iterative translations further enhance model accuracy by expanding the training dataset. The study employs preprocessing techniques to clean and optimize the dataset, reducing noise and improving sentence segmentation. A BLEU score evaluation is conducted to compare the effectiveness of PBMT and NMT across various corpus sizes, demonstrating that while PBMT performs well with limited data, the Transformer-based NMT achieves superior results as training data increases. The findings highlight the advantages of a hybrid approach in overcoming data scarcity challenges for minority languages. This research contributes to machine translation methodologies by proposing a scalable framework for improving linguistic accessibility in under-resourced languages.
- Dissertation
2
- 10.23889/suthesis.57439
- Jul 22, 2021
In general, advances in translation technology tools have enhanced translation quality significantly. Unfortunately, however, it seems that this is not the case for all language pairs. A concern arises when the users of translation tools want to work between different language families such as Arabic and English. The main problems facing Arabic<>English translation tools lie in Arabic’s characteristic free word order, richness of word inflection – including orthographic ambiguity – and optionality of diacritics, in addition to a lack of data resources. The aim of this study is to compare the performance of translation memory (TM) and machine translation (MT) systems in translating between Arabic and English. The research evaluates the two systems based on specific criteria relating to needs and expected results. The first part of the thesis evaluates the performance of a set of well-known TM systems when retrieving a segment of text that includes an Arabic linguistic feature. As it is widely known that TM matching metrics are based solely on the use of edit distance string measurements, it was expected that the aforementioned issues would lead to a low match percentage. The second part of the thesis evaluates multiple MT systems that use the mainstream neural machine translation (NMT) approach to translation quality. Due to a lack of training data resources and its rich morphology, it was anticipated that Arabic features would reduce the translation quality of this corpus-based approach. The systems’ output was evaluated using both automatic evaluation metrics, including BLEU and hLEPOR, and TAUS human quality ranking criteria for adequacy and fluency. The study employed a black-box testing methodology to experimentally examine the TM systems through a test suite instrument and also to translate Arabic<>English sentences to collect the MT systems’ output.
A translation threshold was used to evaluate the fuzzy matches of TM systems, while an online survey was used to collect participants’ responses to the quality of the MT systems’ output. The experiments’ input of both systems was extracted from Arabic<>English corpora, which was examined by means of quantitative data analysis. The results show that, when retrieving translations, the current TM matching metrics are unable to recognise Arabic features and score them appropriately. In terms of automatic translation, MT produced good results for adequacy, especially when translating from Arabic to English, but the systems’ output appeared to need post-editing for fluency. Moreover, when retrieving from Arabic, it was found that short sentences were handled much better by MT than by TM. The findings may be given as recommendations to software developers.
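The effect of edit-distance matching on Arabic features can be sketched with a character-level Levenshtein distance turned into a match percentage. This is a minimal illustration, not the metric of any particular TM product: the normalisation by the longer string and the example word are assumptions made here. The same Arabic word is compared with and without optional diacritics.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(a, b):
    """Match percentage: 100 for identical strings, 0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    # The same three-letter word, fully vocalised vs. undiacritised.
    # Each fatha (U+064E) is a separate code point, so it counts as an edit.
    vocalised = "\u0643\u064e\u062a\u064e\u0628\u064e"
    plain = "\u0643\u062a\u0628"
    print(f"match: {fuzzy_match(vocalised, plain):.1f}%")
```

Because every diacritic is counted as a full edit, two spellings that a reader treats as the same word can fall well below a typical fuzzy-match threshold. This is consistent with the low match percentages the study anticipated.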
- Research Article
- 10.5445/ir/1000104498
- Feb 14, 2020
- Repository KITopen (Karlsruhe Institute of Technology)
Multilingual Neural Translation
- Research Article
89
- 10.3389/fdigh.2018.00009
- May 15, 2018
- Frontiers in Digital Humanities
We conduct the first experiment in the literature in which a novel is translated automatically and then post-edited by professional literary translators. Our case study is Warbreaker, a popular fantasy novel originally written in English, which we translate into Catalan. We translated one chapter of the novel (over 3,700 words, 330 sentences) with two data-driven approaches to Machine Translation (MT): phrase-based statistical MT (PBMT) and neural MT (NMT). Both systems are tailored to novels; they are trained on over 100 million words of fiction. In the post-editing experiment, six professional translators with previous experience in literary translation translate subsets of this chapter under three alternating conditions: from scratch (the norm in the novel translation industry), post-editing PBMT, and post-editing NMT. We record all the keystrokes, the time taken to translate each sentence, as well as the number of pauses and their duration. Based on these measurements, and using mixed-effects models, we study post-editing effort across its three commonly studied dimensions: temporal, technical and cognitive. We observe that both MT approaches result in increases in translation productivity: PBMT by 18%, and NMT by 36%. Post-editing also leads to reductions in the number of keystrokes: by 9% with PBMT, and by 23% with NMT. Finally, regarding cognitive effort, post-editing results in fewer (29 and 42% less with PBMT and NMT, respectively) but longer pauses (14 and 25%).
- Research Article
16
- 10.3390/digital1020007
- Apr 2, 2021
- Digital
Phrase-based statistical machine translation (PB-SMT) has been the dominant paradigm in machine translation (MT) research for more than two decades. Deep neural MT models have been producing state-of-the-art performance across many translation tasks for four to five years. To put it another way, neural MT (NMT) took the place of PB-SMT a few years back and currently represents the state-of-the-art in MT research. Translation to or from under-resourced languages has been historically seen as a challenging task. Despite producing state-of-the-art results in many translation tasks, NMT still poses many problems such as performing poorly for many low-resource language pairs mainly because of its learning task’s data-demanding nature. MT researchers have been trying to address this problem via various techniques, e.g., exploiting source- and/or target-side monolingual data for training, augmenting bilingual training data, and transfer learning. Despite some success, none of the present-day benchmarks have entirely overcome the problem of translation in low-resource scenarios for many languages. In this work, we investigate the performance of PB-SMT and NMT on two rarely tested under-resourced language pairs, English-To-Tamil and Hindi-To-Tamil, taking a specialised data domain into consideration. This paper demonstrates our findings and presents results showing the rankings of our MT systems produced via a social media-based human evaluation scheme.
- Conference Article
13
- 10.18653/v1/w15-4110
- Jan 1, 2015
In this presentation, I would like to introduce the research and products of machine translation in Baidu. As the biggest Chinese search engine, Baidu released its machine translation system in June 2011. It now supports translations among 27 languages on multiple platforms, including PC, mobile devices, etc. A hybrid translation approach is important for building an Internet translation system. As we know, the translation demands on the Internet come from various domains, including news wires, patents, poems, idioms, etc. It is difficult for a single translation system to achieve high accuracy on all domains. Therefore, hybrid translation is practically needed. Generally, we build a statistical machine translation (SMT) system, using the training corpora automatically crawled from the web. For the translation of idioms (e.g. “有志者,事竟成, where there is a will, there is a way”) and hot words/expressions (e.g. “一带一路, One Belt and One Road”), example-based translation methods are used. To improve the translation of dates (e.g. “2012年7月6日, July 6, 2012”), numbers (e.g. “三千五百万, thirty-five million”), etc., rule-based methods are used as a pre-processing step. To improve translation quality for the resource-poor language pairs, we used pivot-based methods. Wu and Wang (2007) proposed the triangulation method that combines the source-pivot and the pivot-target phrase tables to induce a source-target phrase table. To fill up the data gap between the source-pivot and pivot-target corpora, Wu and Wang (2009) employed a hybrid method combining RBMT and SMT systems. We also proposed a method to use a Markov random walk to discover implicit relations between phrases in the source and target languages (Zhu et al., 2013), thus improving the coverage of phrase pairs. We utilized the co-occurrence frequency of source-target phrase pairs to estimate phrase translation probabilities (Zhu et al., 2014).
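The triangulation idea can be illustrated with a small sketch that induces a source-target phrase table by summing over shared pivot phrases, p(t|s) = Σ_p p(t|p) · p(p|s). This is a simplified, assumed formulation for illustration: the nested-dictionary layout, the function name and the toy phrases are hypothetical, and lexical weights and other phrase-table features are left out.

```python
from collections import defaultdict


def triangulate(source_to_pivot, pivot_to_target):
    """Induce p(target | source) by marginalising over pivot phrases.

    Both inputs map a phrase to a dict of {translation: probability}.
    Pivot phrases missing from the pivot-target table contribute nothing.
    """
    induced = defaultdict(lambda: defaultdict(float))
    for src, pivots in source_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            for tgt, p_tgt_given_pivot in pivot_to_target.get(pivot, {}).items():
                induced[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(targets) for src, targets in induced.items()}


if __name__ == "__main__":
    # Toy tables with placeholder phrases (hypothetical values).
    source_to_pivot = {"s1": {"p1": 0.6, "p2": 0.4}}
    pivot_to_target = {"p1": {"t1": 1.0}, "p2": {"t1": 0.5, "t2": 0.5}}
    print(triangulate(source_to_pivot, pivot_to_target))
```

Source phrases whose pivot translations never appear in the pivot-target table receive no entry at all. This is the kind of coverage gap that the hybrid and random-walk methods mentioned above aim to reduce.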
On May 20th this year, we launched a neural machine translation (NMT) system for Chinese-English translation. The system conducts end-to-end translation with a source language encoder and a target language decoder. Both the encoder and decoder are recurrent neural networks. The strength of NMT lies in that it can learn semantic and structural translation information by taking global contexts into account. We further integrated the SMT and NMT systems to improve translation quality. We also released off-line translation packs for the NMT system on mobile devices, providing translation services when the Internet is unavailable. So far as we know, this is the first NMT system supporting off-line translation on mobile devices. We also investigate the problem of learning a machine translation model that can simultaneously translate sentences from one source language to multiple target languages (Dong et al., 2015). Our solution is inspired by the recently proposed neural machine translation model which generalizes machine translation as a sequence learning problem. We train a unified neural machine translation model under the multi-task learning framework where the encoder is shared across different language pairs and each target language has a separate decoder. This model gets faster and better convergence for both resource-rich and resource-poor language pairs under the multi-task learning framework. Based on the above techniques, we have released translation products for multiple platforms, including web translation on PC, APP on mobile devices, as well as free API for third-party developers. Our system now supports translations among 27 languages, not only including many frequently-used foreign languages, but also
- Research Article
9
- 10.1145/3610582
- Jul 25, 2023
- ACM Transactions on Asian and Low-Resource Language Information Processing
Machine Translation has been a field of study for over six decades, but it has acquired substantial prominence in the last decade as processing capacity in personal computers has increased. The purpose of this paper is to discuss the usage of Sanskrit as a source, target, or supporting language in various Machine Translation systems. To investigate Machine Translation, researchers use a variety of strategies, including corpus-based, direct, and rule-based approaches. The primary goal of employing Sanskrit in Machine Translation is to evaluate its appropriateness, lexicon, and performance when proper Machine Translation methods are used. The research examines various modelling strategies for developing a machine translation system, specifically Statistical and Neural Machine Translation, in order to bridge the gap between Sanskrit and its current successor, Hindi. Interpretations are formed in Statistical Machine Translation by matching words from the source and target languages with statistical models and bilingual text corpora to learn parameters. Neural Machine Translation, on the other hand, uses an artificial neural network to predict the likelihood of a word sequence, frequently modelling entire phrases within a single integrated model. Neural Machine Translation is implemented using an encoder-decoder architecture with an attention mechanism. One of the most significant contributions of this paper is the use of different data sources, data collecting, and scraping to create a complete dataset. According to the study's findings, Neural Machine Translation outperforms the Statistical Machine Translation modelling technique. Furthermore, the paper examines the distinctive qualities of the Sanskrit language as well as the difficulties encountered by researchers in digesting Sanskrit while constructing the machine translation system. 
This study investigates the use of Sanskrit in Machine Translation and analyses several modelling methods, such as Statistical and Neural Machine Translation. The paper emphasizes the advantages of Neural Machine Translation and discusses the unique characteristics and challenges of the Sanskrit language in machine translation development.
- Research Article
- 10.25073/2588-1086/vnucsce.231
- May 30, 2020
- VNU Journal of Science: Computer Science and Communication Engineering
In this paper, we propose a new method for domain adaptation in Statistical Machine Translation for low-resource domains in the English-Vietnamese language pair. Specifically, our method uses only monolingual data to adapt the translation phrase-table, and our system brings improvements over the SMT baseline system. We propose two steps to improve the quality of the SMT system: (i) classify phrases on the target side of the translation phrase-table using a probabilistic classifier model, and (ii) adapt the phrase-table by recomputing the direct translation probability of phrases.
 
Our experiments are conducted in the English-to-Vietnamese translation direction on two very different domains: the legal domain (out-of-domain) and the general domain (in-domain). The English-Vietnamese parallel corpus is provided by the IWSLT 2015 organizers, and the experimental results showed that our method significantly outperformed the baseline system. Our system improved the quality of machine translation in the legal domain by up to 0.9 BLEU points over the baseline system,…
Keywords: Machine Translation, Statistical Machine Translation, Domain Adaptation
- Conference Article
- 10.5339/qfarc.2018.ictpd405
- Jan 1, 2018
In this work, we present Qatar Computing Research Institute’s live speech translation system. Our system works with both Arabic and English. It is designed using an array of modern web technologies to capture speech in real time, and transcribe and translate it using state-of-the-art Automatic Speech Recognition (ASR) and Machine Translation (MT) systems. The platform is designed to be useful in a wide variety of situations like lectures, talks and meetings. It is often the case in the Middle East that audiences in talks understand either Arabic or English alone. This system enables the speaker to talk in either language, and the audience to understand what is being spoken even if they are not bilingual. The system consists of three primary modules: i) a Web application, ii) an ASR system, and iii) a statistical/neural MT system. The three modules are optimized to work jointly and process the speech at a real-time factor close to one, which means that the systems are optimized to keep up with the speaker and provide the results with a short delay, comparable to what we observe in (human) interpretation. The real-time factor for the entire pipeline is 1.18. The Web application is based on the standard HTML5 WebAudio application programming interface. It captures speech input from a microphone on the user’s device and transmits it to the backend servers for processing. The servers send back the transcriptions and translations of the speech, which are then displayed to the user. Our platform features a way to instantly broadcast live sessions for anyone to see the transcriptions and translations of a session in real time without being physically present at the speaker’s location. The ASR system is based on KALDI, a state-of-the-art toolkit for speech recognition. We use a combination of time delay neural networks (TDNN) and long short-term memory neural networks (LSTM) to ensure real-time transcription of the incoming speech while ensuring high quality output.
The Arabic and English systems have average word error rates of 23% and 9.7% respectively. The Arabic system consists of the following components: i) a character-based lexicon of size 900K; the lexicon maps words to sound units to learn acoustic representation, ii) 40-dimensional high-resolution features extracted for each speech frame to digitize the audio signal, iii) 100-dimensional i-vectors for each frame to facilitate speaker adaptation, iv) TDNN acoustic models, and v) a tri-gram language model trained using 110 M words and restricted to a 900 K vocabulary. The MT system has two choices for the backend: a statistical phrase-based system and a neural MT system. Our phrase-based system is trained with Moses, a state-of-the-art statistical MT framework, and the neural system is trained with Nematus, a state-of-the-art neural MT framework. We use Modified Moore-Lewis filtering to select the best subset of the available data to train our phrase-based system more efficiently. In order to speed up the translation even further, we prune the language models backing the phrase-based system, ignoring knowledge that is not frequently used. On the other hand, our neural MT system is trained on all the available data, as its training scales linearly with the amount of data, unlike phrase-based systems. Our neural MT system is roughly 3–5% better on the BLEU scale, a standard measure for computing the quality of translations. However, the existing neural MT decoders are slower than the phrase-based decoders, translating 9.5 tokens/second versus 24 tokens/second. The trade-off between efficiency and accuracy barred us from picking only one final system. By enabling both technologies we allow the trade-off between quality and efficiency and leave it up to the user to decide whether they prefer a fast or an accurate system. Our system has been successfully demonstrated locally and globally at several venues like Al Jazeera, MIT, BBC and TII.
The state-of-the-art technologies backing the platform for transcription and translation are also available independently and can be integrated seamlessly into any external platform. The Speech Translation system is publicly available at http://st.qcri.org/demos/livetranslation.
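The real-time factor quoted in this abstract is the ratio of processing time to audio duration, so a value of 1.18 means that one minute of speech takes about 71 seconds to process. The short sketch below works through that arithmetic and the decoder speed trade-off. The function names, the 60-second clip and the 1,000-token workload are assumptions for illustration; only the 1.18 factor and the 9.5 and 24 tokens/second rates come from the abstract.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; 1.0 means exactly real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def decoding_seconds(num_tokens, tokens_per_second):
    """Time needed to translate num_tokens at a given decoder throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical 60-second clip that takes 70.8 seconds end to end.
    print(f"RTF: {real_time_factor(70.8, 60.0):.2f}")

    # Hypothetical 1,000-token workload at the two reported decoder speeds.
    for name, rate in [("neural", 9.5), ("phrase-based", 24.0)]:
        print(f"{name}: {decoding_seconds(1000, rate):.1f} s")
```

On these numbers the phrase-based decoder finishes the same workload roughly 2.5 times faster, which is the efficiency side of the quality trade-off the authors leave to the user.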
- Book Chapter
2
- 10.1007/978-981-16-4435-1_27
- Aug 3, 2021
Machine translation aims to minimize the language barrier between people of different linguistic backgrounds. Machine translation is an automatic translation technique between pairs of different languages. Neural machine translation, i.e. machine translation using neural networks, came into the picture because of several limitations associated with its predecessors, the rule-based and statistical models. For large data sets with a rich range of vocabulary, a neural machine translation system provides fair translation accuracy. We have observed that very few machine translators are dedicated to Indian languages, especially those spoken in the North-East area. Our paper therefore focuses on implementing a neural machine translation system for the English-Assamese language pair. In this paper we used five different neural machine translation models for the English-Assamese language pair using LSTMs and GRUs, along with an attention layer. We compared these models based on their performance results. We used the BLEU score to measure the accuracy of the five models, achieving a BLEU score of 34.168%, which was the highest among the five models.
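Several abstracts in this list report BLEU, so a compact sketch of how a corpus-level score can be computed may help when reading them. It is a simplified illustration under stated assumptions: one reference per hypothesis, whitespace-tokenised input, n-grams up to length four and no smoothing. It is not the scoring script used in any of the papers, and scores from different tokenisations are not directly comparable.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with clipped n-gram precision and brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Toy English sentences (hypothetical data).
    refs = ["the cat sat on the mat".split()]
    hyps = ["the cat sat on a mat".split()]
    print(f"BLEU: {corpus_bleu(hyps, refs):.2f}")
```

Comparing several models, as in the English-Assamese study, then amounts to scoring each model's output against the same references and ranking the results.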
- Conference Article
8
- 10.1109/miucc55081.2022.9781776
- May 8, 2022
For the longest time, translation was a labor-intensive process that relied entirely on human effort. While human translation remains the most reliable method of textual content translation, it takes longer and is more expensive if done for each individual piece of information. Several Machine Translation (MT) approaches have recently emerged to facilitate the migration of any content across languages, especially for low-resource languages such as Arabic. Given that Arabic is one of the world's most widely spoken languages, the task of Arabic machine translation has recently attracted a lot of interest from the scientific community. Indeed, the amount of study devoted to these low-resource languages has resulted in some significant accomplishments; however, the status of Arabic MT systems falls short of the quality obtained for other languages. As a result, this survey examines the origins and main development timeline of MT approaches, investigates the significant branches, and categorizes different study orientations. In addition, it gives a comprehensive overview of the key research works that have been completed in the field of Arabic Neural MT (ANMT) and discusses possible future research prospects in this discipline.
- Conference Article
12
- 10.18653/v1/d19-6110
- Jan 1, 2019
Active learning (AL) for machine translation (MT) has been well-studied for the phrase-based MT paradigm. Several AL algorithms for data sampling have been proposed over the years. However, given the rapid advancement in neural methods, these algorithms have not been thoroughly investigated in the context of neural MT (NMT). In this work, we address this missing aspect by conducting a systematic comparison of different AL methods in a simulated AL framework. Our experimental setup to compare different AL methods uses: i) State-of-the-art NMT architecture to achieve realistic results; and ii) the same dataset (WMT’13 English-Spanish) to have fair comparison across different methods. We then demonstrate how recent advancements in unsupervised pre-training and paraphrastic embedding can be used to improve existing AL methods. Finally, we propose a neural extension for an AL sampling method used in the context of phrase-based MT - Round Trip Translation Likelihood (RTTL). RTTL uses a bidirectional translation model to estimate the loss of information during translation and outperforms previous methods.