Abstract

Parallel corpora are data sets in which sentences with the same meaning are paired across different languages. The size and quality of the available parallel corpora are among the most important factors determining the quality of machine translation systems, and such data are generally insufficient for the Turkish-English language pair. In this study, a large parallel corpus was created for academic translation between Turkish and English. The data set was built from the abstracts of postgraduate theses, and the best matches were obtained using sentence alignment algorithms such as Vecalign and Hunalign. As a result, 1M parallel sentence pairs were obtained. In addition, a Bi-LSTM-based translation system was built to measure the quality of the resulting data. The model achieved a BLEU score of 15.8 on the TED (Tr-En) test set in a zero-shot setting.
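
To make the sentence alignment step more concrete, the following is a minimal sketch of monotonic alignment by dynamic programming. It is not the Vecalign or Hunalign algorithm and not the authors' pipeline: as a simplifying assumption it scores candidate pairs only by character-length ratio, whereas Vecalign uses sentence embeddings and Hunalign combines length with dictionary information. The names `length_score`, `align`, and `skip_penalty` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def length_score(src: str, tgt: str) -> float:
    """Toy similarity in [0, 1]: shorter length divided by longer length.

    Assumption for illustration only; real aligners use richer signals.
    """
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def align(
    src_sents: List[str],
    tgt_sents: List[str],
    skip_penalty: float = 0.3,
) -> List[Tuple[int, int]]:
    """Return monotonic 1-1 index pairs (src_index, tgt_index).

    A sentence on either side may be left unaligned at a cost of
    `skip_penalty` (a hypothetical tuning parameter).
    """
    n, m = len(src_sents), len(tgt_sents)
    # best[i][j]: best total score for the first i source and j target sentences
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                pair = length_score(src_sents[i - 1], tgt_sents[j - 1])
                candidates.append((best[i - 1][j - 1] + pair, "pair"))
            if i > 0:
                candidates.append((best[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                candidates.append((best[i][j - 1] - skip_penalty, "skip_tgt"))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the end to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    turkish = ["Bu çalışmada büyük bir paralel derlem oluşturulmuştur.", "Kısa."]
    english = ["In this study, a large parallel corpus was created.", "Short."]
    for s, t in align(turkish, english):
        print(turkish[s], "|||", english[t])
```

In a realistic setting the length-based score would be replaced by a stronger similarity measure, and low-scoring pairs would be filtered out before the data are used for training.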

