Abstract

It is well known that statistical machine translation (SMT) performance suffers when a model is applied to out-of-domain data. It is also known that the more similar the test and training domains are, the more effective the training data are for SMT performance. Hence, measuring domain similarity is an important step in selecting appropriate training data. The most widely used method combines word frequency with the cosine similarity function. The lack of exploration of other approaches motivates us to propose and compare several similarity measures. Aiming for better SMT performance, we compared 10 similarity measures, each a combination of one of 2 feature representations and one of 5 similarity functions. The results show that using relative word frequency as the feature representation and skew divergence as the similarity function performs best among the 10 measures and outperforms random data selection.
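
To make the best-performing combination concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes pre-tokenised text and the commonly used definition of skew divergence as the KL divergence of one distribution from a mixture of the two, KL(p || alpha*q + (1 - alpha)*p). The function names, the argument order, and the value alpha = 0.99 are illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def skew_divergence(p, q, alpha=0.99):
    """Skew divergence KL(p || alpha*q + (1 - alpha)*p).

    p and q are dicts of relative word frequencies. alpha is an assumed
    smoothing weight (0 < alpha < 1); mixing in a little of p keeps the
    denominator positive for words that never occur in q.
    """
    total = 0.0
    for word, p_w in p.items():
        mixture = alpha * q.get(word, 0.0) + (1.0 - alpha) * p_w
        total += p_w * math.log(p_w / mixture)
    return total


# Hypothetical toy corpora: a lower value means the candidate training
# domain is closer to the test domain.
test_domain = relative_frequencies("the patient received the drug".split())
medical = relative_frequencies("the patient took the drug".split())
legal = relative_frequencies("the court rejected the appeal".split())

print(skew_divergence(test_domain, medical))
print(skew_divergence(test_domain, legal))
```

Under these assumptions, candidate training sets would be ranked by ascending divergence from the test domain, and the closest ones selected for training.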
