Mining Parallel Corpora from Sina Weibo and Twitter

Wang Ling, Luís Marujo, Isabel Trancoso, Alan W Black, Chris Dyer

doi:10.1162/coli_a_00249

Abstract

Microblogs such as Twitter, Facebook, and Sina Weibo (China's equivalent of Twitter) are a remarkable linguistic resource. In contrast to content from edited genres such as newswire, microblogs contain discussions of virtually every topic by numerous individuals in different languages and dialects and in different styles. In this work, we show that some microblog users post “self-translated” messages targeting audiences who speak different languages, either by writing the same message in multiple languages or by retweeting translations of their original posts in a second language. We introduce a method for finding and extracting this naturally occurring parallel data. Identifying the parallel content requires solving an alignment problem, and we give an optimally efficient dynamic programming algorithm for this. Using our method, we extract nearly 3M Chinese–English parallel segments from Sina Weibo using a targeted crawl of Weibo users who post in multiple languages. Additionally, from a random sample of Twitter, we obtain substantial amounts of parallel data in multiple language pairs. Evaluation is performed by assessing the accuracy of our extraction approach relative to a manual annotation as well as in terms of utility as training data for a Chinese–English machine translation system. Relative to traditional parallel data resources, the automatically extracted parallel data yield substantial translation quality improvements in translating microblog text and modest improvements in translating edited news content.

Journal: Computational Linguistics
Publication Date: Jun 1, 2016
Citations: 20
License type: cc-by-nc-nd

Similar Papers

Cross-group or within-group attention flow? Exploring the amplification process among elite users and social media publics in Sina Weibo
Pianpian Wang ... Wensen Huang
Telematics and Informatics | VOL. 56
18 Aug 2020

Affective and cognitive features of comments added by forwarders in Sina Weibo during disasters
Xi Chen ... Gang Li
Proceedings of the Association for Information Science and Technology | VOL. 57
01 Oct 2020

Evaluating Rumor Debunking Effectiveness During the COVID-19 Pandemic Crisis: Utilizing User Stance in Comments on Sina Weibo.
Xin Wang ... Guang Yu
Frontiers in Public Health | VOL. 9
30 Nov 2021

Subjective Well-Being of Chinese Sina Weibo Users in Residential Lockdown During the COVID-19 Pandemic: Machine Learning Analysis.
Yilin Wang ... Sijia Li
Journal of Medical Internet Research | VOL. 22
17 Dec 2020
