Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words

K Almeman, M Lee

doi:10.1109/iccspa.2013.6487247


Publication Date: Feb 1, 2013

Citations: 31

Affiliation: University of Birmingham

#Arabic Dialect #Dialect Text

Abstract

The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify distinctive words and phrases that are used in one specific dialect only and not in other dialects, so that a text corpus for each specific dialect could be downloaded. From this experiment we obtained 48M tokens from different Arabic dialects. The dialects were categorised into four main groups, Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The corpora contain 2M distinct types in total. In this paper we describe how the corpora were constructed using these distinctive words.
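As a rough illustration of the idea in the abstract, the following minimal sketch assigns a text to a dialect only when it contains seed words from exactly one dialect, then counts tokens and distinct types per corpus. It is not the authors' implementation: the seed words, the whitespace tokenizer, and the names `SEED_WORDS`, `assign_dialect`, `build_corpora` and `corpus_stats` are all assumptions made for this example, and the seed words are not taken from the paper's survey.

```python
# Illustrative seed words only (an assumption, not the paper's surveyed list).
SEED_WORDS = {
    "Gulf": {"الحين"},
    "Levantine": {"هلق"},
    "Egyptian": {"دلوقتي"},
    "North African": {"برشا"},
}


def tokenize(text):
    """Split on whitespace; a real pipeline would also normalise Arabic text."""
    return text.split()


def assign_dialect(text):
    """Return the single dialect whose seed words occur in the text, else None."""
    tokens = set(tokenize(text))
    hits = [dialect for dialect, seeds in SEED_WORDS.items() if tokens & seeds]
    # Texts matching no dialect, or more than one, are treated as ambiguous.
    return hits[0] if len(hits) == 1 else None


def build_corpora(documents):
    """Group the tokens of unambiguously assigned documents by dialect."""
    corpora = {dialect: [] for dialect in SEED_WORDS}
    for doc in documents:
        dialect = assign_dialect(doc)
        if dialect is not None:
            corpora[dialect].extend(tokenize(doc))
    return corpora


def corpus_stats(tokens):
    """Count running tokens and distinct types in a token list."""
    return {"tokens": len(tokens), "types": len(set(tokens))}
```

In this sketch, a document containing seed words from two dialects is discarded instead of being assigned to either, which is one simple way to keep each corpus specific to a single dialect.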
