Chapter 3 - Linguistic Data Resources

Christopher Cieri, Mark Liberman, Victoria Arranz, Khalid Choukri

doi:10.1016/b978-012088501-5/50006-8

Abstract

This chapter provides an overview of available language resources from both U.S. and European perspectives. It introduces multilingual data repositories and large ongoing and planned collection efforts, and describes the major challenges such efforts face: transcription issues due to inconsistent writing standards, subject recruitment, recording equipment, legal aspects, and costs in time and money. The multilingual resources covered comprise audio and text data, pronunciation dictionaries, and parallel bilingual/multilingual corpora. On the European side, a number of projects have been working toward the production of multilingual speech and language resources, many of which have become key databases for the human language technology (HLT) community. The SpeechDat projects are a set of speech data-collection efforts funded by the European Commission with the aim of establishing databases for the development of voice-operated teleservices and speech interfaces. The resulting databases are available via the European Language Resources Association (ELRA).

Full Text