Abstract

Creating a text corpus requires a sequence of processing steps to build it, normalize it, and make it directly usable by a given application. This paper presents a generic approach to text normalization and concentrates on the methodological and linguistic-engineering aspects of developing a multi-purpose multilingual text corpus. The approach was applied to written texts in French, English, Spanish, Vietnamese, Khmer and Chinese and to speech transcriptions in French, English, Italian, Chinese and Taiwanese. It consists of splitting the text normalization problem into a set of smaller sub-problems that are as language-independent as possible. A set of text corpus normalization tools with associated resources and a document structuring method are proposed and distributed under the terms of the GPL.
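
The decomposition described above, in which normalization is split into small sub-problems kept as language-independent as possible, can be illustrated with a minimal sketch. This is not the toolkit distributed with the paper: the step functions, the `make_pipeline` helper, and the choice of steps are hypothetical and only show how shared steps might be composed with an optional language-specific one.

```python
import re
import unicodedata
from typing import Callable, List

# A normalization step maps text to text (hypothetical convention for this sketch).
Step = Callable[[str], str]


def normalize_unicode(text: str) -> str:
    """Language-independent: bring characters to Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


def unify_quotes(text: str) -> str:
    """Language-independent: map curly quotes to their ASCII equivalents."""
    table = str.maketrans({"\u2018": "'", "\u2019": "'",
                           "\u201c": '"', "\u201d": '"'})
    return text.translate(table)


def collapse_whitespace(text: str) -> str:
    """Language-independent: squeeze whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text: str) -> str:
    """Language-dependent: only meaningful for scripts that have letter case."""
    return text.lower()


def make_pipeline(steps: List[Step]) -> Step:
    """Compose steps into a single function that applies them in order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


# Steps shared by every language in this sketch.
SHARED_STEPS: List[Step] = [normalize_unicode, unify_quotes, collapse_whitespace]

# Per-language pipelines reuse the shared steps and add specific ones if needed.
PIPELINES = {
    "fr": make_pipeline(SHARED_STEPS + [lowercase]),
    "km": make_pipeline(SHARED_STEPS),  # Khmer script has no letter case
}

if __name__ == "__main__":
    print(PIPELINES["fr"]("  \u201cBonjour\u201d   le\tMonde "))
```

Keeping each step a plain text-to-text function means a new language needs only the few steps specific to it, while the shared ones are reused unchanged.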

