Abstract

In this chapter, we propose pre-editing and text standardization as essential components of corpus editing and text normalization, which together make a text corpus ready for use across various domains of linguistics and language technology. We identify some of the basic pre-editing and text standardization tasks and describe them with reference to a Bangla text corpus. As the name suggests, text normalization involves diverse tasks of text adjustment and standardization that improve the utility of the texts stored in a corpus for both manual and machine-based applications. The methods and strategies that we propose for overcoming the problems of text normalization are oriented largely toward written text corpora, since normalizing a spoken text corpus usually invokes a different set of operations that hardly match the processes normally applied to written text. A normalized text not only reduces the workload in subsequent use of a corpus but also enhances its accessibility to both humans and machines across all domains where a language corpus has application and referential relevance.
