Abstract

Current approaches to studying a language or a linguistic phenomenon rely on collections of data; a linguistic corpus must therefore consist of naturally occurring material rather than examples drawn from a researcher's intuition. This methodology is especially important for historical linguistic data from dead languages, which have no living speakers. The present research develops a linguistic corpus of Middle Persian and organizes the data in a database. To this end, six levels of information are recorded in the annotation process: the transliteration of the Pahlavi text, the transcription of each word, its Persian translation, its fine-grained syntactic category, its lemma, and whether or not the word is a huzwāres. To define the fine-grained syntactic categories, the tag set for contemporary Persian developed by Bijankhan et al. (2011) and organized by Ghayoomi (2004) is modified and adapted to the requirements of the Pahlavi language. The new tag set is then used to label Pahlavi words. Once the words are annotated and the information is organized, statistical information can be extracted to gain deeper insight into the content of the texts.
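
As an illustration only, the six annotation levels and the kind of statistics they make possible could be sketched as follows. This is a minimal sketch, not the database design used in the research: the `Token` class, its field names, the tag labels `N` and `V`, and the sample values are all hypothetical placeholders rather than real corpus data or the adapted tag set.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word; field names are assumptions, one per annotation level."""
    transliteration: str  # rendering of the Pahlavi script
    transcription: str    # reading of the word
    translation: str      # Persian translation
    pos: str              # fine-grained syntactic category (hypothetical labels)
    lemma: str            # dictionary form of the word
    is_huzwares: bool     # True if the word is marked as huzwāres


def tag_frequencies(tokens):
    """Count how often each syntactic category occurs."""
    return Counter(token.pos for token in tokens)


def huzwares_ratio(tokens):
    """Share of tokens marked as huzwāres; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return sum(token.is_huzwares for token in tokens) / len(tokens)


# Placeholder records, not real Pahlavi data.
SAMPLE = [
    Token("translit-1", "transcr-1", "transl-1", "N", "lemma-1", True),
    Token("translit-2", "transcr-2", "transl-2", "V", "lemma-2", False),
    Token("translit-3", "transcr-3", "transl-3", "N", "lemma-1", False),
]

if __name__ == "__main__":
    print(tag_frequencies(SAMPLE))
    print(huzwares_ratio(SAMPLE))
```

Keeping one record per word with all six levels side by side means that counts by category, by lemma, or by huzwāres status reduce to simple aggregations over the same list.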
