Abstract

This paper presents the theoretical grounding for constructing two databases of PE- and PEN-, two Indonesian nominalizing prefixes that carry various meanings (e.g., patient, agent, or instrument). The first database contains the words formed with PE- and PEN-, whereas the second provides the cosine similarity between two words of interest. Built from a written Indonesian corpus as the primary source (Leipzig Corpora Collection), the databases contain the following information: the PE- or PEN- prefix, the allomorph of PEN-, the base word, the semantic role, the morphological variation, the cosine similarity, and the word frequency. The paper also elaborates on the theoretical considerations behind how each piece of information was derived. In building the databases, an Indonesian morphological parser and Word to Vector were used to analyze the morphological status of the Indonesian words and to represent the words in the corpus as vectors. In addition, the data were manually verified against the comprehensive Indonesian dictionary. The databases are freely available so that the data can be used as material for corpus-based analyses of Indonesian morphology. This research offers a careful and thorough classification, in open-access databases, of PE- and PEN- words by allomorph, base word, semantic role, and morphological variation. The information provided in this article is intended to contribute to Indonesian morphology specifically and to other fields of linguistics (e.g., corpus linguistics and quantitative linguistics) in general.
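To make the cosine similarity measure in the second database concrete, the following is a minimal sketch of how the similarity between two word vectors can be computed. It is not the authors' implementation: the function name `cosine_similarity` and the three-dimensional vectors are hypothetical and chosen only for illustration, whereas real word vectors would come from a model trained on the corpus and would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical vectors, made up for illustration; they do not come from
# the databases described in the abstract.
vec_a = [0.2, 0.7, 0.1]
vec_b = [0.3, 0.6, 0.2]
similarity = cosine_similarity(vec_a, vec_b)
```

Values close to 1 indicate that the two vectors point in nearly the same direction, and values close to 0 indicate that they are nearly orthogonal.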
