Word Embedding for Small and Domain-specific Malay Corpus

Sabrina Tiun, Nor Fariza Mohd Nor, Azhar Jalaludin, Anis Nadiah Che Abdul Rahman

doi:10.1007/978-981-15-0058-9_42

Abstract

In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, a Word2vec model was trained on the Hansard corpus of the Malaysian Parliament for specific years. However, an accurate WE model requires a specific hyperparameter setting, because changing even one hyperparameter affects the model’s performance. We therefore trained a series of WE models on the corpus, each differing from the others in the value of one hyperparameter. Model performance was evaluated intrinsically using three semantic word relations, namely word similarity, dissimilarity and analogy. The model outputs were evaluated and analysed by experts (corpus linguists). The experts’ evaluation showed that, for a small, domain-specific corpus, suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100 and the Skip-gram architecture.
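
The one-hyperparameter-at-a-time design described in the abstract can be sketched as follows. This is only an illustrative sketch: the baseline and candidate values are assumptions (the abstract does not list the grid that was actually used), and the keys `vector_size`, `window` and `sg` are chosen to match the keyword arguments of gensim's `Word2Vec` class, which is assumed here as one possible Word2vec implementation.

```python
# Hypothetical baseline and candidate values. The abstract reports only the
# ranges the experts found suitable, not the grid that was searched.
BASELINE = {"vector_size": 100, "window": 5, "sg": 1}  # sg=1: Skip-gram, sg=0: CBOW
CANDIDATES = {
    "vector_size": [50, 100, 300],
    "window": [2, 5, 10],
    "sg": [0, 1],
}


def one_at_a_time(baseline, candidates):
    """Return configs that differ from the baseline in at most one hyperparameter."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # this value is already covered by the baseline config
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


CONFIGS = one_at_a_time(BASELINE, CANDIDATES)

if __name__ == "__main__":
    for config in CONFIGS:
        # With gensim installed, a model per config could be trained as
        # gensim.models.Word2Vec(sentences=tokenized_corpus, **config),
        # where tokenized_corpus is a hypothetical list of token lists.
        print(config)
```

Each resulting model would then be queried for similar words, dissimilar words and analogies, and the outputs handed to the expert evaluators. Varying one value at a time keeps the number of models small, at the cost of not exploring interactions between hyperparameters.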

