Abstract

An objective of neural language modelling is to learn the joint probability function of sequences of words in a language. This is inherently difficult because of the heavy computational requirements and the curse of dimensionality: a word sequence the model encounters during testing is likely to differ from all word sequences seen during training. Recent work on learning word vector representations has been successful in capturing semantic and syntactic relationships between the words of a language. These word embeddings have proven very effective in various Natural Language Processing (NLP) tasks such as machine translation, question answering, and text summarization. Training word embeddings with neural networks is now prevalent among NLP researchers; two major models, Continuous Bag of Words (CBOW) and Skip-gram, have not only improved accuracy but also reduced training time. However, the vector space representation can still be improved by combining existing techniques that are rarely used together, such as the subword model, where a word is represented as a weighted average of its character n-gram representations. Although pre-trained word vectors are a key requirement in many NLP tasks, generating word vectors for Indian languages has drawn comparatively little attention. This paper proposes a distributed representation for Kannada words using an optimal neural network model and a combination of these known techniques.
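To make the subword idea above concrete, the following is a minimal sketch (in Python) of composing a word vector as the average of its character n-gram vectors. The n-gram range, the "<" and ">" boundary markers, the uniform weighting, and the toy embedding table are assumptions for illustration, not the model trained in this paper.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Enumerate character n-grams of the word, padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def subword_vector(word, ngram_table, dim=100):
    """Represent a word as the (uniformly weighted) average of its n-gram vectors."""
    grams = char_ngrams(word)
    vecs = [ngram_table[g] for g in grams if g in ngram_table]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy usage: random vectors stand in for trained n-gram embeddings.
rng = np.random.default_rng(0)
table = {g: rng.standard_normal(100) for g in char_ngrams("ಕನ್ನಡ")}
vec = subword_vector("ಕನ್ನಡ", table)
print(vec.shape)  # (100,)
```

Because unseen or rare Kannada words still share character n-grams with words observed during training, this composition lets the model assign them non-trivial vectors instead of a single out-of-vocabulary embedding.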
