Abstract: Natural language processing (NLP) is difficult for languages such as Malayalam, which are highly inflectional and agglutinative. This degrades the performance of Malayalam NLP applications, so the word embeddings trained on Malayalam corpora need to be improved. The improvement proposed here standardises the corpus by removing the inflectional parts of every word, that is, by keeping root words only. All that is needed is a stemmer; in this project I use a Malayalam morphological analyser to obtain the root of every word in an existing Malayalam corpus. Removing inflections reduces the sparsity of the corpus and sharply increases the frequency of the words that remain, which in turn lowers the space and time complexity of the word embedding representation. Zipf's law describes a discrete probability distribution that gives the probability of encountering a word in a given corpus, with a word's frequency roughly inversely proportional to its frequency rank. Based on Zipf's law, I propose that increasing word frequencies in this way will improve neural word embeddings for Malayalam. fastText is used to train the embeddings, producing dense word vectors for the Malayalam corpus in place of the high-dimensional sparse word co-occurrence matrix. The improvement is mainly intended for WordNet, analogy, and ontology based Malayalam applications.
Index Terms: Morphological Analyzer, Zipf's law, Preprocessing, Testing, Training
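The effect of root-word normalisation on sparsity can be illustrated with a minimal sketch. It is not the morphological analyser used in this work: the lookup table `TOY_ROOTS`, the helper names `to_root` and `corpus_stats`, and the romanised toy tokens are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical stand-in for a morphological analyser: a hand-written
# lookup from romanised inflected forms to their root words.
TOY_ROOTS = {
    "veettil": "veedu",    # "in the house"
    "veedinte": "veedu",   # "of the house"
    "veedukal": "veedu",   # "houses"
    "marathil": "maram",   # "in the tree"
    "marangal": "maram",   # "trees"
}


def to_root(token):
    """Return the root word for a token, or the token itself if unknown."""
    return TOY_ROOTS.get(token, token)


def corpus_stats(tokens):
    """Return (vocabulary size, mean number of occurrences per word type)."""
    counts = Counter(tokens)
    return len(counts), len(tokens) / len(counts)


# Toy corpus of inflected forms (illustrative, not real corpus data).
corpus = [
    "veedu", "veettil", "veedinte", "veedukal",
    "maram", "marathil", "marangal", "veettil",
]
rooted = [to_root(t) for t in corpus]

raw_vocab, raw_mean = corpus_stats(corpus)
root_vocab, root_mean = corpus_stats(rooted)

print("raw corpus:   ", raw_vocab, "types,", round(raw_mean, 2), "occurrences per type")
print("rooted corpus:", root_vocab, "types,", round(root_mean, 2), "occurrences per type")
```

In this toy case the vocabulary shrinks from 7 word types to 2 while the number of tokens stays the same, so each remaining type is observed more often. That is the reduction in sparsity and increase in frequency the abstract relies on; whether it improves embeddings on a real corpus is the claim the work sets out to test.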