Abstract

Communication has increased many-fold in the internet era, making social media a lively platform for the exchange of information. Most people use multiple or mixed languages in their conversations as they share contemporaneous information. Code mixing is the practice of mixing two or more languages within a dialogue. Extracting relevant and meaningful information from such mixed-language text is a tedious exercise. The objective of this paper is to perform named entity recognition (NER), one of the challenging tasks in natural language processing, on code-mixed text. The proposed method presents an exhaustive comparison, not previously addressed, of four word embedding approaches: the Continuous Bag of Words (CBOW) model, the Skip-gram model, Term Frequency-Inverse Document Frequency (TF-IDF), and Global Vectors for Word Representation (GloVe). These word vector representation schemes capture the meaning of words along different dimensions, here for the code-mixed language pair English-Hindi. The resulting word vectors, or feature vectors, computed from co-occurrences, yielded good cross-validation scores when evaluated with six conventional machine learning algorithms. The study reveals that TF-IDF is the best word embedding model, yielding the highest accuracy on the small dataset. Precision, recall, and F-measure were used as evaluation measures.
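
To illustrate the kind of term weighting that TF-IDF performs, the following is a minimal sketch in plain Python. It is not the paper's implementation: it assumes the textbook, unsmoothed formulation (term frequency multiplied by the logarithm of inverse document frequency), whitespace tokenization, and three made-up English-Hindi code-mixed sentences. The function name `tfidf` and the sample sentences are hypothetical and serve only as an example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: textbook unsmoothed TF-IDF, i.e.
    weight = (count / doc_length) * log(n_docs / doc_frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical code-mixed sentences, invented for illustration only.
docs = [
    "kal main Delhi ja raha hoon".split(),
    "main office ja raha hoon".split(),
    "Delhi is a big city".split(),
]
vectors = tfidf(docs)
```

In this sketch, a term that occurs in only one sentence (such as "kal") receives a higher weight than a term shared across sentences (such as "main"), which is the property that lets TF-IDF features separate distinctive tokens from common ones.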
