Abstract

Text has become available in many languages, as almost all websites now offer a multilingual option. Hindi is considered an official language in one of the states of India. Hindi text analysis is dominated by corpora of stories and poems. Token extraction is an important step before any text analysis and supports many applications, such as text summarization and text categorization. Token extraction is a part of natural language processing (NLP), which includes many steps such as corpus preprocessing and lemmatization. In this paper, tokens are extracted by two methods and from two corpora. BaSa, a context-based term extraction technique that combines several NLP activities such as Term Frequency-Inverse Document Frequency (TF-IDF), and Zipf’s law are used to count and compare the extracted tokens. The tokens produced by the two methods are then compared. The corpora contain prose and verse in Hindi as well as Marathi. Common tokens across the verse and prose corpora of Marathi and Hindi are identified to show that both behave the same as far as NLP activities are concerned. BaSa is shown to perform better than Zipf’s law. The Hindi corpus includes 820 stories and 710 poems, and the Marathi corpus includes 610 stories and 505 poems.
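As a rough illustration of the two standard building blocks named in the abstract, the sketch below computes TF-IDF scores and a Zipf rank-frequency table for a tiny toy corpus. It is not the BaSa technique, whose details are not given here. The function names (`tokenize`, `tf_idf`, `zipf_table`), the whitespace tokenizer, the `tf * log(N / df)` weighting, and the two sample sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization after replacing common punctuation,
    # including the Devanagari danda, with spaces. No lemmatization is done.
    for ch in "।,.!?;:\"'()":
        text = text.replace(ch, " ")
    return text.split()


def tf_idf(docs):
    """Return one {token: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def zipf_table(docs):
    """Return (rank, token, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(t for d in docs for t in tokenize(d))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, tok, freq, rank * freq)
            for rank, (tok, freq) in enumerate(ranked, start=1)]


# Hypothetical two-document toy corpus, not taken from the paper's data.
docs = [
    "राम घर गया । राम ने खाना खाया ।",
    "सीता घर आई ।",
]
scores = tf_idf(docs)
table = zipf_table(docs)
```

Under Zipf’s law the product of rank and frequency stays roughly constant across a large corpus, so the last column of `zipf_table` gives a quick check on real data. A token that occurs in every document receives a TF-IDF score of zero under this weighting, which is one reason the two methods can select different tokens.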

Highlights

  • Hindi and Marathi are widely spoken languages and are used as official languages in North India and Maharashtra, respectively [1]

  • India is a diverse country with around 23 official languages, which has opened a wide area for natural language processing researchers

  • It focuses on the challenges of sentiment mining for Hindi tweets


Summary

Introduction

Hindi and Marathi are widely spoken languages and are used as official languages in North India and Maharashtra, respectively [1]. Abundant Hindi and Marathi text is generated every day. To process this data, NLP techniques along with machine learning algorithms are available in the literature. To analyze the behavior of these algorithms, corpora of Hindi or Marathi poems and stories are used. Stories and poems guide children in their behavior and manners [2,3] and help them connect with elders to link ideas and visualize life’s opportunities. A model is proposed for carrying out sentiment analysis on Hindi tweets. It focuses on the challenges of sentiment mining for Hindi tweets. The growth of Indian languages over time in the area of sentiment mining is described, along with the taxonomy of Indian languages.

