Abstract

Text coherence analysis is one of the most challenging tasks in Natural Language Processing (NLP), more so than other subfields such as text generation, translation, or text summarization. Many text coherence methods exist in NLP, and most of them are graph-based or entity-based methods designed for short text documents. For long text documents, however, existing methods achieve low accuracy, which is the biggest challenge in text coherence analysis in both English and Bengali. One reason is that existing methods do not account for misspelled words in a sentence and therefore cannot assess text coherence accurately. This paper proposes a text coherence analysis method based on the Misspelling Oblivious Word Embedding Model (MOEM) and a deep neural network. The MOEM replaces all misspelled words with the correct words and captures the interaction between different sentences by calculating their matches using word embeddings. A deep neural network architecture is then used to train and test the model. The study examines two datasets, one in Bengali and one in English, to analyze text consistency based on sentence sequence activities and to evaluate the effectiveness of the model. The Bengali dataset contains 7121 text documents, of which 5696 (80%) are used for training and 1425 (20%) for testing. The English dataset contains 7500 text documents, of which 6000 (80%) are used for training and 1500 (20%) for model evaluation. Experimental results show that the proposed model significantly improves automatic text coherence detection, reaching 98.1% accuracy in English and 89.67% accuracy in Bengali. Finally, the proposed model is compared with existing text coherence models on both the English and Bengali datasets.
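As a rough illustration of what "calculating matches between sentences using word embeddings" can look like, the following minimal Python sketch averages word vectors into sentence vectors and scores each pair of adjacent sentences by cosine similarity. It is not the paper's model: the toy 3-dimensional vectors, the averaging step, and the function names (`sentence_vector`, `adjacent_match_scores`) are assumptions made purely for illustration, and the proposed method feeds such information into a deep neural network rather than using the raw scores.

```python
import math

# Hypothetical toy word vectors (3 dimensions); a real system would use
# vectors learned by an embedding model.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sat": [0.1, 0.9, 0.0],
    "ran": [0.2, 0.8, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.0, 0.2, 0.8],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words in a sentence (None if no word is known)."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjacent_match_scores(sentences, vectors):
    """Return one similarity score per pair of adjacent sentences."""
    vecs = [sentence_vector(s, vectors) for s in sentences]
    return [
        0.0 if a is None or b is None else cosine(a, b)
        for a, b in zip(vecs, vecs[1:])
    ]


related = adjacent_match_scores(["cat sat", "dog ran"], TOY_VECTORS)
unrelated = adjacent_match_scores(["cat sat", "stock market"], TOY_VECTORS)
print(related[0] > unrelated[0])
```

With these toy vectors, the topically related pair scores higher than the unrelated pair, which is the kind of signal a coherence model can learn from.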

Highlights

  • Text coherence analysis is a well-known task in natural language processing for texts with multiple sentences [1]

  • Identifying sentences that contain misspellings and deriving the word vectors of the correct words from the misspelled ones adds a new dimension to coherence analysis

  • Model inputs: Because this study considers out-of-vocabulary words, misspelled words, and similar cases, the input to the proposed coherence model is the output of the misspelling word embedding model, that is, word vectors for the different types of words
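One common way to obtain usable vectors for misspelled or out-of-vocabulary words is to represent each word by its character n-grams, so that a misspelling still shares most of its representation with the intended word. The Python sketch below illustrates that idea under stated assumptions: it uses sparse n-gram count vectors as a stand-in for the dense, learned n-gram vectors of a trained subword model, and the helper names (`char_ngrams`, `ngram_cosine`, `closest_vocab_word`) and the small vocabulary are hypothetical. It is not the MOEM described in the paper.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Count the character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_cosine(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two words."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def closest_vocab_word(word, vocab, n=3):
    """Map a possibly misspelled word to the most similar vocabulary word."""
    return max(vocab, key=lambda candidate: ngram_cosine(word, candidate, n))


vocab = ["coherence", "sentence", "banana"]  # hypothetical vocabulary
print(closest_vocab_word("coherance", vocab))  # prints "coherence"
```

Here "coherance" shares six of its nine trigrams with "coherence" and only two with "sentence", so the misspelling is mapped to the intended word. A trained model would learn dense vectors for the n-grams instead of treating each one as its own dimension.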


Summary

Introduction

Text coherence analysis is a well-known task in natural language processing for texts with multiple sentences [1]. With the rapid development of digital communication media such as social networks, mobile devices, and online news portals, it has become harder to identify which information is consistent and which is inconsistent. Without automatic evaluation, it is very difficult to check the consistency of text across sentences in a short time. During digital communication, online assessment, or news reporting, an inexperienced user may sometimes misspell one or more words in their text [2]. Common errors such as grammatical mistakes, vocabulary errors, or syntax errors can be detected, but assessing text coherence between paragraphs is very difficult in both manual and computerized systems.
