Probabilistic Language Model Research Articles

With the explosion of multilingual content on the Web, particularly on social media platforms, identifying the languages present in a text is becoming an important task for various applications. Automatic language identification (ALI) in social media text is already non-trivial because of slang, misspellings, creative spellings and special elements such as hashtags and user mentions, and it becomes even more challenging in a multilingual environment. In a highly multilingual society, code-mixing that does not affect the underlying sense has become a natural phenomenon. In such a dynamic environment, the conversational text alone is often insufficient to identify the languages present. This paper proposes various methods of exploiting social conversational features to enhance ALI performance. Although social conversational features have previously been explored for ALI using methods such as probabilistic language modeling, these models often fail to address issues such as code-mixing, phonetic typing and out-of-vocabulary words, which are prevalent in a highly multilingual environment. This paper differs in that it uses the social conversational features to propose text refinement strategies suited to ALI in a highly multilingual environment. The contributions of this paper are as follows. First, it analyzes the characteristics of various social conversational features by exploiting language usage patterns. Second, it proposes various methods of text refinement suitable for language identification. Third, it investigates the effects of the proposed refinement methods using various sentence-level language identification frameworks. Experimental observations over three conversational datasets collected from Facebook, YouTube and Twitter show that the proposed method of ALI using social conversational features outperforms its baseline counterparts.
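As a rough illustration of the kind of probabilistic baseline the abstract refers to, the sketch below scores a sentence against per-language character trigram frequencies with add-one smoothing. It is not the paper's method: the function names (`char_ngrams`, `train`, `identify`), the trigram order, the smoothing choice and the toy training sentences are all assumptions made for this example, and the sketch does nothing about code-mixing or the social conversational features the paper studies.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count character n-grams per language.

    samples: dict mapping a language label to a list of training sentences.
    """
    models = {}
    for lang, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(char_ngrams(sentence, n))
        models[lang] = counts
    return models


def identify(text, models, n=3):
    """Return the language whose smoothed n-gram frequencies best fit the text."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams

    best_lang, best_score = None, -math.inf
    for lang in sorted(models):  # sorted so ties resolve deterministically
        counts = models[lang]
        total = sum(counts.values())
        # Add-one smoothing keeps unseen n-grams from zeroing out the score.
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy, hypothetical training data; real systems need far larger corpora.
toy_samples = {
    "en": ["the quick brown fox", "this is a simple sentence", "where are you going"],
    "es": ["el rápido zorro marrón", "esta es una frase sencilla", "adónde vas tú"],
}
toy_models = train(toy_samples)
```

Calling `identify("this is the fox", toy_models)` picks the English model on this toy data. A model this simple is exactly the kind that struggles with code-mixed or phonetically typed input, which is the gap the paper's refinement strategies aim at.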

In today's world, there is no shortage of information. However, for a specific information need, only a small subset of the available information will be useful. The field of information retrieval (IR) is the study of methods to provide users with that small subset of information relevant to their needs, and to do so in a timely fashion. Information sources can take many forms, but this thesis focuses on text-based information systems and investigates problems germane to the retrieval of written natural language documents. Central to these problems is the notion of topic: what are documents about? However, topics depend on the semantics of documents, and retrieval systems are not endowed with knowledge of the semantics of natural language. The approach taken in this thesis is to use probabilistic language models to investigate text-based information retrieval and related problems. One such problem is the prediction of topic shifts in text, the topic segmentation problem. It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection. Two complementary sets of features are studied individually and then combined into a single language model. The language modeling approach allows this problem to be addressed in a principled way without complex semantic modeling. Next, the problem of document retrieval in response to a user query is investigated. Models of document indexing and document retrieval have been extensively studied over the past three decades. Integrating these two classes of models has been the goal of several researchers, but it is a very difficult problem, largely because the indexing component requires inferences about the semantics of documents. Instead, an approach to retrieval based on probabilistic language modeling is presented. Models are estimated for each document individually. The approach to modeling is non-parametric and integrates the entire retrieval process into a single model. One advantage of this approach is that collection statistics, which are used heuristically for the assignment of concept probabilities in other probabilistic models, are used directly in the estimation of language model probabilities. The language modeling approach has been implemented and tested empirically, and it performs very well on standard test collections and query sets. To improve retrieval effectiveness, IR systems use additional techniques such as relevance feedback, unsupervised query expansion and structured queries. These and other techniques are discussed in terms of the language modeling approach, and empirical results are given for several of the techniques developed. These results provide further proof of concept for the use of language models for retrieval tasks.
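The retrieval idea summarized above (one language model per document, with collection statistics feeding directly into the probability estimates) can be sketched as query-likelihood ranking. The sketch below is only an illustration: it uses Jelinek-Mercer interpolation, a common smoothing choice that is not necessarily the estimator used in the thesis, and the function name `rank_by_query_likelihood`, the parameter `lam`, the whitespace tokenization and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def rank_by_query_likelihood(query, docs, lam=0.5):
    """Return document indices ordered by log P(query | document model).

    Each document model is a unigram distribution interpolated with the
    collection distribution (Jelinek-Mercer smoothing, an assumed choice).
    """
    doc_tokens = [doc.lower().split() for doc in docs]
    doc_counts = [Counter(tokens) for tokens in doc_tokens]

    # Collection statistics are used directly in the probability estimate.
    coll_counts = Counter()
    for counts in doc_counts:
        coll_counts.update(counts)
    coll_len = sum(coll_counts.values())

    scores = []
    for idx, (tokens, counts) in enumerate(zip(doc_tokens, doc_counts)):
        log_p = 0.0
        for term in query.lower().split():
            p_coll = coll_counts[term] / coll_len if coll_len else 0.0
            if p_coll == 0.0:
                # Assumption: terms absent from the whole collection are
                # skipped, since they cannot distinguish between documents.
                continue
            p_doc = counts[term] / len(tokens) if tokens else 0.0
            log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
        scores.append((log_p, idx))

    # Highest score first; ties fall back to the original document order.
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Toy, hypothetical collection for illustration only.
toy_docs = [
    "the cat sat on the mat",
    "dogs chase the ball",
    "language models rank documents",
]
```

With this toy collection, `rank_by_query_likelihood("cat mat", toy_docs)` places the first document at the top, because it is the only one whose own model assigns non-zero probability to the query terms. The interpolation weight `lam` controls how much a document's own counts matter relative to the collection as a whole.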

Articles published on Probabilistic Language Model

GDTM: Graph-based Dynamic Topic Models

Visualization Classification and Prediction Based on Data Mining

Bayonet-corpus: a trajectory prediction method based on bayonet context and bidirectional GRU

Generalizing Long Short-Term Memory Network for Deep Learning from Generic Data

Model design for grammatical error identification in software requirements specification using statistics and rule-based techniques

Deep learning-based techniques to enhance the precision of phrase-based statistical machine translation system for Indian languages

Semantic Entropy in Language Comprehension

Evaluating information-theoretic measures of word prediction in naturalistic sentence reading

Detected text‐based image retrieval approach for textual images

CaptionNet: Automatic End-to-End Siamese Difference Captioning Model With Attention

Optimizing Automatic Evaluation of Machine Translation with the ListMLE Approach

Influence of social conversational features on language identification in highly multilingual online conversations

Automated Item Generation with Recurrent Neural Networks.

Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments.

A Language Modeling Approach to Information Retrieval

Using stochastic language models (SLM) to map lexical, syntactic, and phonological information processing in the brain.

Word predictability and semantic similarity show distinct patterns of brain activity during language comprehension

Parallel Sentiment Analysis with Storm

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos
