Statistical language modeling comprises techniques and procedures that assign probabilities to word sequences or, in other words, estimate the regularities of a language. This paper presents the basic characteristics of statistical language models, reviews their use in a wide range of speech and language applications, gives their formal definition and describes different types of language models. A detailed overview of n-gram and class-based models (as well as their combinations) is given chronologically, by model type and complexity, and with respect to their use in different NLP applications for different natural languages. The proposed experimental procedure compares three types of statistical language models: n-gram models based on words, categorical models based on automatically determined categories and categorical models based on POS tags. In the paper, we propose a language model for contemporary Croatian texts and a procedure for determining the best n-gram order and the optimal number of categories, which leads to a significant decrease in language model perplexity, estimated on the corpus of Croatian News Agency (HINA) articles. Using different language models estimated from the HINA corpus, we show experimentally that category-based models describe the natural language better than word-based models. The findings of the proposed experiment are applicable not only to Croatian but also to similar highly inflective languages with rich morphology and free sentence word order.

DOI: http://dx.doi.org/10.5755/j01.itc.46.4.18367
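As a minimal illustration of the two quantities the abstract refers to (n-gram probability estimation and perplexity), the following sketch trains a word bigram model and scores a test sequence. It is not the paper's method; the toy corpus, add-one smoothing and function names are illustrative assumptions only.

```python
import math
from collections import Counter

def train_bigram(tokens, alpha=1.0):
    """Estimate add-alpha smoothed bigram probabilities P(w_i | w_{i-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(prev, word):
        # Add-alpha smoothing keeps unseen bigrams from getting zero probability.
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

    return prob

def perplexity(prob, tokens):
    """Perplexity = 2 ** (average negative log2 probability per predicted token)."""
    log_sum = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        log_sum += -math.log2(prob(prev, word))
    return 2 ** (log_sum / (len(tokens) - 1))

# Hypothetical toy training/test data; lower perplexity means a better model fit.
train_tokens = "the cat sat on the mat the cat ate".split()
test_tokens = "the cat sat on the mat".split()
model = train_bigram(train_tokens)
print(perplexity(model, test_tokens))
```

Class-based (categorical) models differ only in that probabilities are estimated over word categories (e.g. automatically induced classes or POS tags) and combined with per-class word emission probabilities, which reduces data sparsity for inflectional languages.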