N-gram Model Research Articles

Research on hybrid architectures for Natural Language Processing (NLP) has to navigate several intellectual traditions, including computer science, formal linguistics, logic, digital humanities, and ethics. NLP, a subfield of computer science and artificial intelligence, is concerned with interactions between computers and human (natural) languages. It applies machine learning algorithms to text and speech in order to build systems for machine translation (converting text in a source language into text in a target language), document summarization (condensing long texts into short ones), named entity recognition, predictive typing, and so on. NLP is now embedded in daily life: automatic Machine Translation (MT) is omnipresent on social media and the wider web, virtual assistants (Siri, Cortana, Alexa, and so on) recognize natural speech, and e-mail services use detection systems to filter out spam. The purpose of this paper, however, is to outline linguistic and NLP methods of text processing. To that end, the bag-of-n-grams concept is discussed as an approach to extracting more detail from textual data by treating a string as groups of adjacent words. The n-gram language model presented in this paper, which assigns probabilities to sequences of words in text corpora, is based on findings compiled in Sketch Engine and on samples of language data processed with the NLTK library for Python. Why would one want to compute the probability of a word sequence? In many task-oriented systems the goal is to generate more fluent text, which requires a component that computes the probability of the output text. The idea is to collect information on how frequently n-grams occur in a large text corpus and to use it to predict the next word.
Counting occurrences also has drawbacks, for instance sparsity and storage problems. Nonetheless, the language models and specific computing ‘recipes’ described in this paper can be used in many applications, such as machine translation, summarization, and dialogue systems. Lastly, this paper is part of ongoing work tentatively termed LADDER (Linguistic Analysis of Data in the Digital Era of Research), which touches upon the process of datacization[1] that might help to create an intelligent system of interdisciplinary information.
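
The counting idea in the abstract above can be illustrated with a minimal bigram sketch. This is not the paper's code: it uses only the Python standard library rather than NLTK or Sketch Engine, and the function names (train_bigrams, bigram_prob, predict_next) and the three-sentence corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark sentence boundaries (an illustrative convention).
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def bigram_prob(counts, prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev, *)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0

def predict_next(counts, prev):
    """Return the most frequent follower of prev, or None if prev is unseen."""
    followers = counts.get(prev)
    return followers.most_common(1)[0][0] if followers else None

# Toy corpus, invented for this example only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigrams(corpus)

print(predict_next(model, "the"))         # most frequent word after "the"
print(bigram_prob(model, "the", "cat"))   # 2 of the 6 bigrams starting with "the"
print(bigram_prob(model, "cat", "fish"))  # unseen bigram, so the estimate is 0.0
```

The last call shows the sparsity drawback the abstract mentions: any bigram absent from the corpus gets probability zero unless some form of smoothing is added.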

The performance of most error-correction (EC) algorithms that operate on genomic reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We do this adaptively, per dataset and per EC tool, because we observe that the optimal configuration parameters differ across datasets (i.e., across platforms and species) and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search that converges toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with very high accuracy for 7 real datasets, using 3 different k-mer based EC algorithms: Lighter, Blue, and Racer.
The inverse relation between the perplexity metric and alignment rate holds under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute-force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and within the top-5 best k values using RNN models. With the best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
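
The two ingredients named in the abstract above, a perplexity score from an n-gram language model and a hill-climbing search over k, can be sketched roughly as follows. This is not Athena's implementation. The character-trigram model with add-one smoothing, the function names, and the toy reads are all assumptions. In a real setting the score callback would presumably run an EC tool with a given k and return the perplexity of the corrected reads, which is not attempted here.

```python
import math
from collections import Counter

def train_char_ngrams(reads, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            ngrams[read[i:i + n]] += 1
            contexts[read[i:i + n - 1]] += 1
    return ngrams, contexts

def perplexity(model, reads, n=3, vocab_size=4):
    """Perplexity with add-one smoothing; vocab_size=4 assumes the A/C/G/T alphabet."""
    ngrams, contexts = model
    log_sum, count = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
            log_sum += math.log(p)
            count += 1
    return math.exp(-log_sum / count) if count else float("inf")

def hill_climb(score, start, lo, hi):
    """Step k by one while a neighbour has a strictly lower score; may stop at a local minimum."""
    cache = {}

    def cached(k):
        # Each score may be expensive (e.g. a full EC run), so evaluate it once.
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    k = start
    while True:
        neighbours = [c for c in (k - 1, k + 1) if lo <= c <= hi]
        if not neighbours:
            return k
        best = min(neighbours, key=cached)
        if cached(best) >= cached(k):
            return k
        k = best

# Toy reads, invented for illustration: the noisy read should look less "fluent".
model = train_char_ngrams(["ACGTACGTACGT"])
clean_ppl = perplexity(model, ["ACGTACGT"])
noisy_ppl = perplexity(model, ["ACCTAGGT"])
print(clean_ppl < noisy_ppl)

# Stand-in score curve (hypothetical); it only exercises the search loop.
def toy_score(k):
    return (k - 17) ** 2 + 5

print(hill_climb(toy_score, start=10, lo=5, hi=31))
```

Because lower perplexity tracks better correction, the search minimizes the score. Caching matters in practice because each evaluation would be costly.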

Articles published on N-gram Model

NLP ‘RECIPES’ FOR TEXT CORPORA: APPROACHES TO COMPUTING THE PROBABILITY OF A SEQUENCE OF TOKENS

Aspect-Based Sentiment Analysis and Emotion Detection for Code-Mixed Review

A Sentiment Analysis Method of Capsule Network Based on BiLSTM

N-gram based Machine Translation for English-Assamese: Two Languages with High Syntactical Dissimilarity

Automated Misspelling Detection and Correction in Persian Clinical Text.

I Say, You Say, We Say

Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models

Language Identification for Multilingual Sentiment Examination

MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Evaluating Computational Language Models with Scaling Properties of Natural Language

A Higher-Order N-gram Model to Enhance Automatic Word Prediction for Assamese Sentences Containing Ambiguous Words

A distributed system for large-scale n-gram language models at Tencent

A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth

The effects and preventability of 2627 patient safety incidents related to health information technology failures: a retrospective analysis of 10 years of incident reporting in England and Wales

Recurrent neural network with attention mechanism for language model

Literature Review of Sentiment Analysis Techniques for Microblogging Site

Hybrid N-gram model using Naïve Bayes for classification of political sentiments on Twitter

Automatic Dating of Medieval Charters from Denmark

Melodic patterns and tonal cadences: Bayesian learning of cadential categories from contrapuntal information

A Review towards the Sentiment Analysis Techniques for the Analysis of Twitter Data
