N-gram Language Models Research Articles

How do humans learn the regularities of their complex, noisy world in a robust manner? There is ample evidence that much of this learning and development occurs in an unsupervised fashion via interactions with the environment. Both the structure of the world and that of the brain appear hierarchical in a number of ways, and structured hierarchical representations offer potential benefits for efficient learning and organization of knowledge, such as concepts (patterns) sharing parts (subpatterns), and for providing a foundation for symbolic computation and language. A major question arises: what drives the processes behind acquiring such hierarchical spatiotemporal concepts? We posit that the goal of advancing one's predictions is a major driver for learning such hierarchies and introduce an information-theoretic score that shows promise in guiding these processes and, in particular, in motivating the learner to build larger concepts. We have been exploring the challenges of building an integrated learning and developing system within the framework of prediction games, wherein concepts serve as (1) predictors, (2) targets of prediction, and (3) building blocks for future higher-level concepts. Our current implementation works on raw text: it begins at a low level, such as characters, which are the hardwired or primitive concepts, and grows its vocabulary of networked hierarchical concepts over time. Concepts are strings or n-grams in our current realization, but we hope to relax this limitation, e.g., to a larger subclass of finite automata. After an overview of the current system, we focus on the score, named CORE. CORE is based on comparing the prediction performance of the system with that of a simple baseline system that is limited to predicting with the primitives. CORE incorporates a tradeoff between how strongly a concept is predicted (or how well it fits its context, i.e., nearby predicted concepts) and how well it matches the (ground) "reality," i.e., the lowest-level observations (the characters in the input episode). CORE is applicable to generative models such as probabilistic finite state machines (beyond strings). We highlight a few properties of CORE with examples. The learning is scalable and open-ended. For instance, thousands of concepts are learned after hundreds of thousands of episodes. We give examples of what is learned, and we also empirically compare with transformer neural networks and n-gram language models to situate the current implementation with respect to the state of the art and to further illustrate the similarities and differences with existing techniques. We touch on a variety of challenges and promising future directions in advancing the approach, in particular, the challenge of learning concepts with a more sophisticated structure.
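The abstract does not give the definition of CORE, so the following is not CORE itself. It is only a minimal sketch of the general idea of scoring a predictor against a baseline that is limited to primitives: a character unigram model stands in for the baseline, a character bigram model stands in for a system that uses context, and the difference in code length (in bits) on one episode is reported. The function names, the add-one smoothing, and the toy text are all assumptions made for illustration.

```python
import math
from collections import Counter

def unigram_bits(text, train, vocab):
    """Code length in bits of `text` under an add-one-smoothed character unigram model."""
    counts = Counter(train)
    total = len(train) + len(vocab)
    return -sum(math.log2((counts[ch] + 1) / total) for ch in text)

def bigram_bits(text, train, vocab):
    """Code length in bits under an add-one-smoothed character bigram model.
    The first character has no context, so it is scored by the unigram model."""
    pair_counts = Counter(zip(train, train[1:]))
    ctx_counts = Counter(train[:-1])
    bits = unigram_bits(text[:1], train, vocab)
    for prev, ch in zip(text, text[1:]):
        p = (pair_counts[(prev, ch)] + 1) / (ctx_counts[prev] + len(vocab))
        bits -= math.log2(p)
    return bits

# Hypothetical toy data, not taken from the article.
train = "the cat sat on the mat. the cat ate the rat. " * 20
episode = "the cat sat on the rat."
vocab = set(train) | set(episode)

baseline_bits = unigram_bits(episode, train, vocab)  # primitives only
model_bits = bigram_bits(episode, train, vocab)      # uses one character of context
gain = baseline_bits - model_bits                    # bits saved relative to the baseline
print(f"baseline={baseline_bits:.1f} bits, model={model_bits:.1f} bits, gain={gain:.1f} bits")
```

A positive gain means the context-aware model encodes the episode in fewer bits than the primitive-only baseline. The score described in the abstract additionally trades off how strongly a concept is predicted against how well it matches the observed characters, which this sketch does not attempt to model.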

Medical imaging is critical in clinical practice, and high-value radiological reports can positively assist clinicians. However, there is a lack of methods for determining the value of reports. The purpose of this study was to establish an ensemble learning classification model using natural language processing (NLP) applied to the Chinese free text of radiological reports to determine their value for liver lesion detection in patients with colorectal cancer (CRC). Radiological reports of upper abdominal computed tomography (CT) and magnetic resonance imaging (MRI) were divided into five categories according to the results of liver lesion detection in patients with CRC. NLP methods, including word segmentation, stop word removal, and n-gram language model construction, were applied to each dataset. Then, a bag-of-words model was built, high-frequency words were selected as features, and an ensemble learning classification model was constructed. Several machine learning methods were applied, including logistic regression (LR) and random forest (RF), among others. We compared the accuracy of a priori selected word-string patterns with that of our machine learning methods. The dataset of 2790 patients included CT without contrast (10.2%), CT with/without contrast (73.3%), MRI without contrast (1.8%), and MRI with/without contrast (14.6%). The ensemble learning classification model determined the value of reports effectively, reaching an accuracy of 95.91% on the CT with/without contrast dataset using XGBoost. Logistic regression, random forest, and support vector machine also achieved good classification accuracy, reaching 95.89%, 95.04%, and 95.00%, respectively. The results of XGBoost were visualized using a confusion matrix. The numbers of errors in categories I, II, and V were very small. ELI5 was used to select important words for each category. Words such as "no abnormality", "suggest", "fatty liver", and "transfer" showed a relatively large degree of positive correlation with classification accuracy. The accuracy of the string pattern search method was lower than that of the machine learning models. The NLP-based learning classification model was an effective tool for determining the value of radiological reports focused on liver lesions. The study made it possible to analyze the value of medical imaging examinations on a large scale.
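The feature-extraction steps named in the abstract (stop word removal, n-gram construction, a bag-of-words model, and selection of high-frequency terms) can be sketched roughly as follows. This is an illustrative outline only, not the study's code: the function names, the `top_k` cutoff, and the English toy tokens are assumptions, and the documents are assumed to be already segmented into words by an external tool.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(docs, stop_words, max_n=2, top_k=5):
    """Build count vectors over the top_k most frequent 1..max_n-grams."""
    # Stop word removal on pre-segmented documents.
    cleaned = [[t for t in doc if t not in stop_words] for doc in docs]
    # All n-grams of each document, for n = 1..max_n.
    doc_grams = [
        [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
        for tokens in cleaned
    ]
    # High-frequency terms across the whole corpus become the features.
    freq = Counter(g for grams in doc_grams for g in grams)
    vocab = [g for g, _ in freq.most_common(top_k)]
    # Bag-of-words count vector for each document.
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        vectors.append([counts[g] for g in vocab])
    return vocab, vectors

# Hypothetical, pre-segmented toy reports (not data from the study).
docs = [
    ["liver", "no", "abnormality", "was", "seen"],
    ["fatty", "liver", "was", "suggested"],
    ["liver", "no", "abnormality"],
]
vocab, vectors = featurize(docs, stop_words={"was"})
print(vocab)
print(vectors)
```

The resulting count vectors are the kind of input that classifiers such as logistic regression, random forest, or XGBoost would then be trained on.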

Articles published on N-gram Language Models

Automatic Semantic Annotation of Indonesian Language Phrase Using N-Gram Language Model

English grammar intelligent error correction technology based on the n-gram language model

Arithmetic N-gram: an efficient data compression technique

A search tool based on language modelling developed for The Index of Middle English Prose

N-gram Language Model for Chinese Function-word-centered Patterns

Speech Recognition Enhancement and Compression Perception in Russian Translation Teaching Cooperative System Application

Intelligent Translation System Aiding High-Quality Writing in English in the Age of the Internet

Research on the Influence of Socialist Core Value System Construction on College Students’ Ideological and Political Education in the Context of Deep Learning

Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling

Training RNN language models on uncertain ASR hypotheses in limited data scenarios

High-order interaction feature selection for classification learning: A robust knowledge metric perspective

Self-correction of automatic speech recognition

An information theoretic score for learning hierarchical concepts.

Kurdish Kurmanji Lemmatization and Spell-checker with Spell-correction

An intelligent extension of the training set for the Persian n-gram language model: an enrichment algorithm

An End-to-End Transformer-Based Automatic Speech Recognition for Qur’an Reciters

Forward-backward Transliteration of Punjabi Gurmukhi Script Using N-gram Language Model

Central Kurdish Automatic Speech Recognition using Deep Learning

Using a classification model for determining the value of liver radiological reports of patients with colorectal cancer.
