What does a language model know about proteins?


Similar Papers
  • Research Article
  • Cited by: 3
  • 10.14483/23448393.11616
Acoustic and Language Model of the Spanish Language for the Cucuteño Dialect, Oriented to Automatic Speech Recognition (Modelo Acústico y de Lenguaje del Idioma Español para el dialecto Cucuteño, Orientado al Reconocimiento Automático del Habla)
  • Sep 12, 2017
  • Ingeniería
  • Juan David Celis Nuñez + 4 more

Context: Automatic speech recognition requires the development of language and acoustic models for the various existing dialects. The purpose of this research is the training of an acoustic model, a statistical language model, and a grammar language model for Spanish, specifically for the dialect of the city of San Jose de Cucuta, Colombia, for use in a command-and-control system. Existing models for Spanish have problems recognizing the fundamental frequency and spectral content, the accent, pronunciation, and tone of Cucuta's dialect, or simply lack a language model for it.

Method: In this project, we used a Raspberry Pi B+ embedded system running Raspbian (a Linux distribution) and two open-source packages: the CMU-Cambridge Statistical Language Modeling Toolkit from the University of Cambridge and CMU Sphinx from Carnegie Mellon University, both of which are based on Hidden Markov Models for the calculation of voice parameters. In addition, we used 1,913 audio recordings of speakers from San Jose de Cucuta and the Norte de Santander department for training and testing the automatic speech recognition system.

Results: We obtained a language model consisting of two files: the statistical language model (.lm) and the JSGF grammar model (.jsgf). On the acoustic side, two models were trained, one of them an improved version that achieved a 100% accuracy rate in training and an 83% accuracy rate in audio tests for command recognition. Finally, we wrote a manual for creating acoustic and language models with the CMU Sphinx software.

Conclusions: The number of participants in the training of the language and acoustic models has a significant influence on the quality of the recognizer's voice processing. Using a large dictionary for training and a short dictionary containing only the command words for deployment is important for a better response from the automatic speech recognition system. Given the accuracy rate above 80% in the voice recognition tests, the proposed models are suitable for applications oriented to assisting people with visual or motor impairments.
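
As an aside on the .jsgf file the authors mention: a JSGF grammar enumerates the legal command phrases for the recognizer. Below is a minimal sketch of such a grammar written out from Python; the Spanish command words are hypothetical stand-ins, not the study's actual vocabulary.

```python
# Minimal sketch: write a JSGF command grammar of the kind CMU Sphinx
# loads alongside a statistical (.lm) model. The Spanish command words
# below are hypothetical examples, not the study's actual vocabulary.

JSGF_GRAMMAR = """#JSGF V1.0;
grammar comandos;

// A command is an action followed by a target device.
public <comando> = <accion> <objeto>;
<accion> = encender | apagar | subir | bajar;
<objeto> = luz | ventilador | puerta | volumen;
"""

with open("comandos.jsgf", "w", encoding="utf-8") as f:
    f.write(JSGF_GRAMMAR)
print("Wrote comandos.jsgf")
```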

  • Conference Article
  • Cited by: 6
  • 10.21437/interspeech.2004-488
Statistical feature language model
  • Oct 4, 2004
  • Salma Jamoussi + 3 more

Statistical language models are widely used in automatic speech recognition in order to constrain the decoding of a sentence. Most of these models derive from the classical n-gram paradigm. However, the production of a word depends on a large set of linguistic features: lexical, syntactic, semantic, etc. Moreover, in some natural languages the gender and number of the left context affect the production of the next word. It therefore seems attractive to design a language model based on a variety of word features. We present in this paper a new statistical language model, called the Statistical Feature Language Model (SFLM), based on this idea. In SFLM a word is considered as an array of linguistic features, and the model is defined in a way similar to the n-gram model. Experiments carried out on French show an improvement in terms of perplexity and predicted words.
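
To make the SFLM idea concrete, here is a minimal sketch in which each word is a tuple of linguistic features and a bigram model is estimated over those tuples; the feature set and the toy French fragment are my own illustrative assumptions, not the paper's.

```python
# Sketch of the SFLM idea: each word is an array of linguistic features
# (surface form, POS, gender, number), and an n-gram model is estimated
# over those feature tuples instead of bare words.
from collections import defaultdict

# (surface, POS, gender, number) -- a toy French fragment
corpus = [
    ("la", "DET", "fem", "sg"), ("petite", "ADJ", "fem", "sg"),
    ("fille", "NOUN", "fem", "sg"), ("mange", "VERB", None, "sg"),
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for prev, cur in zip(corpus, corpus[1:]):
    bigram_counts[(prev, cur)] += 1
    unigram_counts[prev] += 1

def p_feature_bigram(prev, cur):
    """P(cur features | prev features), maximum-likelihood estimate."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(p_feature_bigram(corpus[0], corpus[1]))  # P(petite/ADJ/fem/sg | la/DET/fem/sg)
```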

  • Book Chapter
  • 10.1007/978-981-10-6496-8_27
Microblog Search Method Based on Neural Network Language Model
  • Sep 21, 2017
  • Jincai Lai + 3 more

Deep neural network language models have seen significant development in natural language processing (NLP) in recent years. In this paper, we focus on using a neural network language model (NNLM) to enhance microblog search, and propose a microblog search method based on a neural network language model (NBSM). First, we train a neural network language model on microblog data to obtain distributed representations of words that capture the internal expression patterns of microblogs. Then, we use these distributed representations to find expansion words for users' search terms. Finally, we re-rank the microblog search results by combining deep semantic text similarity with social signal features. The proposed method can effectively capture microblog expression patterns, and its search results reflect the social hot topics related to users' search terms. Experimental results show that the proposed method yields significant improvements over state-of-the-art methods and significantly improves the user's search experience.
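
A minimal sketch of the query-expansion step described above: nearest neighbours in embedding space stand in for expansion words. The tiny embedding table is invented for illustration; in the paper the vectors come from an NNLM trained on microblog data.

```python
# Sketch: use distributed word representations to find the nearest
# neighbours of a user's search term, which serve as expansion words.
import numpy as np

embeddings = {  # made-up stand-in for vectors trained on microblog data
    "earthquake": np.array([0.9, 0.1, 0.0]),
    "quake":      np.array([0.85, 0.15, 0.05]),
    "tremor":     np.array([0.8, 0.2, 0.1]),
    "recipe":     np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand(query, k=2):
    """Return the k terms closest to the query in embedding space."""
    scored = [(w, cosine(embeddings[query], v))
              for w, v in embeddings.items() if w != query]
    return sorted(scored, key=lambda x: -x[1])[:k]

print(expand("earthquake"))  # e.g. [('quake', ...), ('tremor', ...)]
```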

  • Conference Article
  • Cited by: 1
  • 10.1109/icecta.2017.8251935
Exploring the language modeling toolkits for Arabic text
  • Nov 1, 2017
  • Fawaz S Al-Anzi + 1 more

Statistical n-gram language models (LMs) have proven to be very effective in natural language processing (NLP), particularly in automatic speech recognition (ASR) and machine translation. Indeed, the success of LMs has prompted the introduction of efficient techniques as well as different model types in various linguistic applications. LMs come mainly in two types: grammars and statistical language models, the latter also called n-grams. The main difference between them is that statistical language models are based on estimating probabilities for word sequences, while grammars usually do not have probabilities. Although many toolkits can be used to create LMs, this work employs two well-known language modeling toolkits with a focus on Arabic text: the Carnegie Mellon University (CMU)-Cambridge Statistical Language Modeling Toolkit and the Cambridge University Hidden Markov Model Toolkit (HTK). For clarification, we used a small Arabic text corpus to compute 1-gram, 2-gram, and 3-gram models. In addition, this paper demonstrates the intermediate steps needed to generate ARPA-format LMs using both toolkits.
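
For readers unfamiliar with the ARPA layout both toolkits emit, here is a minimal sketch that counts 1-, 2-, and 3-grams over a toy corpus and prints them in that format; it uses unsmoothed maximum-likelihood estimates, whereas the real toolkits apply discounting and back-off.

```python
# Sketch: count 1/2/3-grams and emit them in ARPA layout.
import math
from collections import Counter

sentences = [["<s>", "the", "cat", "sat", "</s>"],
             ["<s>", "the", "dog", "sat", "</s>"]]

counts = {n: Counter() for n in (1, 2, 3)}
for sent in sentences:
    for n in (1, 2, 3):
        for i in range(len(sent) - n + 1):
            counts[n][tuple(sent[i:i + n])] += 1

total_unigrams = sum(counts[1].values())

def logprob(ngram):
    """Unsmoothed ML log10 probability of an n-gram given its history."""
    if len(ngram) == 1:
        return math.log10(counts[1][ngram] / total_unigrams)
    hist = ngram[:-1]
    return math.log10(counts[len(ngram)][ngram] / counts[len(ngram) - 1][hist])

print("\\data\\")
for n in (1, 2, 3):
    print(f"ngram {n}={len(counts[n])}")
for n in (1, 2, 3):
    print(f"\n\\{n}-grams:")
    for ngram in sorted(counts[n]):
        print(f"{logprob(ngram):.4f}\t{' '.join(ngram)}")
print("\n\\end\\")
```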

  • Research Article
  • Cited by: 43
  • 10.1111/epi.17570
Are AI language models such as ChatGPT ready to improve the care of individuals with epilepsy?
  • Mar 13, 2023
  • Epilepsia
  • Christian M Boßelmann + 2 more

  • Research Article
  • 10.1145/2422256.2422274
Improving the effectiveness of language modeling approaches to information retrieval
  • Dec 21, 2012
  • ACM SIGIR Forum
  • Yuanhua Lv

  • Conference Article
  • Cited by: 32
  • 10.1109/slt.2018.8639699
Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
  • Dec 1, 2018
  • Jesse Emond + 4 more

Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.
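
A minimal sketch of the toWER idea: transliterate both reference and hypothesis into one writing system before computing the usual edit-distance WER, so rendering differences stop counting as errors. The transliteration table below is a toy stand-in for a real transliteration model.

```python
# Sketch: transliteration-optimized WER (toWER) via a toy romanizer.
def romanize(word):
    table = {"नमस्ते": "namaste", "दुनिया": "duniya"}  # hypothetical mapping
    return table.get(word, word)

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def to_wer(ref, hyp):
    ref = [romanize(w) for w in ref]
    hyp = [romanize(w) for w in hyp]
    return edit_distance(ref, hyp) / len(ref)

# Same utterance in two writing systems: toWER = 0.0, plain WER would be 1.0.
print(to_wer(["namaste", "duniya"], ["नमस्ते", "दुनिया"]))
```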

  • Research Article
  • Cited by: 2
  • 10.11591/ijece.v10i2.pp2102-2109
Improving the role of language model in statistical machine translation (Indonesian-Javanese)
  • Apr 1, 2020
  • International Journal of Electrical and Computer Engineering (IJECE)
  • Herry Sujaini

Statistical machine translation (SMT) has been widely used by researchers and practitioners in recent years. The quality of SMT output is determined by several important factors, two of which are the language model and the translation model. Research on improving the translation model has been done quite extensively, but the problem of optimizing the language model for use in machine translation has not received much attention. Machine translation systems usually adopt trigram language models as the standard. In this paper, we conducted experiments with four strategies to analyze the role of the language model in an Indonesian-Javanese translation system and show improvement over the baseline system with the standard language model. The results of this research indicate that the use of 3-gram language models is highly recommended in SMT.
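
A minimal sketch of the kind of comparison behind that recommendation: measure held-out perplexity for several n-gram orders. The toy Indonesian corpus and add-one smoothing are illustrative assumptions, chosen only to keep the example finite.

```python
# Sketch: compare n-gram orders by held-out perplexity.
import math
from collections import Counter

train = "saya makan nasi goreng saya makan mie goreng".split()
heldout = "saya makan nasi".split()
V = len(set(train))  # vocabulary size

def perplexity(n):
    grams = Counter(tuple(train[i:i + n]) for i in range(len(train) - n + 1))
    hists = Counter()  # history counts, derived from the n-gram counts
    for g, c in grams.items():
        hists[g[:-1]] += c
    logp, count = 0.0, 0
    for i in range(n - 1, len(heldout)):
        g = tuple(heldout[i - n + 1:i + 1])
        p = (grams[g] + 1) / (hists[g[:-1]] + V)  # add-one smoothing
        logp += math.log2(p)
        count += 1
    return 2 ** (-logp / count)

for n in (1, 2, 3):
    print(f"{n}-gram perplexity: {perplexity(n):.2f}")
```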

  • Conference Article
  • Cited by: 1
  • 10.1109/icassp.2019.8682606
A Unified Framework for Feature-based Domain Adaptation of Neural Network Language Models
  • May 1, 2019
  • Michael Hentschel + 4 more

An important task for language models is the adaptation of general-domain models to specific target domains. For neural network-based language models, feature-based domain adaptation has been a popular method in previous research. Conventional methods use an adaptation feature providing context information that is calculated from a topic model. However, such a topic model needs to be trained separately from the language model. To unify the language and context model training, we present an approach that combines an extractor network and a domain adaptation layer. The extractor network learns a context representation from a fixed-size window of past words and provides the context information for the adaptation layer. The benefit of our method is that the extractor network can be trained jointly with the language model in a single training step. Our proposed method showed superior performance over conventional domain adaptation with topic features on a dataset of TED talks with respect to perplexity and word error rate after 100-best rescoring.
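
A minimal sketch of one plausible instantiation of this architecture (my assumption, in PyTorch, not the authors' exact network): an extractor summarizes a fixed-size window of past words into a context vector, and an adaptation layer gates the LM's hidden state with it, so a single backward pass trains the language model and the context model jointly.

```python
# Sketch: extractor network + adaptation layer trained jointly with the LM.
import torch
import torch.nn as nn

class AdaptedLM(nn.Module):
    def __init__(self, vocab, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        # Extractor: bag-of-embeddings over the past-word window -> context
        self.extractor = nn.Sequential(nn.Linear(emb, hidden), nn.Tanh())
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens, past_window):
        h, _ = self.lstm(self.embed(tokens))                       # (B, T, hidden)
        ctx = self.extractor(self.embed(past_window).mean(dim=1))  # (B, hidden)
        h = h * torch.sigmoid(ctx).unsqueeze(1)  # adaptation layer: gate hidden state
        return self.out(h)                       # next-word logits

model = AdaptedLM(vocab=1000)
tokens = torch.randint(0, 1000, (2, 12))   # current word sequence
window = torch.randint(0, 1000, (2, 10))   # fixed-size window of past words
logits = model(tokens, window)
print(logits.shape)  # torch.Size([2, 12, 1000])
```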

  • Research Article
  • Cited by: 2
  • 10.28932/jutisi.v6i2.2684
Building Acoustic and Language Model for Continuous Speech Recognition in Bahasa Indonesia
  • Aug 10, 2020
  • Jurnal Teknik Informatika dan Sistem Informasi
  • Andreas Widjaja + 1 more

Here, the development of an acoustic and language model is presented. A low Word Error Rate is an early sign of a good language and acoustic model. Although there are parameters other than Word Error Rate, our work focused on building a Bahasa Indonesia model with approximately 2,000 common words and achieving the minimum threshold of 25% Word Error Rate. Several experiments were conducted with different cases, training data, and testing data, with Word Error Rate and Testing Ratio as the main points of comparison. The language and acoustic models were built using Sphinx4 from Carnegie Mellon University, with a Hidden Markov Model for the acoustic model and an ARPA model for the language model. The model configuration parameters, Beam Width and Force Alignment, directly correlate with Word Error Rate; they were set to 1e-80 for Beam Width and 1e-60 for Force Alignment to prevent underfitting or overfitting of the acoustic model. The goals of this research are to build continuous speech recognition for Bahasa Indonesia with a low Word Error Rate and to determine the optimal amounts of training and testing data that minimize it.

  • Conference Article
  • Cited by: 5
  • 10.3115/1075168.1075172
The state of the art in language modeling
  • Jan 1, 2003
  • Joshua Goodman

This tutorial will cover the state of the art in language modeling. Language models give the probability of word sequences, e.g. "recognize speech" is much more probable than "wreck a nice beach". While most widely known for their use in speech recognition, language models are useful in a large number of areas, including information retrieval, machine translation, handwriting recognition, context-sensitive spelling correction, and text entry for Chinese and Japanese or on small input devices. Many language modeling techniques can be applied to other areas or to modeling any discrete sequence. This tutorial should be accessible to anyone with a basic knowledge of probability.

The most basic language models, n-gram models, essentially just count occurrences of words in training data. I will describe five relatively simple improvements over this baseline: smoothing, caching, skipping, sentence-mixture models, and clustering. I will talk a bit about the applications of language modeling, then quickly describe other recent promising work and available tools and resources. I will begin by describing conventional-style language modeling techniques.

• Smoothing addresses the problem of data sparsity: there is rarely enough data to accurately estimate the parameters of a language model. Smoothing gives a way to combine less specific, more accurate information with more specific, but noisier, data. I will describe two classic techniques, deleted interpolation and Katz (or Good-Turing) smoothing, and one recent technique, Modified Kneser-Ney smoothing, which is the best known.

• Caching is a widely used technique based on the observation that recently observed words are likely to occur again. Models from recently observed data can be combined with more general models to improve performance.

• Skipping models use the observation that even words that are not directly adjacent to the target word contain useful information.

• Sentence-mixture models use the observation that there are many different kinds of sentences. By modeling each sentence type separately, performance is improved.

• Clustering is one of the most useful language modeling techniques. Words can be grouped into clusters through various automatic techniques; the probability of a cluster can then be predicted instead of the probability of the word. Clustering can be used to make models smaller or better performing. I will talk briefly about clustering issues specific to the huge amounts of data used in language modeling (hundreds of millions of words) and to forming thousands of clusters.

I will then talk about other language modeling applications, with an emphasis on information retrieval, but also mentioning spelling correction, machine translation, and entering text in Chinese or Japanese. I will briefly describe some recent successful techniques, including Bellegarda's work using latent semantic analysis and Wang's SuperARV language models. Finally, I will talk about some practical aspects of language modeling: how freely available, off-the-shelf tools can be used to easily build language models, where to get data to train a language model, and how to use methods such as count cutoffs or relative-entropy techniques to prune language models.

Those who attend the tutorial should walk away with a broad understanding of current language modeling techniques, the background needed to build their own language models, and the ability to choose the right language model techniques for their applications.
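
As a concrete example of the first of these improvements, here is a minimal sketch of interpolation smoothing: trigram, bigram, and unigram maximum-likelihood estimates mixed with fixed weights. Real systems tune the weights on held-out data (the essence of deleted interpolation) rather than hard-coding them as done here.

```python
# Sketch: interpolation smoothing over trigram/bigram/unigram ML estimates.
from collections import Counter

train = "wreck a nice beach recognize speech wreck a nice beach".split()
uni = Counter(train)
bi = Counter(zip(train, train[1:]))
tri = Counter(zip(train, train[1:], train[2:]))
N = len(train)

def p_interp(w1, w2, w, lambdas=(0.6, 0.3, 0.1)):
    """P(w | w1 w2) as a fixed-weight mixture of ML estimates."""
    l3, l2, l1 = lambdas
    p3 = tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("wreck", "a", "nice"))      # high: the trigram was seen
print(p_interp("recognize", "a", "nice"))  # lower: falls back to bigram/unigram
```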

  • Conference Article
  • Cited by: 19
  • 10.1109/icassp.2019.8683481
Improvements to N-gram Language Model Using Text Generated from Neural Language Model
  • May 1, 2019
  • Masayuki Suzuki + 4 more

Although neural language models have emerged, n-gram language models are still used for many speech recognition tasks. This paper proposes four methods to improve n-gram language models using text generated from a recurrent neural network language model (RNNLM). First, we use multiple RNNLMs from different domains instead of a single RNNLM. The final n-gram language model is obtained by interpolating generated n-gram models from each domain. Second, we use subwords instead of words for RNNLM to reduce the out-of-vocabulary rate. Third, we generate text templates using an RNNLM for template-based data augmentation for named entities. Fourth, we use both forward RNNLM and backward RNNLM to generate text. We found that these four methods improved performance of speech recognition up to 4% relative in various tasks.
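
A minimal sketch of the first method, interpolating per-domain n-gram models: the "generated" corpora below are placeholder strings, whereas in the paper they are sampled from domain-specific RNNLMs, and the interpolation weights would be tuned on held-out data.

```python
# Sketch: linearly interpolate bigram models built from per-domain text.
from collections import Counter

def bigram_model(text):
    toks = text.split()
    bi = Counter(zip(toks, toks[1:]))
    uni = Counter(toks[:-1])  # history counts
    return lambda h, w: (bi[(h, w)] / uni[h]) if uni[h] else 0.0

domains = {  # placeholder text standing in for RNNLM-generated corpora
    "news":   bigram_model("stocks fell today stocks rose today"),
    "sports": bigram_model("the team won today the team lost today"),
}
weights = {"news": 0.5, "sports": 0.5}  # tuned on held-out data in practice

def p_interpolated(h, w):
    return sum(weights[d] * m(h, w) for d, m in domains.items())

print(p_interpolated("stocks", "fell"))  # mass comes from the news model
print(p_interpolated("team", "won"))     # mass comes from the sports model
```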

  • Research Article
  • Cited by: 1
  • 10.1016/j.cose.2024.103947
Fuzzing JavaScript engines with a syntax-aware neural program model
  • Jun 8, 2024
  • Computers & Security
  • Haoran Xu + 5 more

  • Conference Article
  • Cited by: 4
  • 10.23919/spa.2017.8166885
Polish language modelling for speech recognition application
  • Sep 1, 2017
  • Piotr Klosowski

The article presents statistical word-based and phoneme-based language models for automatic speech recognition in Polish. Appropriate orthographic and phonemic language corpora make it possible to perform a statistical analysis of the language and to develop statistical word-based and phoneme-based language models. Such statistical language models help to predict a sequence of recognized words and phonemes. The developed models have been compared, and the one best suited to automatic speech recognition for Polish is proposed. Word-based and phoneme-based language models can be combined into hybrid language models and can effectively contribute to improving the effectiveness of statistical speech recognition. The results and conclusions can also be applied to speech recognition in other languages.

  • Conference Article
  • Cited by: 14
  • 10.1109/icassp.2002.5743835
Rescoring effectiveness of language models using different levels of knowledge and their integration
  • May 1, 2002
  • Wen Wang + 2 more

In this paper, we compare the efficacy of a variety of language models (LMs) for rescoring word graphs and N-best lists generated by a large vocabulary continuous speech recognizer. These LMs differ based on the level of knowledge used (word, lexical features, syntax) and the type of integration of that knowledge (tight or loose). The trigram LM incorporates word level information; our Part-of-Speech (POS) LM uses word and lexical class information in a tightly coupled way; our new SuperARV LM tightly integrates word, a richer set of lexical features than POS, and syntactic dependency information; and the Parser LM integrates some limited word information, POS, and syntactic information. We also investigate LMs created using a linear interpolation of LM pairs. When comparing each LM on the task of rescoring word graphs or N-best lists for the Wall Street Journal (WSJ) 5k- and 20k- vocabulary test sets, the SuperARV LM always achieves the greatest reduction in word error rate (WER) and the greatest increase in sentence accuracy (SAC). On the 5k test sets, the SuperARV LM obtains more than a 10% relative reduction in WER compared to the trigram LM, and on the 20k test set more than 2%. Additionally, the SuperARV LM performs comparably to or better than the interpolated LMs. Hence, we conclude that the tight coupling of knowledge from all three levels is an effective method of constructing high quality LMs.
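
A minimal sketch of the rescoring step itself: combine each N-best hypothesis's acoustic score with a weighted LM score and re-sort. All scores and the toy LM are invented stand-ins for recognizer output and, e.g., a SuperARV or interpolated LM.

```python
# Sketch: rescore an N-best list with a language model.
def rescore(nbest, lm_score, lm_weight=0.8):
    """nbest: list of (hypothesis, acoustic_logprob) pairs."""
    return sorted(
        ((hyp, ac + lm_weight * lm_score(hyp)) for hyp, ac in nbest),
        key=lambda x: -x[1],
    )

def toy_lm_score(hyp):
    # Stand-in for a real LM log-probability (e.g. SuperARV or interpolated LM).
    return -len(hyp.split())  # shorter hypotheses score higher here

nbest = [("recognize speech", -12.0), ("wreck a nice beach", -11.5)]
for hyp, score in rescore(nbest, toy_lm_score):
    print(f"{score:7.2f}  {hyp}")
```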

More from: Nature methods
  • Research Article
  • 10.1038/s41592-025-02877-y
Squidiff: predicting cellular development and responses to perturbations using a diffusion model.
  • Nov 3, 2025
  • Nature methods
  • Siyu He + 13 more

  • Research Article
  • 10.1038/s41592-025-02878-x
Predicting cellular responses with conditional diffusion models.
  • Nov 3, 2025
  • Nature methods

  • Research Article
  • 10.1038/s41592-025-02860-7
A portable poison exon for small-molecule control of mammalian gene expression.
  • Nov 3, 2025
  • Nature methods
  • Qian Hou + 5 more

  • Research Article
  • 10.1038/s41592-025-02865-2
Whole-brain reconstruction of fiber tracts based on cytoarchitectonic organization.
  • Nov 3, 2025
  • Nature methods
  • Yue Zhang + 25 more

  • Research Article
  • 10.1038/s41592-025-02863-4
ESPRESSO: spatiotemporal omics based on organelle phenotyping.
  • Nov 3, 2025
  • Nature methods
  • Lorenzo Scipioni + 10 more

  • Research Article
  • 10.1038/s41592-025-02866-1
High-resolution brain mapping with cytoarchitecture-based link estimation.
  • Nov 3, 2025
  • Nature methods

  • Research Article
  • 10.1038/s41592-025-02855-4
STORIES: learning cell fate landscapes from spatial transcriptomics using optimal transport.
  • Nov 3, 2025
  • Nature methods
  • Geert-Jan Huizing + 5 more

  • Research Article
  • 10.1038/s41592-025-02864-3
Single-cell high-dimensional phenotyping in space and time based on organelle features.
  • Nov 3, 2025
  • Nature methods

  • News Article
  • 10.1038/s41592-025-02882-1
When fieldwork calls your name.
  • Oct 30, 2025
  • Nature methods
  • Vivien Marx

  • Research Article
  • 10.1038/s41592-025-02889-8
High-resolution imaging mass cytometry to map subcellular structures.
  • Oct 30, 2025
  • Nature methods
  • Alina Bollhagen + 5 more
