A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese
By defining a computable word segmentation unit and studying its probabilistic characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework in this article. The proposed SLM is an optimization model whose objective is to maximize the total binding force of all candidate word segmentation units in a sentence, without annotated datasets or vocabularies. To solve SLM, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating SLM with popular sequence labeling models, we perform Vietnamese word segmentation, part-of-speech tagging, and named entity recognition experiments. The experimental results show that SLM effectively improves the performance of sequence labeling tasks. Using less than 10% of the training data and no dictionary, our sequence labeling framework outperforms the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP on the cross-dataset test. SLM has no hyper-parameters to tune, is completely unsupervised, and is applicable to any other analytic language; it therefore has good domain adaptability.
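The core of SLM is the dynamic program over candidate units. A minimal Python sketch of that optimization, assuming a per-unit scoring function `bind` (a stand-in for the paper's binding-force statistic) and a maximum unit length; the names and toy scores below are ours, not the paper's:

```python
from functools import lru_cache

def segment(sentence, bind, max_len=4):
    """Split `sentence` (a list of syllables) into the units that maximize
    total binding force, via recursive divide-and-conquer dynamic programming."""
    n = len(sentence)

    @lru_cache(maxsize=None)
    def best(i):
        # best(i) = (max total score, best segmentation) of sentence[i:]
        if i == n:
            return 0.0, ()
        candidates = []
        for j in range(i + 1, min(i + max_len, n) + 1):
            unit = tuple(sentence[i:j])
            tail_score, tail_seg = best(j)
            candidates.append((bind(unit) + tail_score, (unit,) + tail_seg))
        return max(candidates, key=lambda c: c[0])

    return list(best(0)[1])

# Toy binding force, standing in for the paper's statistic: reward one
# known compound, give single syllables a small baseline score.
toy_bind = lambda u: {("việt", "nam"): 3.0}.get(u, 1.0 if len(u) == 1 else 0.0)
print(segment(["việt", "nam", "là"], toy_bind))  # [('việt', 'nam'), ('là',)]
```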
- Research Article
36
- 10.1093/bib/bbae040
- Jan 22, 2024
- Briefings in bioinformatics
Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.
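As a rough illustration of the decoder described here, a PyTorch-style sketch of an LSTM-attention tagger over pre-computed language-model embeddings; all layer sizes, and the single embedding input in place of ULDNA's three concatenated language models, are our assumptions:

```python
import torch
import torch.nn as nn

class LSTMAttentionTagger(nn.Module):
    """Per-residue binary classifier over pre-computed LM embeddings
    (shape [batch, seq_len, emb_dim]), in the spirit of ULDNA's
    LSTM-attention decoder. Dimensions are illustrative guesses."""
    def __init__(self, emb_dim=1280, hidden=256, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)  # one binding logit per residue

    def forward(self, emb):
        h, _ = self.lstm(emb)
        a, _ = self.attn(h, h, h)       # self-attention over the sequence
        return self.head(a).squeeze(-1)  # [batch, seq_len] logits

model = LSTMAttentionTagger()
logits = model(torch.randn(2, 100, 1280))  # two proteins, 100 residues each
```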
- Conference Article
48
- 10.1109/lisat.2017.8001979
- May 1, 2017
Most machine learning algorithms require the input to be represented as a fixed-length feature vector. In text classification, bag-of-words is a popular fixed-length representation. Despite its simplicity, it is limited in many tasks: it ignores the semantics of words and loses their ordering. In this paper, we propose a simple and efficient neural language model for sentence-level classification. Our model employs a Recurrent Neural Network Language Model (RNN-LM); in particular, a Long Short-Term Memory (LSTM) network over pre-trained word vectors obtained from an unsupervised neural language model, to capture semantic and syntactic information in a short sentence. We achieved strong empirical results on multiple benchmark datasets: the IMDB sentiment analysis dataset and the Stanford Sentiment Treebank (SSTb). The results show that our model is comparable with other neural methods and outperforms traditional methods on the sentiment analysis task.
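A minimal PyTorch sketch of such a classifier, assuming frozen pre-trained vectors and reading the final LSTM state; sizes follow common practice rather than the paper:

```python
import torch
import torch.nn as nn

class SentenceLSTM(nn.Module):
    """Sentence classifier: frozen pre-trained word vectors -> LSTM ->
    final hidden state -> class logits. Dimensions are illustrative."""
    def __init__(self, vectors, hidden=128, classes=2):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(vectors, freeze=True)
        self.lstm = nn.LSTM(vectors.size(1), hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.emb(token_ids))
        return self.out(h_n[-1])  # logits from the last time step

vectors = torch.randn(5000, 300)            # stand-in for word2vec/GloVe
model = SentenceLSTM(vectors)
logits = model(torch.randint(0, 5000, (8, 40)))  # batch of 8 sentences
```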
- Research Article
6
- 10.1016/j.specom.2004.10.017
- Jan 7, 2005
- Speech Communication
Post-dialogue confidence scoring for unsupervised statistical language model training
- Conference Article
76
- 10.1109/icassp.2009.4960579
- Apr 1, 2009
We measure the effects of a weak language model, estimated from as little as 100k words of text, on unsupervised acoustic model training, and then explore the best method of using word confidences to estimate n-gram counts for unsupervised language model training. Even with 100k words of text and 10 hours of training data, unsupervised acoustic modeling is robust, recovering 50% of the gain of supervised training. For language model training, multiplying the word confidences together to obtain a weighted count produces the best result, reducing WER by 2% over the baseline language model and by 0.5% absolute over using unweighted transcripts. Oracle experiments show that a larger gain is possible, but better confidence estimation techniques are needed to identify correct n-grams.
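The best-performing variant, multiplying word confidences into a weighted count, is easy to make concrete. A small Python sketch (function and variable names are ours):

```python
from collections import Counter

def weighted_ngram_counts(hypotheses, n=3):
    """Accumulate n-gram counts from ASR hypotheses, weighting each n-gram
    by the product of its word confidences. `hypotheses` yields
    (words, confidences) pairs for each automatically transcribed utterance."""
    counts = Counter()
    for words, confs in hypotheses:
        for i in range(len(words) - n + 1):
            weight = 1.0
            for c in confs[i:i + n]:
                weight *= c  # low-confidence words shrink the count
            counts[tuple(words[i:i + n])] += weight
    return counts

hyps = [(["the", "cat", "sat"], [0.9, 0.8, 0.95])]
print(weighted_ngram_counts(hyps, n=2))
# Counter({('cat', 'sat'): 0.76, ('the', 'cat'): 0.72})
```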
- Research Article
4
- 10.1155/2022/8187680
- Feb 28, 2022
- Scientific Programming
Traditional Vietnamese word segmentation methods do not handle Vietnamese ambiguity well, and the scarcity of Vietnamese corpora poses an enormous additional challenge for language processing. We first investigated state-of-the-art deep neural network methods. To address the ambiguity problem of Vietnamese word segmentation, we then proposed a segmentation technique based on an improved long short-term memory (LSTM) neural network, made up of an LSTM encoder and a CNN feature extraction component. Important earlier context is kept in the memory unit, avoiding the limitation of a local context window, and the segmentation task is cast as a classification and sequence labeling problem, which allows useful character- and word-level segmentation features to be learned automatically. Finally, on a homemade dataset crawled from Vietnamese news websites, the experimental results show that our proposed method improves markedly over single-LSTM, single-CNN, and traditional methods. On the Vietnamese word segmentation task, accuracy reaches 96.6%, recall reaches 95.2%, and the F1 value reaches 96.3%, significantly better than the CNN and LSTM baselines.
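A rough PyTorch sketch of an encoder in this spirit, combining a CNN feature extractor with a bidirectional LSTM to emit per-position segmentation tags; every dimension here is an illustrative guess, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LSTMCNNSegmenter(nn.Module):
    """Syllable-level sequence labeler: CNN feature extraction feeding an
    LSTM encoder, with e.g. B/I tags marking word starts/continuations."""
    def __init__(self, vocab, emb=64, hidden=128, kernel=3, tags=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, emb, kernel, padding=kernel // 2)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, tags)

    def forward(self, ids):
        x = self.emb(ids).transpose(1, 2)              # [B, emb, T] for Conv1d
        x = torch.relu(self.conv(x)).transpose(1, 2)   # local n-gram features
        h, _ = self.lstm(x)                            # long-range context
        return self.out(h)                             # per-position tag logits

model = LSTMCNNSegmenter(vocab=8000)
tag_logits = model(torch.randint(0, 8000, (4, 30)))   # 4 sentences, 30 syllables
```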
- Conference Article
7
- 10.21437/interspeech.2004-488
- Oct 4, 2004
Statistical language models are widely used in automatic speech recognition to constrain the decoding of a sentence. Most of these models derive from the classical n-gram paradigm. However, the production of a word depends on a large set of linguistic features: lexical, syntactic, semantic, etc. Moreover, in some natural languages the gender and number of the left context affect the production of the next word. Therefore, it seems attractive to design a language model based on a variety of word features. We present in this paper a new statistical language model based on this idea, called the Statistical Feature Language Model (SFLM). In SFLM a word is considered as an array of linguistic features, and the model is defined in a way similar to the n-gram model. Experiments carried out for French show an improvement in terms of perplexity and predicted words.
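One way to write the SFLM idea down (our rendering; the paper's exact factorization may differ): treat each word $w_i$ as a feature array $w_i = (f_i^1, \dots, f_i^K)$ covering lexical form, part of speech, gender, number, and so on, and score a sentence n-gram-style over those arrays,

$$P(w_1 \dots w_L) \;\approx\; \prod_{i=1}^{L} P\!\left(f_i^1, \dots, f_i^K \mid w_{i-n+1}, \dots, w_{i-1}\right),$$

so that the conditioning context carries the gender and number of the left context, not just its surface forms.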
- Research Article
4
- 10.14483/23448393.11616
- Sep 12, 2017
- Ingeniería
Context: Automatic speech recognition requires the development of language and acoustic models for the different existing dialects. The purpose of this research is the training of an acoustic model, a statistical language model, and a grammar language model for Spanish, specifically for the dialect of the city of San Jose de Cucuta, Colombia, for use in a command-control system. Existing models for Spanish have problems recognizing the fundamental frequency and spectral content, the accent, pronunciation, and tone, or simply lack a language model for Cucuta's dialect.

Method: We used a Raspberry Pi B+ embedded system running the Raspbian operating system (a Linux distribution) and two open-source packages, the CMU-Cambridge Statistical Language Modeling Toolkit from the University of Cambridge and CMU Sphinx from Carnegie Mellon University; both are based on Hidden Markov Models for the calculation of voice parameters. We also used 1913 audio recordings of the voices of people from San Jose de Cucuta and the Norte de Santander department for training and testing the automatic speech recognition system.

Results: We obtained a language model consisting of two files: a statistical language model (.lm) and a JSGF grammar model (.jsgf). For the acoustic component, two models were trained; the improved version achieved a 100% accuracy rate on the training results and an 83% accuracy rate in the audio tests for command recognition. Finally, we wrote a manual for creating acoustic and language models with the CMU Sphinx software.

Conclusions: The number of participants in the training process of the language and acoustic models has a significant influence on the quality of the recognizer's voice processing. Using a large dictionary for training and a short dictionary containing the command words at deployment is important for a better response from the automatic speech recognition system. Given the accuracy rate above 80% in the voice recognition tests, the proposed models are suitable for applications assisting people with visual or motion impairments.
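For context, the `.jsgf` file mentioned above follows the JSGF grammar format. A minimal, hypothetical command grammar (our own illustration, not the paper's actual file) looks like this:

```
#JSGF V1.0;
grammar comandos;
// one public rule listing the recognizable spoken commands
public <comando> = (enciende | apaga) (la luz | la puerta);
```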
- Conference Article
2
- 10.1109/icecta.2017.8251935
- Nov 1, 2017
Statistical n-gram language models (LMs) have proven very effective in natural language processing (NLP), particularly in automatic speech recognition (ASR) and machine translation. The success of LMs has prompted the introduction of efficient techniques and different model types in various linguistic applications. LMs fall mainly into two types: grammars and statistical language models, also called n-grams. The main difference between them is that statistical language models estimate probabilities for word sequences, while grammars usually do not carry probabilities. Although many toolkits can be used to create LMs, this work employs two well-known language modeling toolkits with a focus on Arabic text: the Carnegie Mellon University (CMU)-Cambridge Language Modeling Toolkit and the Cambridge University Hidden Markov Model Toolkit (HTK). For clarification, we used a small Arabic text corpus to compute the n-grams for 1-gram, 2-gram, and 3-gram. In addition, this paper demonstrates the intermediate steps needed to generate ARPA-format LMs using both toolkits.
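The counting step itself is straightforward; a minimal Python sketch of building the 1- through 3-gram tables on a tokenized corpus (the real toolkits add vocabulary handling, discounting, and ARPA export on top of this, via tools such as the CMU-Cambridge text2idngram/idngram2lm chain and HTK's HLStats/HBuild):

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count 1-, 2-, and 3-grams, mirroring the paper's unigram-through-
    trigram setup on a whitespace-tokenized corpus."""
    tables = {}
    for n in range(1, max_n + 1):
        tables[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return tables

corpus = "السلام عليكم ورحمة الله وبركاته".split()
for n, table in ngram_counts(corpus).items():
    print(n, table.most_common(2))
```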
- Research Article
116
- 10.1109/access.2022.3149798
- Jan 1, 2022
- IEEE Access
Learning human languages is a difficult task for a computer. However, Deep Learning (DL) techniques have significantly enhanced performance on almost all natural language processing (NLP) tasks. Unfortunately, these models cannot be generalized across all NLP tasks with similar performance. NLU (Natural Language Understanding) is a subset of NLP comprising tasks such as machine translation, dialogue-based systems, natural language inference, text entailment, and sentiment analysis. Advancement in the field of NLU means collective performance enhancement across all these tasks. Even though MTL (Multi-task Learning) was introduced before Deep Learning, it has gained significant attention in recent years. This paper aims to identify, investigate, and analyze the various language models used in NLU and NLP to find directions for future research. The Systematic Literature Review (SLR) was prepared using the literature search guidelines proposed by Kitchenham and Charters, covering language models between 2011 and 2021. This SLR points out that language models based on unsupervised learning show potential performance improvement, but face the challenge of designing a general-purpose framework that improves both multi-task NLU performance and the generalized representation of knowledge. Combining these approaches may result in a more efficient and robust multi-task NLU. This SLR proposes building steps for a conceptual framework to achieve the goal of enhancing the performance of language models in the field of NLU.
- Conference Article
29
- 10.21437/interspeech.2006-330
- Sep 17, 2006
Within the EU Network of Excellence PASCAL, a challenge was organized to design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes. Ideally, these are basic vocabulary units suitable for different tasks, such as speech and text understanding, machine translation, information retrieval, and statistical language modeling. Twelve research groups participated in the challenge and submitted segmentation results obtained by their algorithms. In this paper, we evaluate the application of these segmentation algorithms to large-vocabulary speech recognition, using statistical n-gram language models based on the proposed word segments instead of entire words. Experiments were done for two agglutinative and morphologically rich languages: Finnish and Turkish. We also investigate combining various segmentations to improve the performance of the recognizer. Index Terms: speech recognition, language modelling, morphemes, unsupervised learning.
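A sketch of the key idea: build n-gram counts over morphs rather than words, with a boundary marker so words can be reassembled after decoding. The toy segmenter below stands in for an actual Morpho Challenge algorithm; all names are ours:

```python
from collections import Counter

def morph_ngrams(sentences, segment, n=3):
    """Build n-gram counts over morphs instead of words. `segment` is any
    morpheme segmenter; '<w>' marks word boundaries."""
    counts = Counter()
    for words in sentences:
        units = ["<s>"]
        for w in words:
            units += segment(w) + ["<w>"]
        units.append("</s>")
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts

# Toy segmenter: naive suffix split, standing in for a learned model.
toy_segment = lambda w: [w[:-2], w[-2:]] if len(w) > 4 else [w]
print(morph_ngrams([["taloissa", "on"]], toy_segment, n=2))
```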
- Book Chapter
4
- 10.1007/978-3-030-90963-5_38
- Jan 1, 2021
With the rapid advancement of natural language processing (NLP) as a sub-field of artificial intelligence (AI), a number of unsupervised pre-trained language models trained on large corpora have become available (e.g. BERT and GPT-3). While these models have tremendous linguistic knowledge, many other types of knowledge are embedded in them as well. We perform cross-culture analysis experiments using Masked Language Modeling (MLM) and GPT-based generative language modeling (in-context learning). The approach is to set up a cultural context in sentences with masked words (for MLM) or in a human-prompted text segment (for GPT-based generation). Consequently, the predicted masked words or the machine-generated stories reflect measurable intercultural differences, because language models are trained on corpora in different languages, and on English corpora containing a significant amount of knowledge about foreign cultures. We show a variety of examples: geopolitical knowledge, holidays, gestures, customs, social norms, emotion schema, role schema, procedure schema, and emotion change detection based on a diplomatic speech. The deep learning model encodes its knowledge in the weights of a neural network rather than as organized semantic concepts; it can reflect biases brought in by the training data and can give inaccurate or faulty answers. Overall, with the rapid advancement of language technology, pre-trained language models have grown more powerful and have great potential to serve as a culturalization tool.
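The MLM probing setup is easy to reproduce with off-the-shelf tooling; a sketch using the Hugging Face fill-mask pipeline, where the prompt and model choice are our own illustration rather than the chapter's exact materials:

```python
from transformers import pipeline

# Masked-language-model probing in the spirit of the chapter's MLM
# experiments: set a cultural context, mask a word, inspect predictions.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("In Japan, people greet each other with a [MASK]."):
    print(f"{pred['token_str']!r}  p={pred['score']:.3f}")
```

Swapping in the country name and comparing the ranked completions is the basic measurement; the chapter's analyses run variations of this across topics and models.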
- Research Article
18
- 10.1016/j.neunet.2021.05.023
- May 25, 2021
- Neural Networks
Unsupervised multi-sense language models for natural language processing tasks
- Research Article
31
- 10.1145/1034780.1034781
- Jun 1, 2004
- ACM Transactions on Asian Language Information Processing
Introduction to the special issue on statistical language modeling. Jianfeng Gao (Microsoft Research Asia, Beijing, China) and Chin-Yew Lin (Information Sciences Institute, University of Southern California, CA). ACM Transactions on Asian Language Information Processing, Volume 3, Issue 2, June 2004, pp. 87–93.
- Conference Article
9
- 10.21437/interspeech.2011-416
- Aug 27, 2011
A Mobile Virtual Assistant (MVA) is a communication agent that recognizes and understands free speech, and performs actions such as retrieving information and completing transactions. One essential characteristic of MVAs is their ability to learn and adapt without supervision. This paper describes our ongoing research in developing more intelligent MVAs that recognize and understand very large vocabulary speech input across a variety of tasks. In particular, we present our architecture for unsupervised acoustic and language model adaptation. Experimental results show that unsupervised acoustic model learning approaches the performance of supervised learning when adapting on 40-50 device-specific utterances. Unsupervised language model learning results in an 8% absolute drop in word error rate.
- Conference Article
6
- 10.23919/spa.2017.8166885
- Sep 1, 2017
The article presents statistical word-based and phoneme-based language models for automatic speech recognition in Polish. Appropriate orthographic and phonemic corpora make it possible to perform a statistical analysis of the language and to develop statistical word-based and phoneme-based language models, which help predict the sequence of recognized words and phonemes. The developed statistical language models are compared, and the one best suited to automatic speech recognition in Polish is identified. Word-based and phoneme-based language models can also be combined into hybrid language models and effectively contribute to improving the effectiveness of statistical speech recognition. The research results and conclusions can also be applied to speech recognition for other languages.
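As a toy illustration of comparing unit inventories, a minimal add-one-smoothed bigram model that works identically over word or phoneme sequences; every detail here is ours, a stand-in for the paper's statistical analysis:

```python
import math
from collections import Counter

def bigram_perplexity(train, test):
    """Add-one-smoothed bigram perplexity. `train`/`test` are sequences of
    units, so the same code scores word-based and phoneme-based models."""
    uni = Counter(train)
    bi = Counter(zip(train, train[1:]))
    V = len(uni)  # vocabulary size for add-one smoothing
    logp = sum(math.log((bi[(a, b)] + 1) / (uni[a] + V))
               for a, b in zip(test, test[1:]))
    return math.exp(-logp / max(len(test) - 1, 1))

words = "ala ma kota a kot ma ale".split()
print(bigram_perplexity(words, ["kot", "ma", "ale"]))
```

Running the same function over phoneme transcriptions of the corpus gives the phoneme-based counterpart, which is the comparison the article draws at full scale.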