Articles published on Character-level Language Model
18 Search results
- Research Article
- 10.3389/fcomp.2025.1626899
- Aug 22, 2025
- Frontiers in Computer Science
- Zhigao Huang + 2 more
Rotary Positional Embedding (RoPE) is a widely used technique in Transformers, influenced by the hyperparameter theta (θ). However, the impact of varying *fixed* theta values, especially the trade-off between performance and efficiency on tasks like character-level modeling, remains under-explored. This paper presents a systematic evaluation of RoPE with fixed theta values (ranging from 500 to 50,000) on a character-level GPT model across three datasets: Tiny Shakespeare, Enwik8, and Text8, compared against the standard θ = 10,000 baseline. However, all non-default theta configurations incur significant computational overhead: inference speed is approximately halved across all datasets, suggesting implementation-specific bottlenecks rather than theta-dependent costs. This study quantifies a critical performance-efficiency trade-off when tuning fixed RoPE theta. Our findings emphasize the practical need to balance generalization gains with computational budgets during model development and deployment, contributing empirical insights into RoPE hyperparameter sensitivity and demonstrating that optimal theta selection is highly dataset-dependent. These insights suggest that future positional encoding designs could benefit from adaptive θ scheduling or dataset-specific θ optimization strategies to maximize both performance and computational efficiency.
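For readers unfamiliar with the role of θ, the following is a minimal sketch of the commonly described interleaved-pair form of RoPE, not the implementation evaluated in the paper above. The function names `rope_frequencies`, `apply_rope`, and `dot` are hypothetical and chosen only for illustration. Each pair of dimensions is rotated by an angle equal to the position times θ^(-2i/d), so a smaller θ makes the per-pair frequencies decay more slowly across dimensions.

```python
import math


def rope_frequencies(head_dim, theta=10000.0):
    """Per-pair rotation frequencies theta^(-2i/d) for i = 0 .. d/2 - 1."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, theta=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    freqs = rope_frequencies(len(vec), theta)
    out = []
    for i, freq in enumerate(freqs):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        # Standard 2-D rotation of the pair (x0, x1).
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Under this formulation the dot product between a rotated query and a rotated key depends only on the difference between their positions, which holds for any fixed θ; changing θ changes how quickly that dependence varies with distance.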
- Research Article
- 10.3389/frai.2025.1628943
- Aug 7, 2025
- Frontiers in Artificial Intelligence
- Zhigao Huang + 2 more
We propose Spectral Momentum Integration (SMI), an optimization enhancement that processes gradients in both frequency and time domains. SMI applies the Fast Fourier Transform to selectively filter gradient frequency components before blending them with original gradients using an adaptive scheduling mechanism. Experiments on a character-level language model demonstrate that SMI can achieve inference acceleration while maintaining model performance. Our approach integrates with existing optimizers without modifying model architecture, though it introduces computational overhead and hyperparameter complexity. While our current validation is limited to small-scale experiments, SMI provides a proof-of-concept for incorporating frequency-domain processing into neural network optimization, suggesting potential for broader applications pending large-scale validation.
- Research Article
2
- 10.3390/info16060475
- Jun 6, 2025
- Information
- Zhigao Huang + 2 more
Deep neural networks are often susceptible to overfitting, necessitating effective regularization techniques. This paper introduces Spectral Adaptive Dropout, a novel frequency-based regularization technique that dynamically adjusts dropout rates based on the spectral characteristics of network gradients. The proposed approach addresses the limitations of traditional dropout methods by adaptively targeting high-frequency components that typically contribute to overfitting while preserving essential low-frequency information. Through extensive experimentation on character-level language modeling tasks, the study demonstrates that the method achieves a 1.10% improvement in validation loss while maintaining competitive inference speeds. This research explores several implementations including FFT-based analysis, wavelet decomposition, and per-attention-head adaptation, culminating in an optimized approach that balances computational efficiency with regularization effectiveness. Our results highlight the significant potential of incorporating frequency-domain information into regularization strategies for deep neural networks.
- Research Article
2
- 10.70179/grdjev09i120213
- Dec 1, 2024
- Global Research and Development Journals
- Rajesh Kumar Malviya + 3 more
A method of evolving deep learning architectures using genetic algorithms is presented. The method is a first step towards a low-cost evolutionary search for task-specific neural networks. We evolve task-specific model architectures optimized for fast execution and low error on several standard machine learning tasks: image classification, character-level language modeling, and solving the cart pole problem. We also introduce a simple variation of the method that is capable of evolving neural networks with recurrent connections of varying depth and length and show performance on a word-level language modeling task. The method is implemented in an open-source library. We hope that the ability to run an evolutionary search at this scale will make it possible for a wide audience to develop deep learning architectures that are specialized for a variety of tasks and to develop many interesting novel architectural features. A new method that uses evolutionary search to directly modify existing neural network architectures to perform a specific task is presented. We demonstrate that task-specific specialization of deep learning models can be useful in practice. We modify convolutional neural networks, residual networks, and an LSTM variant to perform various tasks, and show that specialized networks often perform better than models trained from scratch that have many more parameters and much larger training time. For example, on the object recognition task, a specialized model is built by training a base network to predict object position and then applying a series of genetic search operations to squeeze the network and fit new final layer weights to the output. The specialized model is 8 times faster and has 13% lower error, despite being 17 times smaller than a fully trained larger and slower network.
- Research Article
7
- 10.1162/tacl_a_00651
- Apr 16, 2024
- Transactions of the Association for Computational Linguistics
- Lukas Edman + 4 more
Pretrained character-level and byte-level language models have been shown to be competitive with popular subword models across a range of Natural Language Processing tasks. However, there has been little research on their effectiveness for neural machine translation (NMT), particularly within the popular pretrain-then-finetune paradigm. This work performs an extensive comparison across multiple languages and experimental conditions of character- and subword-level pretrained models (ByT5 and mT5, respectively) on NMT. We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited. In our analysis, we show how character models’ gains in translation quality are reflected in better translations of orthographically similar words and rare words. While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5, suggesting an ability to modulate word-level and character-level information during generation. We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.
- Research Article
14
- 10.1145/3483446
- Dec 13, 2021
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Deepang Raval + 3 more
We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as a loss function. To improve the performance of the system with the limited size of the dataset, we present a combined language model (Word-level language Model and Character-level language model)-based prefix decoding technique and Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us in understanding and improving the ASR system as well as provide intuition into the language used for the ASR system. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to base-model WER.
- Research Article
7
- 10.1109/tcbb.2021.3109557
- Nov 1, 2021
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
- Eric Chen + 4 more
Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as “gaps”. Here, we introduce GapPredict, an implementation of a proof of concept that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that the former can fill 65.6% of the sampled gaps that were left unfilled by the latter with high similarity to the reference genome, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome assembly.
- Research Article
8
- 10.5281/zenodo.6580098
- Nov 1, 2020
- Zenodo (CERN European Organization for Nuclear Research)
- Miquel Esplà-Gomis + 3 more
This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bicleaner, enhances it with Extremely Randomised Trees and lexical similarity features that account for the frequency of the words in the parallel sentences to determine if two sentences are parallel. To train this classifier we used the clean corpora provided for the task and synthetic noisy parallel sentences. In addition we re-score the output of Bicleaner using character-level language models and n-gram saturation.
- Research Article
112
- 10.1038/s41467-020-18959-8
- Oct 9, 2020
- Nature Communications
- Sun-Ting Tsai + 2 more
Recurrent neural networks have led to breakthroughs in natural language processing and speech recognition. Here we show that recurrent networks, specifically long short-term memory networks, can also capture the temporal evolution of chemical/biophysical trajectories. Our character-level language model learns a probabilistic model of 1-dimensional stochastic trajectories generated from higher-dimensional dynamics. The model captures Boltzmann statistics and also reproduces kinetics across a spectrum of timescales. We demonstrate how training the long short-term memory network is equivalent to learning a path entropy, and that its embedding layer, instead of representing contextual meaning of characters, here exhibits a nontrivial connectivity between different metastable states in the underlying physical system. We demonstrate our model’s reliability through different benchmark systems and a force spectroscopy trajectory for a multi-state riboswitch. We anticipate that our work represents a stepping stone in the understanding and use of recurrent neural networks for understanding the dynamics of complex stochastic molecular systems.
- Research Article
1
- 10.1609/aaai.v34i04.5958
- Apr 3, 2020
- Proceedings of the AAAI Conference on Artificial Intelligence
- Fandong Meng + 3 more
Recurrent neural networks (RNNs) have been widely used to deal with sequence learning problems. The input-dependent transition function, which folds new observations into hidden states to sequentially construct fixed-length representations of arbitrary-length sequences, plays a critical role in RNNs. Based on single space composition, transition functions in existing RNNs often have difficulty in capturing complicated long-range dependencies. In this paper, we introduce a new Multi-zone Unit (MZU) for RNNs. The key idea is to design a transition function that is capable of modeling multiple space composition. The MZU consists of three components: zone generation, zone composition, and zone aggregation. Experimental results on multiple datasets of the character-level language modeling task and the aspect-based sentiment analysis task demonstrate the superiority of the MZU.
- Research Article
3
- 10.6919/icje.202001_6(1).0028
- Jan 1, 2020
- International Core Journal of Engineering
- Chaoju Hu + 1 more
As a basic task in the field of natural language processing, named entity recognition plays an important role in text data processing tasks. Extracting features from the original text can be considered the first step in the identification of named entities, but on this basic issue, traditional research still stays at the coarser granularity of words. Unlike traditional research, this paper focuses on a finer granularity: character-level named entity recognition. In order to fully extract the character-level feature representation from the character-level language model, this paper uses CNN and BiLSTM to perform feature extraction together, and introduces the attention mechanism to achieve a more effective combination of character features and word features, then combines with BiLSTM-CRF to construct a complete end-to-end deep learning model (At-BiLSTM-CNNs-CRF). The experimental results show that its recognition ability exceeds most deep learning models.
- Research Article
9
- 10.1162/tacl_a_00283
- Nov 1, 2019
- Transactions of the Association for Computational Linguistics
- Michael Hahn + 1 more
Recurrent neural networks (RNNs) have reached striking performance in many natural language processing tasks. This has renewed interest in whether these generic sequence processing devices are inducing genuine linguistic knowledge. Nearly all current analytical studies, however, initialize the RNNs with a vocabulary of known words, and feed them tokenized input during training. We present a multi-lingual study of the linguistic knowledge encoded in RNNs trained as character-level language models, on input data with word boundaries removed. These networks face a tougher and more cognitively realistic task, having to discover any useful linguistic unit from scratch based on input statistics. The results show that our “near tabula rasa” RNNs are mostly able to solve morphological, syntactic and semantic tasks that intuitively presuppose word-level knowledge, and indeed they learned, to some extent, to track word boundaries. Our study opens the door to speculations about the necessity of an explicit, rigid word lexicon in language learning and usage.
- Research Article
373
- 10.1609/aaai.v33i01.33013159
- Jul 17, 2019
- Proceedings of the AAAI Conference on Artificial Intelligence
- Rami Al-Rfou + 4 more
LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model (Vaswani et al. 2017) with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
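The bits-per-character figures quoted above follow the usual definition: the average negative base-2 log-probability that a model assigns to each true next character. The short sketch below illustrates that definition and the conversion from a cross-entropy loss in nats; the function names are hypothetical and the probabilities in the usage note are made up for illustration, not taken from the paper.

```python
import math


def bits_per_character(probs):
    """Average negative log2-probability assigned to the true characters."""
    if not probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in probs) / len(probs)


def nats_to_bits(loss_nats):
    """Convert a mean cross-entropy loss in nats to bits."""
    return loss_nats / math.log(2)
```

For instance, a model that assigns probability 0.5 to every correct character scores exactly 1 bit per character, so a result near 1.06 means the model is, on average, slightly less certain than a fair coin flip per character.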
- Book Chapter
2
- 10.3233/faia190328
- Jan 1, 2019
- Frontiers in artificial intelligence and applications
- Smywiński-Pohl Aleksander + 3 more
Application of Character-Level Language Models in the Domain of Polish Statutory Law
- Research Article
46
- 10.1016/j.patrec.2018.09.006
- Sep 5, 2018
- Pattern Recognition Letters
- Fréderic Godin + 3 more
Dual Rectified Linear Units (DReLUs): A replacement for tanh activation functions in Quasi-Recurrent Neural Networks
- Research Article
6
- 10.1016/j.patrec.2018.06.023
- Jun 28, 2018
- Pattern Recognition Letters
- Tehseen Zia
Hierarchical recurrent highway networks
- Research Article
77
- 10.1016/j.neucom.2018.03.020
- Mar 14, 2018
- Neurocomputing
- Wei Xia + 5 more
Novel architecture for long short-term memory used in question classification
- Research Article
25
- 10.1121/1.4768800
- Jan 1, 2013
- The Journal of the Acoustical Society of America
- Xunying Liu + 3 more
Mandarin Chinese is based on characters which are syllabic in nature and morphological in meaning. All spoken languages have syllabiotactic rules which govern the construction of syllables and their allowed sequences. These constraints are not as restrictive as those learned from word sequences, but they can provide additional useful linguistic information. Hence, it is possible to improve speech recognition performance by appropriately combining these two types of constraints. For the Chinese language considered in this paper, character level language models (LMs) can be used as a first level approximation to allowed syllable sequences. To test this idea, word and character level n-gram LMs were trained on 2.8 billion words (equivalent to 4.3 billion characters) of texts from a wide collection of text sources. Both hypothesis and model based combination techniques were investigated to combine word and character level LMs. Significant character error rate reductions up to 7.3% relative were obtained on a state-of-the-art Mandarin Chinese broadcast audio recognition task using an adapted history dependent multi-level LM that performs a log-linear combination of character and word level LMs. This supports the hypothesis that character or syllable sequence models are useful for improving Mandarin speech recognition performance.
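As a rough illustration of what a log-linear combination of two language models can look like, the sketch below rescores recognition hypotheses with a weighted sum of word-level and character-level log-probabilities. It is an assumption-laden toy, not the adapted multi-level LM from the paper: the function names, the single interpolation weight `weight_word`, and the example scores are all hypothetical.

```python
def log_linear_score(logp_word, logp_char, weight_word=0.5):
    """Weighted sum of two log-probabilities (unnormalized log-linear mix)."""
    return weight_word * logp_word + (1.0 - weight_word) * logp_char


def rerank(hypotheses, weight_word=0.5):
    """Sort (text, logp_word, logp_char) tuples from best to worst score."""
    return sorted(
        hypotheses,
        key=lambda h: log_linear_score(h[1], h[2], weight_word),
        reverse=True,
    )


# Made-up scores: the word LM prefers "a", the character LM prefers "b".
example = [("a", -2.0, -6.0), ("b", -3.0, -4.0)]
best_equal_weight = rerank(example, weight_word=0.5)[0][0]
best_word_only = rerank(example, weight_word=1.0)[0][0]
```

With equal weights the combined score favors "b" (-3.5 against -4.0), while relying on the word-level LM alone favors "a"; in practice the weight would be tuned on held-out data.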