Implementation of the LSTM Model for Speech-to-Text Systems in the Recognition of the Walikan Language of Malang
This study developed a Speech-to-Text (STT) system based on the Long Short-Term Memory (LSTM) model to recognize and convert speech in the Malang Walikan language into text. The Malang Walikan language has a unique linguistic structure in the form of word reversal, which poses a challenge in speech recognition. The data used consisted of 1,000 sentences collected from social media and direct recordings. The data was processed using Mel Frequency Cepstral Coefficients (MFCC) and then used to train the LSTM model. The system's performance was evaluated using the Word Error Rate (WER), Character Error Rate (CER), and Average Test Loss metrics. The best results obtained showed a WER value of 1.0 on a 699:300 data split, a CER of 0.78 on a 799:200 split, and an Average Test Loss of 11.0147 on a 299:700 split. The high Average Test Loss value indicates the model's difficulty in minimizing prediction errors, which may be caused by the model's mismatch with the data patterns or overfitting. To improve the model's performance, it is recommended to improve the quality of the training data, optimize the parameters, and apply regularization techniques.
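As an illustration of the WER and CER metrics named above, the following is a minimal sketch that assumes both are defined in the usual way, as Levenshtein edit distance divided by the reference length (in words for WER, in characters for CER). The function names and the example strings are hypothetical and are not taken from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted word out of three.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

Under this definition a WER of 1.0 means the number of word-level errors equals the number of reference words, and insertions can push the value above 1.0.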
- Research Article
5
- 10.3390/app142210498
- Nov 14, 2024
- Applied Sciences
Arabic raw audio datasets were initially gathered to produce a corresponding signal spectrum, which was further used to extract the Mel-Frequency Cepstral Coefficients (MFCCs). The pronunciation dictionary, language model, and acoustic model were further derived from the MFCCs’ features. These output data were processed into Baidu’s Deep Speech model (ASR system) to attain the text corpus. Baidu’s Deep Speech model was implemented to precisely identify the global optimal value rapidly while preserving a low word and character discrepancy rate by attaining an excellent performance in isolated and end-to-end speech recognition. The desired outcome in this work is to forecast the next word and character in a sequential and systematic order that applies under natural language processing (NLP). This work combines the trained Arabic language model ARABERT with the potential of Long Short-Term Memory (LSTM) networks to predict the next word and character in an Arabic text. We used the pre-trained ARABERT embedding to improve the model’s capacity and, to capture semantic relationships within the language, we trained LSTM + CNN and Markov models on Arabic text data to assess the efficacy of this model. Python libraries such as TensorFlow, Pickle, Keras, and NumPy were used to effectively design our development model. We extensively assessed the model’s performance using new Arabic text, focusing on evaluation metrics like accuracy, word error rate, character error rate, BLEU score, and perplexity. The results show how well the combined LSTM + ARABERT and Markov models have outperformed the baseline models in predicting the next word or character in the Arabic text. The accuracy rates of 64.9% for LSTM, 74.6% for ARABERT + LSTM, and 78% for Markov chain models were achieved in predicting the next word, and the accuracy rates of 72% for LSTM, 72.22% for LSTM + CNN, and 73% for ARABERT + LSTM models were achieved for the next-character prediction.
This work unveils a novelty in Arabic natural language processing tasks, estimating a potential future expansion in deriving a precise next-word and next-character forecasting, which can be an efficient utility for text generation and machine translation applications.
- Conference Article
1
- 10.1109/icimcis60089.2023.10349042
- Nov 7, 2023
Automatic Speech Recognition (ASR) is useful for converting speech into text. ASR is needed to display automatic subtitles on movies or when conducting video conferencing. The use of deep learning in ASR applications is currently still dominated by Long Short-Term Memory (LSTM) and Recurrent Neural Network (RNN). Several previous studies used the Transformer deep learning model in building the ASR model. The accuracy results obtained are better than the LSTM or RNN models. However, ASR research using the Transformer model is still very limited. This paper will discuss the use of the Transformer model for ASR in Indonesian. The dataset used is the Indonesian language private speech dataset. The experimental results show that the Word Error Rate (WER) and Character Error Rate (CER) produced by the proposed model using the Indonesian language primary dataset are almost the same as using the English language public dataset where the resulting WER and CER are 27.34 and 7.96 for Indonesian and 25.28 and 6.02 for English.
- Research Article
4
- 10.11113/aej.v13.19648
- Oct 24, 2023
- ASEAN Engineering Journal
Today, fake information has become a significant problem, exacerbated by the acceleration of access to information. The spread of fake information has a dangerous impact, especially regarding global health issues, for example COVID-19. People can access various resources to obtain information, including online sites and social media. One of the methods to control the spread of false information is detecting hoaxes. Many methods have been developed to identify hoaxes; most previous studies have focused on developing hoax detection methods using data from a single source in English. The present study is carried out to detect fake news in the Indonesian language using multiple data sources, including traditional and social media in the context of COVID-19. The study uses Long Short-Term Memory (LSTM) and the Robustly Optimised Bidirectional Encoder Representations from Transformers Pre-Training Approach (RoBERTa). The LSTM approach is used to develop four different architectures that varied based on: (1) the use of text-only versus the use of both title and text; (2) the number of LSTM and dense layers; and (3) the activation function. The LSTM model with text-only data, a single LSTM layer and two dense layers, outperformed other LSTM architectures, achieving the highest accuracy of 92.17%. The LSTM models require a relatively short training time of 23–27 minutes for 3,847 articles and have a detection time of 3.8–4.1 ms per article. The RoBERTa classifiers outperformed all LSTM models with an accuracy of over 97% and a significantly better training time, with a margin of more than 50% compared to LSTM classifiers, although they had a slightly longer test time. Both LSTM and RoBERTa models outperformed the Naïve Bayes and SVM benchmark methods in terms of accuracy, precision, and recall. Therefore, this study shows that both LSTM and RoBERTa methods are reliable and can be reasonably implemented for real-time fake news detection.
- Conference Article
- 10.2523/iptc-25235-ms
- Jan 13, 2026
Optimizing WAG processes is crucial for maximizing oil recovery and carbon sequestration efficiency in CO2 Enhanced Oil Recovery (EOR) projects. This study presents a comparative evaluation of two deep learning models, Temporal Fusion Transformer (TFT) and Long Short-Term Memory (LSTM), for multivariate forecasting of oil production, CO2 sequestration efficiency, utilization, and retention in WAG scenarios. A comprehensive dataset of digitized monthly production and injection data from six U.S. fields was used to train and validate both models. The results show that the LSTM model outperforms the TFT model in terms of accuracy and reduced errors, with improved predictive capabilities for oil production and CO2 sequestration efficiency. The LSTM model achieved high predictive accuracy, with R2 values reaching 0.99 and robust mean absolute error (MAE) and root mean squared error (RMSE) factors. In contrast, the TFT model achieved R2 values of 0.87 and higher MAE and RMSE factors. Head-to-head accuracy favors the LSTM model, as it achieves higher fit quality and lower errors in the majority of comparisons for both EOR oil and CO2. The size of the margin varies by case, but the direction of the difference is consistent. This outcome, though unexpected due to the advanced nature of TFT, aligns with the learning conditions in this study, where the forecast horizon is short, the historical time series per field are modest in length, and the covariate set is compact. Under such conditions, a compact recurrent architecture like LSTM tends to generalize efficiently and resist variance from operational noise, while a higher-capacity transformer like TFT is more sensitive to data volume and hyperparameters such as encoder length, attention size, learning rate, and dropout. Additionally, LSTM converges more stably with modest tuning, enabling it to reach strong solutions within the available data and horizon. 
The forecasts generated by the LSTM model enabled actionable short-term operational recommendations, such as adjusting the WAG ratio to optimize CO2 retention and utilization efficiency. For example, adjusting the WAG ratio from 0.9 to 1.9 in the Denver Unit increased CO2 retention by 168% and utilization efficiency by 50%. These optimizations support enhanced carbon abatement by reducing CO2 recycling and improving sequestration permanence. This study demonstrates the potential of deep learning models, particularly LSTM, to optimize WAG processes and improve the efficiency of CO2 EOR operations. The approach offers a scalable, data-driven alternative to simulation workflows, aligning EOR operations with climate and sustainability targets. To the best of our knowledge, this is the first comparative study of LSTM and TFT models for real-field CO2 EOR forecasting and WAG strategy optimization.
- Research Article
- 10.53623/gisa.v5i1.605
- Apr 13, 2025
- Green Intelligent Systems and Applications
Automatic Speech Recognition (ASR) faced challenges in accuracy and noise robustness, particularly in Bahasa Indonesia. This research addressed the limitations of single feature extraction methods, such as Mel-Frequency Cepstral Coefficients (MFCC), which were sensitive to noise, and Relative Spectral Transform - Perceptual Linear Predictive (RASTA-PLP), which was less effective in frequency representation, by proposing a hybrid approach that combined both techniques using Long Short-Term Memory (LSTM) models. MFCC enhanced spectral accuracy, while RASTA-PLP improved noise robustness, resulting in a more adaptive and informative acoustic representation. The evaluation demonstrated that the hybrid method outperformed single and non-extraction approaches, achieving a Character Error Rate (CER) of 0.5245 on clean data and 0.8811 on noisy data, as well as a Word Error Rate (WER) of 0.9229 on clean data and 1.0015 on noisy data. Although the hybrid approach required longer training times and higher memory usage, it remained stable and effective in reducing transcription errors. These findings suggested that the hybrid method was an optimal solution for Indonesian speech recognition in various acoustic conditions.
- Research Article
2
- 10.17485/ijst/v15i29.730
- Aug 5, 2022
- Indian Journal Of Science And Technology
Objectives: Atmospheric ozone plays an important role in global climate change, human health and environmental conditions. Accurate and timely prediction of variation in Total Column Ozone (TCO) concentration is very important for both climatology and environment. In this context, the present study aims to utilize the advancement of Artificial Intelligence in Machine learning and a conventional method to develop models for the prediction of TCO concentration using historical time series data over a tropical region, India. Methods: Long Short-Term Memory (LSTM) deep learning networks are very useful in classifying, processing and making predictions based on time series data. In this work, a Multiple Linear Regression (MLR) model and Long Short-Term Memory (LSTM) models are developed for forecasting TCO over a tropical region. Statistical parameters such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) were used to analyze and evaluate the performance of the proposed models. Findings: In the present study the TCO concentration varied between 217.4 DU and 288.6 DU. The performances of the MLR and LSTM models were compared. From the results, it is evident that the values of RMSE, MAE and MAPE of the MLR model are 3.807924, 2.93766 and 1.15 respectively, whereas the values for the LSTM model are 3.074492, 2.574 and 1.01 respectively. LSTM was found to be the more accurate model for the study data set; the predicted TCO concentration has a very good correlation with the actual observations and does not rely on the meteorological parameters. Novelty: The LSTM models are usually employed in the field of machine translation, speech recognition and for the prediction of traffic levels, stock levels and air pollutants. This work utilizes the fast numerical machine learning computing library TensorFlow and a high-level neural network library Keras that runs on top of TensorFlow to develop an LSTM network model for the prediction of the concentration levels of TCO.
Keywords: Total Column Ozone; Regression; Forecasting; Climate Change; Long Short-Term Memory
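The three error metrics used above to compare MLR and LSTM can be sketched as follows, assuming their standard definitions. The function names and the sample values (invented TCO readings in Dobson units, chosen only to fall inside the reported 217.4 to 288.6 DU range) are hypothetical and do not come from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Invented observations and forecasts, in DU.
observed = [250.0, 260.0, 240.0]
forecast = [252.0, 257.0, 240.0]
print(rmse(observed, forecast), mae(observed, forecast), mape(observed, forecast))
```

RMSE and MAE are expressed in the units of the target (here DU), while MAPE is a percentage, which is why the reported MAPE values are on a different scale from the other two.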
- Research Article
84
- 10.3389/fbioe.2020.00063
- Feb 12, 2020
- Frontiers in Bioengineering and Biotechnology
Falls in the elderly are a major public health concern due to their high prevalence, serious consequences and heavy burden on society. Many falls in older people happen within a very short time, which makes it difficult to predict a fall before it occurs and then to provide protection for the person who is falling. The primary objective of this study was to develop deep neural networks for predicting a fall during its initiation and descent but before the body impacts the ground so that a safety mechanism can be enabled to prevent fall-related injuries. We divided the falling process into three stages (non-fall, pre-impact fall and fall) and developed deep neural networks to perform three-class classification. Three deep learning models, convolutional neural network (CNN), long short term memory (LSTM), and a novel hybrid model integrating both convolution and long short term memory (ConvLSTM) were proposed and evaluated on a large public dataset of various falls and activities of daily living (ADL) acquired with wearable inertial sensors (accelerometer and gyroscope). Fivefold cross validation results showed that the hybrid ConvLSTM model had mean sensitivities of 93.15, 93.78, and 96.00% for non-fall, pre-impact fall and fall, respectively, which were higher than both LSTM (except the fall class) and CNN models. ConvLSTM model also showed higher specificities for all three classes (96.59, 94.49, and 98.69%) than LSTM and CNN models. In addition, latency test on a microcontroller unit showed that ConvLSTM model had a short latency of 1.06 ms, which was much lower than LSTM model (3.15 ms) and comparable with CNN model (0.77 ms). High prediction accuracy (especially for pre-impact fall) and low latency on the microboard indicated that the proposed hybrid ConvLSTM model outperformed both LSTM and CNN models.
These findings suggest that our proposed novel hybrid ConvLSTM model has great potential to be embedded into wearable inertial sensor-based systems to predict pre-impact fall in real-time so that protective devices could be triggered in time to prevent fall-related injuries for older people.
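Per-class sensitivity and specificity for a three-class problem such as this one can be computed one-vs-rest from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the function name and the matrix values are invented for illustration and are not the study's results.

```python
def sensitivity_specificity(confusion, k):
    """One-vs-rest sensitivity and specificity for class k.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # class k samples predicted as another class
    fp = sum(row[k] for row in confusion) - tp   # other classes predicted as class k
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


# Invented counts; class order: non-fall, pre-impact fall, fall.
confusion = [
    [90, 8, 2],
    [5, 85, 10],
    [1, 4, 95],
]
for k, name in enumerate(["non-fall", "pre-impact fall", "fall"]):
    sens, spec = sensitivity_specificity(confusion, k)
    print(name, round(sens, 4), round(spec, 4))
```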
- Research Article
15
- 10.30630/joiv.7.3-2.2344
- Nov 30, 2023
- JOIV : International Journal on Informatics Visualization
Cryptocurrencies created by Nakamoto in 2009 have gained significant interest due to their potential for high returns. However, the cryptocurrency market's unpredictability makes it challenging to forecast prices accurately. To tackle this issue, a deep learning model has been developed that utilizes Long Short-Term Memory (LSTM) neural networks and Convolutional Neural Networks (CNNs) to predict cryptocurrency prices. LSTMs, a type of recurrent neural network, are well-suited for analyzing time series data and have been successful in various prediction applications. Additionally, CNNs, primarily used for image analysis tasks, can be employed to extract relevant patterns and characteristics from input data in Bitcoin price prediction applications. This study contributes to the existing related works on cryptocurrency price prediction by exploring various predictive models and techniques, which involve a machine learning model, a deep learning model, time series analysis, and a hybrid model that combines deep learning methods to predict cryptocurrency prices and to enhance the accuracy and reliability of the price predictions. To ensure accurate predictions in this study, a trustworthy dataset from investing.com was sought. The dataset, sourced from investing.com, consists of 1826 time series data samples. The dataset covers the time frame from January 1, 2018, to December 31, 2022, providing data for a period of 5 years. Subsequently, pre-processing was conducted on the dataset to guarantee the quality of the input. As a result of absent values and concerns regarding the dataset's obsolescence, an alternative dataset was sourced to avoid these issues. The performance of the LSTM and CNN models was evaluated using root mean squared error (RMSE), mean squared error (MSE), mean absolute error (MAE) and R-squared (R2).
It was observed that they outperformed each other to a certain degree in short-term forecasts compared to long-term predictions, where the R2 values for LSTM range from 0.973 to 0.986, while for CNNs, they range from 0.972 to 0.988 for 1 day, 3 days and 7 days windows length. Nevertheless, the LSTM model demonstrated the most favorable performance with the lowest error rate. The RMSE values for the LSTM model ranged from 1203.97 to 1645.36, whereas the RMSE values for the CNNs model ranged from 1107.77 to 1670.93. As a result, the LSTM model exhibited a lower error rate in RMSE and achieved the highest accuracy in R2 compared to the CNNs model. Considering these comparative outcomes, the LSTM model can be deemed as the most suitable model for this specific case.
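For reference, the R2 (coefficient of determination) values quoted above are assumed to follow the standard definition, one minus the ratio of residual to total sum of squares. The sketch below uses invented numbers, not price data from the study, and the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented values for illustration only.
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

Because R2 is scale-free and RMSE is in price units, two models can rank differently on the two metrics, which is worth keeping in mind when reading the ranges reported above.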
- Research Article
27
- 10.1016/j.jhydrol.2023.130076
- Aug 10, 2023
- Journal of Hydrology
Predicting the performance of green stormwater infrastructure using multivariate long short-term memory (LSTM) neural network
- Research Article
15
- 10.3390/make2030014
- Aug 15, 2020
- Machine Learning and Knowledge Extraction
A Long Short Term Memory (LSTM) based sales model has been developed to forecast the global sales of hotel business of Travel Boutique Online Holidays (TBO Holidays). The LSTM model is a multivariate model; input to the model includes several independent variables in addition to a dependent variable, viz., sales from the previous step. One of the input variables, “number of active bookers per day”, is estimated for the same day as sales. This need for estimation requires the development of another LSTM model to predict the number of active bookers per day. The number of active bookers is variable, so the predicted value is used as an input to the sales forecasting model. The use of a predicted variable as an input variable to another model increases the chance of uncertainty entering the system. This paper discusses the quantum of variability observed in sales predictions for various uncertainties or noise due to the estimation of the number of active bookers. For the purposes of this study, different noise distributions such as normalized, uniform, and logistic distributions are used, among others. Analyses of predictions demonstrate that the addition of uncertainty to the number of active bookers via dropouts as well as to the lagged sales variables leads to model predictions that are close to the observations. The least squared error between observations and predictions is higher for uncertainties modeled using other distributions (without dropouts) with the worst predictions being for Gumbel noise distribution. Gaussian noise added directly to the weights matrix yields the best results (minimum prediction errors). One possibility of this uncertainty could be that the global minimum of the least squared objective function with respect to the model weight matrix is not reached, and therefore, model parameters are not optimal. The two LSTM models used in series are also used to study the impact of corona virus on global sales.
By introducing a new variable called the corona virus impact variable, the LSTM models can predict corona-affected sales within five percent (5%) of the actuals. The research discussed in the paper finds LSTM models to be effective tools that can be used in the travel industry as they are able to successfully model the trends in sales. These tools can be reliably used to simulate various hypothetical scenarios also.
- Conference Article
1
- 10.30632/spwla-2023-0076
- Jun 10, 2023
Precise rock lithology identification from well logs is critical for reservoir characterization and field development. Traditional knowledge-based lithology interpretation is highly dependent on the interpreter’s experience and judgment, which could lead to erroneous decision making or biased prediction. To reduce human involvement and improve interpretation efficiency and consistency, a knowledge-constrained long short-term memory (LSTM) network solution is introduced. In this study, LSTM networks are applied with different constraints to obtain the mapping relations and validate the knowledge-constrained LSTM model accordingly. The entire workflow mainly includes input logging data preprocessing, different constraint validations during the LSTM model training, and validation processes. This study covers and compares the direct LSTM model without constraints, rectangular constraint LSTM (RCLSTM), and Gaussian window weighted constraint LSTM (GWLSTM). In particular, GWLSTM applies a sample cluster as input instead of single sample points. The weight of the sample point is controlled by a distance-correlated Gaussian window, which means the closer to the predicting point, the greater the impact on the prediction. LSTM, RCLSTM, and GWLSTM models are tested on a field data set of five wells in a typical sandstone gas reservoir. Two wells are used to train the network, while the other three wells are used for network assessment. The test results demonstrate that by applying LSTM networks to establish the mapping between the logging curves (e.g., CNL, DT, DEN, GR, and RD) and rock lithology, rock lithologies in target formation can be predicted from well logs. Moreover, the lithology predictions by the GWLSTM model are more accurate than those of conventional LSTM and RCLSTM models, especially for thin layers. In conclusion, GWLSTM networks improve lithology identification accuracy by taking stratigraphic sequences into consideration.
The Gaussian window constraints are also more effective than rectangular window constraints for thin layer predictions. Lastly, GWLSTM doesn’t require a large training data set, which makes it advantageous for reservoirs with limited wells.
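The distance-correlated Gaussian window described above can be illustrated with a small sketch. The abstract does not give the exact weighting formula, so the Gaussian form, the normalization to unit sum, the `sigma` parameter, and the depth values below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def gaussian_window_weights(depths, target_depth, sigma):
    """Weights that decay with distance from the prediction point and sum to 1.

    Hypothetical form: w_i is proportional to exp(-(d_i - d_target)^2 / (2 * sigma^2)).
    """
    raw = [math.exp(-((d - target_depth) ** 2) / (2 * sigma ** 2)) for d in depths]
    total = sum(raw)
    return [w / total for w in raw]


# Invented sample depths (in metres) around a prediction point at 100.0 m.
depths = [99.0, 99.5, 100.0, 100.5, 101.0]
weights = gaussian_window_weights(depths, target_depth=100.0, sigma=0.5)
print([round(w, 4) for w in weights])
```

A rectangular window would instead give every sample in the cluster the same weight, which may explain why it blurs thin layers more than a window that favors samples nearest the prediction point.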
- Research Article
41
- 10.1016/j.csl.2017.01.013
- Feb 27, 2017
- Computer Speech & Language
Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend
- Book Chapter
1
- 10.1007/978-3-030-45183-7_16
- Jan 1, 2020
The long short-term memory (LSTM) model is widely used in multiple areas, mainly for speech recognition, natural language processing and activity recognition. In the last few years, we started to see many variants of LSTM for recurrent neural networks since its inception in 1997. However, there weren’t many studies that have addressed the LSTM’s gating mechanism. In this paper, we propose a novel LSTM framework where we modify the architecture of the LSTM unit by adding a new layer that we call the “outlier gate”. The latter controls the flow of information that goes into the LSTM cell. This added signal allows us to avoid both the carry-over effect that the outliers have on the forecasted point and a bias in the estimates of our LSTM model – caused by unusual or non-repetitive events. The proposed architecture led us to an end-to-end trainable model that we applied in this paper to a financial time-series forecasting problem. Our results demonstrate that the new proposed LSTM architecture achieves better performance than the state-of-the-art original LSTM model.
- Research Article
16
- 10.1016/j.apacoust.2024.110299
- Sep 20, 2024
- Applied Acoustics
Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCC) are widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrücken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. Compared to the worst combination, the SVM model achieves an accuracy of 81.1%, 80.6%, and 71.7%, with improvements of 19.6%, 16.10%, and 14.90% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively. To validate the generalization of these findings, we employ the Long Short-Term Memory (LSTM) model as a validation model. 
Remarkably, the LSTM model also demonstrates improved accuracy of 14.12%, 10.10%, and 6.68% across the datasets when utilizing the optimal combination of parameters. The optimal parameters are validated using an external voice pathology dataset (TACTICAS dataset). The results demonstrate the generalization capabilities of the optimized parameters across various pathologies, machine-learning models, and languages.
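To make the role of frame length and hop length concrete, the sketch below counts how many analysis frames a signal yields for a given setting. It assumes a 16 kHz sample rate and no padding at the signal edges; both are illustrative assumptions, and audio libraries that pad the signal will report different counts. The function name is hypothetical.

```python
def num_frames(num_samples, frame_length, hop_length):
    """Number of full frames that fit in the signal when no padding is applied."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // hop_length


sample_rate = 16000            # assumed sample rate in Hz
signal = 1 * sample_rate       # one second of audio, in samples
frame = int(0.050 * sample_rate)  # 50 ms frame length -> 800 samples

for hop_ms in (10, 20):
    hop = int(hop_ms / 1000 * sample_rate)
    # Each frame would contribute one vector of MFCCs (e.g. 30 coefficients).
    print(hop_ms, "ms hop ->", num_frames(signal, frame, hop), "frames")
```

Doubling the hop length roughly halves the number of frames, so the classifier sees a coarser time resolution, which is consistent with the reported drop in accuracy as hop length increases.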
- Conference Article
3
- 10.1109/icievicivpr52578.2021.9564239
- Aug 16, 2021
One of the most remarkable advantages of speech recognition (SR) technology is speech-to-text (STT) conversion. This paper's focal point is to recognize Bangla words among various speakers and noisy environments and convert them into text. Mel frequency cepstral coefficient (MFCC) has been used for extracting words from the audio file. Using gated recurrent units (GRU) along with directional long short term memory (LSTM), the continuity of words and next word-level prediction in a sentence have been implemented. An existing dataset that is enriched by recorded speech has been used here. The Keras-Tensorflow toolkit has been used as the software tool for training. The highest training accuracy, obtained from the GRU architecture in the acoustic model (AM), is 94.44%, while the test accuracy is only 47% due to the small test dataset. On the other hand, test accuracy reaches a maximum of 45.81% with the LSTM model in the language model (LM). The proposed method shows good training and testing accuracy based on applying the GRU model to the AM, LSTM to the LM, and the total number of recordings of each sentence.