Articles published on Music generation
508 Search results
- New
- Research Article
- 10.55401/rxxkst95
- Dec 8, 2025
- Journal of Science and Technology
- Quang Minh Trinh + 3 more
A text-to-music generator is an artificial intelligence system that composes songs from user-provided text prompts by leveraging large datasets for training. This research explores the theoretical foundations linking language and music through semantic, emotional, and structural analysis, and demonstrates practical integration of AI music generation into software via APIs. To illustrate, simulated Python code examples are provided using a fictional Suno AI API, alongside references to platforms such as Boomy, AIVA, Amper Music, and OpenAI Jukebox. These integrations highlight how developers and businesses can embed automated music creation into applications for education, entertainment, therapy, and digital media, thereby advancing interdisciplinary research and software innovation.
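A minimal sketch of the kind of API integration the article describes, in the spirit of its fictional Suno AI examples. The endpoint URL, request parameters, and response format below are illustrative assumptions, not any real service's API.

```python
# Hypothetical text-to-music API call; endpoint, fields, and response are assumed.
import requests

API_URL = "https://api.example-music-gen.com/v1/generate"   # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                                      # placeholder credential

def generate_track(prompt: str, duration_s: int = 30) -> bytes:
    """Request a short audio clip for a text prompt and return raw audio bytes."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "duration": duration_s, "format": "mp3"},
        timeout=120,
    )
    response.raise_for_status()
    return response.content

if __name__ == "__main__":
    audio = generate_track("calm piano with soft rain, 90 bpm")
    with open("generated_track.mp3", "wb") as f:
        f.write(audio)
```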
- New
- Research Article
- 10.1016/j.sasc.2025.200221
- Dec 1, 2025
- Systems and Soft Computing
- Weina Yu
The construction of improved GCA multi-style music generation model for music intelligent teaching classroom
- New
- Research Article
- 10.1016/j.engappai.2025.112131
- Dec 1, 2025
- Engineering Applications of Artificial Intelligence
- Fangzhu Jin + 2 more
A theme music generation model based on hybrid variational autoencoders and conditional generative adversarial networks
- Research Article
- 10.54254/2753-7064/2025.ns29129
- Nov 5, 2025
- Communications in Humanities Research
- Jiayi Zhang
The intersection of artificial intelligence and music has developed rapidly in recent years, driven by advances in deep learning and the increasing availability of multimodal datasets. This review surveys recent progress in music understanding and generation with Artificial Intelligence (AI) through Large Language Models (LLMs) along three lines: agent/controller systems (e.g., AudioGPT, MusicAgent, CoComposer), multimodal fusion/decoders (MuMu-LLaMA, DeepResonance), and symbolic score models (ChatMusician, MuseCoco). The paper summarizes what each does well, such as tool-orchestrated workflows and cross-modal alignment for text-/image-/video-to-music. Next, the paper introduces the common, fixable challenges in each category, such as limited long-form coherence, uneven controllability, and dataset bias. Other key challenges include orchestration reliability and dependence on pretrained decoders. The paper then proposes short-term remedies such as multi-agent planning, longer-context modelling, and broader, well-labeled pretraining data, before predicting that the field of music-based Large Language Models is moving toward hybrid systems that integrate into real workflows in the near future. This mapping provides a concise summary of advances in the musical understanding of Large Language Models.
- Research Article
- 10.47772/ijriss.2025.925ileiid000045
- Nov 5, 2025
- International Journal of Research and Innovation in Social Science
- Juriani Jamaludin + 4 more
This study explores the integration of artificial intelligence (AI) in music education as an innovative approach to enhance language learning through song lyrics. Music is widely recognised as a powerful pedagogical tool, with its lyrics offering meaningful context for vocabulary building, grammar practice, and expressive communication. By combining AI with music education, the study aims to create engaging and multidisciplinary learning experiences that foster both linguistic growth and artistic expression. The approach emphasises lyric writing, songwriting, and guided singing to improve students’ vocabulary, grammar, pronunciation, and vocal skills. Learners compose original works with the help of AI tools while practising accurate English vowels, clear pronunciation, and appropriate vocal techniques. The methodology involves AI-assisted music generation, lyric composition, guided vocal practice, and peer presentations, encouraging collaboration, creativity, and reflective learning. Findings show that students achieve notable improvements in language proficiency, confidence, and creative expression. Importantly, the accessibility of AI allows participation from all learners, regardless of prior musical training, making songwriting and performance both inclusive and enjoyable. The novelty of this study lies in its cross-disciplinary framework that unites music education, language acquisition, vocal training, and AI technology. Results highlight how AI-supported lyric composition and guided singing enhance linguistic skills while nurturing creativity, critical thinking, and student engagement. This approach demonstrates the transformative potential of AI in education, offering new pathways where music, language, and technology converge to enrich teaching and learning.
- Research Article
- 10.1177/14727978251391323
- Oct 29, 2025
- Journal of Computational Methods in Sciences and Engineering
- Pengcheng Xiao
Real-time collaborative music creation requires dynamic systems that can understand musical sequences, tonal structure, and rhythmic flow in live contexts. Conventional sequence generation methods using static deep learning models often struggle to adapt to changing musical input. The limited use of audio features and non-optimized architectures leads to reduced fluency and stylistic coherence in generated improvisations. The objective of this work is to enable adaptive and context-sensitive real-time music improvisation that responds fluidly to symbolic and audio-based inputs. A Biogeography-Based Optimization-driven stacked Long Short-Term Memory model (BB-Stacked LSTM) is introduced, combining evolutionary optimization with temporal deep learning to improve improvisation quality and model adaptability. The BB-Stacked LSTM system uses evolutionary principles to optimize sequence modeling parameters, enhancing both accuracy and expressiveness in music generation. Performance-oriented datasets featuring paired audio and symbolic data are used, including genres such as jazz and classical. One-hot encoding is applied to symbolic note sequences. Sequence smoothing is achieved through a Hidden Markov Model. Time-aligned symbolic and audio data are structured for temporal modeling. Mel-Frequency Cepstral Coefficients (MFCC) are extracted from audio to capture spectral and timbral properties. The stacked LSTM learns sequence progression, while the Biogeography-Based Optimization (BBO) algorithm tunes architectural parameters, including layer depth, unit count, and learning rate, to maximize musical coherence. Generated sequences exhibit an improved harmony score of 4.6. The BB-Stacked LSTM approach enhances real-time music generation by integrating evolutionary optimization with deep temporal modeling.
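To make the described pipeline concrete, the sketch below shows MFCC extraction and a stacked LSTM whose depth, unit count, and learning rate would be the parameters a BBO search tunes. Library choices (librosa, Keras), shapes, and hyperparameter defaults are assumptions, not the paper's implementation.

```python
# Sketch: MFCC features plus a stacked LSTM with BBO-tunable depth/units/lr.
import librosa
import numpy as np
from tensorflow.keras import layers, models, optimizers

def extract_mfcc(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return an (n_frames, n_mfcc) MFCC matrix capturing spectral/timbral properties."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

def build_stacked_lstm(seq_len: int, n_features: int, n_notes: int,
                       depth: int = 2, units: int = 128, lr: float = 1e-3):
    """Stacked LSTM over time-aligned features; depth, units, and lr are BBO candidates."""
    model = models.Sequential()
    model.add(layers.Input(shape=(seq_len, n_features)))
    for i in range(depth):
        model.add(layers.LSTM(units, return_sequences=(i < depth - 1)))
    model.add(layers.Dense(n_notes, activation="softmax"))  # one-hot note prediction
    model.compile(optimizer=optimizers.Adam(lr), loss="categorical_crossentropy")
    return model
```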
- Research Article
- 10.1038/s41598-025-20179-3
- Oct 16, 2025
- Scientific Reports
- Chongbin Zhang + 2 more
Music source separation, as a fundamental task in intelligent audio processing, plays a critical role in enhancing the performance of music generation, editing, and understanding systems. However, existing separation models often suffer from structural limitations such as reliance on a single modeling path, entangled time-frequency representations, and difficulty in adapting to heterogeneous sound sources. Furthermore, they struggle to maintain an effective balance between long-range dependency modeling and inference efficiency. To address these challenges, this paper proposes a novel dual-path state space modeling architecture, MSNet. By introducing decoupled modeling mechanisms for temporal and frequency pathways, and incorporating Mamba-based state space units for multidimensional structural parsing of audio signals, MSNet enhances selective control and structural representation in time-frequency modeling. Experimental results demonstrate that MSNet achieves state-of-the-art performance on the MUSDB18 dataset across multiple evaluation metrics. In particular, it shows superior robustness and stability when dealing with dynamically complex sources such as vocals and drums. Additionally, the model achieves a real-time factor (RTF) below 0.1 while maintaining superior separation quality, making it suitable for deployment in practical applications. This study not only demonstrates the feasibility of state space models for complex audio modeling but also introduces a new architectural paradigm for music source separation that balances accuracy and efficiency. The implementation is publicly available at: https://github.com/NMLAB8/Mamba-S-Net.
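The real-time factor quoted above is simply processing time divided by audio duration, so RTF below 0.1 means separation runs more than ten times faster than real time. The sketch below shows how such a measurement could be taken, with `separate` standing in for any separation model (MSNet itself is available at the linked repository).

```python
# Sketch: measuring the real-time factor (RTF) of a separation call.
import time
import numpy as np

def real_time_factor(separate, mixture: np.ndarray, sample_rate: int) -> float:
    """Return processing-time / audio-duration for one separation call."""
    start = time.perf_counter()
    separate(mixture)                      # placeholder for the actual model call
    elapsed = time.perf_counter() - start
    duration = mixture.shape[-1] / sample_rate
    return elapsed / duration

if __name__ == "__main__":
    sr = 44100
    mix = np.zeros((2, sr * 10), dtype=np.float32)   # dummy stereo 10-second mixture
    print(real_time_factor(lambda x: x, mix, sr))    # no-op "model" for illustration
```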
- Research Article
- 10.1177/14727978251385149
- Oct 16, 2025
- Journal of Computational Methods in Sciences and Engineering
- Xi Wang + 2 more
With the development of artificial intelligence technology, automatic choreography, as an emerging cross-disciplinary field, has received increasing attention in research and application. To realize automatic choreography as well as the storage of dance curriculum resources, the study first introduces the bidirectional long short-term memory model and combines it with an attention mechanism for music generation. The attention mechanism enables the model to focus on important feature information by assigning different weights to different time steps, thereby better capturing the overall information of the music sequence and improving model performance. Then, the OpenPose model is used for human pose estimation, and a sequence-to-sequence model is used to generate dance movements matching the music. The experimental results show that the Att-BiLSTM model outperforms the traditional model in accuracy, recall, precision, and F1 score. Compared with the traditional LSTM model, the accuracy of the Att-BiLSTM model increased from 85.2% to 94.9%, the recall from 79.1% to 90.3%, the precision from 87.5% to 95.8%, and the F1 score from 84.7% to 94.7%, a significant improvement that reflects the effect of the attention mechanism on BiLSTM performance. In human pose estimation, the OpenPose model reached a keypoint detection accuracy of 0.927 and a partial affinity field prediction score of 0.854, with a frame rate of 15 FPS. The Seq2Seq network achieved the highest scores for movement flow, naturalness, and synchronization in dance movement generation, with a movement coherence index of 0.92 and a music rhythm matching score of 0.95. The results demonstrate that the proposed network model has significant advantages in the coherence, naturalness, and synchronization with music of the generated movements. This is of great practical significance for promoting the development of automatic choreography technology and the innovation of dance education.
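As a rough illustration of the Att-BiLSTM idea, the sketch below builds a bidirectional LSTM whose per-time-step outputs are pooled with a learned attention weighting before classification. Layer sizes, input shape, and the pooling form are assumptions rather than the paper's exact architecture.

```python
# Sketch: BiLSTM encoder with attention pooling over time steps (Keras).
from tensorflow.keras import layers, models

def build_att_bilstm(seq_len: int, n_features: int, n_classes: int):
    inputs = layers.Input(shape=(seq_len, n_features))
    # Bidirectional LSTM returns one hidden state per time step: (batch, seq, 256).
    h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)
    scores = layers.Dense(1, activation="tanh")(h)      # one score per time step
    weights = layers.Softmax(axis=1)(scores)             # attention weights over time
    context = layers.Dot(axes=1)([weights, h])            # weighted sum -> (batch, 1, 256)
    context = layers.Flatten()(context)
    outputs = layers.Dense(n_classes, activation="softmax")(context)
    return models.Model(inputs, outputs)
```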
- Research Article
- 10.1177/14727978251385201
- Oct 14, 2025
- Journal of Computational Methods in Sciences and Engineering
- Sha Li
With the rapid development of computer information technology in China, algorithmic composition has received growing attention in the field of musical composition. However, existing AI-based composition methods have relatively low accuracy, and their output falls short of practical requirements for complex tracks. In view of this, this research designed an automatic music element generation method based on a recurrent neural network, and proposes a generation model built on a resonant neural network. The improved algorithm was experimentally validated. The experiments showed that the system combining the average field connection network, the initial universal connection of the resonant neural network, and detuned oscillators performed best, with an F-value of 77.2%. The chord generation accuracy of the LSTM-RNN model reached 81.99%, 81.65%, 81.02%, and 80.47%. The designed method can effectively carry out music production, meet high-precision design requirements, and achieve good design results. This indicates that the proposed music element generation method based on recurrent gradient frequency networks performs well. It can accurately generate music elements, providing assistance and reference for the development of automatic music element generation technology in China. Applying this method to more diverse music-generation scenarios is recommended for future work.
- Research Article
- 10.1007/s40745-025-00643-7
- Oct 12, 2025
- Annals of Data Science
- Pengzhan Qin
Automatic Music Generation with Multi-module Neural Networks for Chord, Rhythm, and Pitch Modeling
- Research Article
- 10.1038/s41598-025-19348-1
- Oct 9, 2025
- Scientific Reports
- Yuting Ni
Existing music score generation methods are limited by scene constraints, and the quality of their generated scores is relatively low. To address these problems, an intelligent music score generation method combining the short-time Fourier transform and an improved convolutional neural network is proposed. The study first uses the short-time Fourier transform to perform time-frequency transformation of music signals, and then feeds the transformed time-frequency information into an improved convolutional neural network model. The model improves the accuracy and diversity of music score generation by introducing a label enhancement strategy and an internal convolution structure. Experimental results indicate that the method effectively improves the quality of music score generation across various music datasets and generalizes well. The matching rate and completion rate of the scores generated by the proposed method were 92% and 95%, respectively, and its score generation time was only 1.05 s. The proposed method improves the efficiency and quality of music score generation, and can help drum learners understand their own performance in time and receive feedback on their training to improve learning efficiency.
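A hedged sketch of the described front end follows: an STFT turns the audio into a log-magnitude time-frequency image that is fed to a plain CNN. The paper's label enhancement strategy and internal convolution structure are not reproduced; all sizes and parameters are assumptions.

```python
# Sketch: STFT spectrogram front end feeding a simple CNN classifier.
import librosa
import numpy as np
from tensorflow.keras import layers, models

def audio_to_spectrogram(path: str, n_fft: int = 2048, hop: int = 512) -> np.ndarray:
    """Return a log-magnitude STFT spectrogram of shape (freq_bins, frames, 1)."""
    y, sr = librosa.load(path, sr=22050)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    log_mag = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    return log_mag[..., np.newaxis]

def build_cnn(input_shape, n_symbols: int):
    """Plain CNN over the spectrogram predicting per-clip score symbols."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(n_symbols, activation="softmax"),
    ])
```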
- Research Article
- 10.5815/ijitcs.2025.05.02
- Oct 8, 2025
- International Journal of Information Technology and Computer Science
- Vijayan R + 3 more
The rise of virtual instruments has revolutionized music production, providing new avenues for creating music without the need for physical instruments. However, these systems rely on costly hardware, such as MIDI controllers, limiting accessibility. As an alternative, 3D gesture-based virtual instruments have been explored to emulate the immersive experience of MIDI controllers. Yet, these approaches introduce accessibility challenges by requiring specialized hardware, such as depth-sensing cameras and motion sensors. In contrast, 2D gesture systems using RGB cameras are more affordable but often lack extended functionalities. To address these challenges, this study presents a 2D virtual piano system that utilizes hand gesture recognition. The system enables accurate gesture-based control, real-time volume adjustments, control over multiple octaves and instruments, and automatic sheet music generation. OpenCV, an open-source computer vision library, and Google’s MediaPipe are employed for real-time hand tracking. The extracted hand landmark coordinates are normalized based on the wrist and scaled for consistent performance across various RGB camera setups. A bidirectional long short-term memory (Bi-LSTM) network is used to evaluate the approach. Experimental results show 95% accuracy on a public Kaggle dynamic gesture dataset and 97% on a custom-designed dataset for virtual piano gestures. Future work will focus on integrating the system with Digital Audio Workstations (DAWs), adding advanced musical features, and improving scalability for multiple-player use.
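A brief sketch of the tracking front end: MediaPipe Hands over an OpenCV frame, with the 21 hand landmarks translated so the wrist is the origin and then scaled. The particular scale reference is an assumption, and the downstream Bi-LSTM classifier is omitted.

```python
# Sketch: wrist-normalized hand landmarks from MediaPipe Hands.
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)

def wrist_normalized_landmarks(frame_bgr: np.ndarray) -> list:
    """Return one (21, 3) array per detected hand, translated so the wrist is the origin."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    out = []
    for hand in results.multi_hand_landmarks or []:
        pts = np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark])
        pts -= pts[0]                            # landmark 0 is the wrist
        scale = np.linalg.norm(pts[9]) or 1.0    # middle-finger base as an assumed scale reference
        out.append(pts / scale)
    return out
```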
- Research Article
- 10.1016/j.eswa.2025.130059
- Oct 1, 2025
- Expert Systems with Applications
- Jing Luo + 2 more
BandCondiNet: Parallel Transformers-based Conditional Popular Music Generation with Multi-View Features
- Research Article
- 10.1016/j.aej.2025.05.053
- Oct 1, 2025
- Alexandria Engineering Journal
- Lili Liu + 2 more
MusDiff: A multimodal-guided framework for music generation
- Research Article
- 10.1142/s021812662550478x
- Sep 30, 2025
- Journal of Circuits, Systems and Computers
- Lan Sha
Human–computer interaction and computer vision have witnessed great development in recent years, and their applications in various fields are becoming increasingly mature. With the improvement of computer processing power, their importance in music creation, editing, recording, and other aspects has become increasingly prominent. The music production space is the core of music creation, and its constituent elements are closely related to the quality and timbre of music. At present, many musicians and producers lack the guidance of advanced technical means and the application of relevant algorithms when analyzing the elements of the music production space, which ultimately leads to low production efficiency and low music quality. This paper analyzed the elements of the music production space using a Recurrent Neural Network (RNN). First, it summarized the elements of the music space from three aspects: melody, voice structure, and timbre. Based on the mechanism of music space composition, it proposed a melody generation method based on a recurrent neural network and a chord music generation method based on a Grouping Combining Algorithm (GCA). Finally, the specific application of this method was tested through experiments. The experimental results showed that the new method can improve chord generation accuracy by 6.02% and also improve music production efficiency.
- Research Article
- 10.1080/09298215.2025.2540434
- Sep 10, 2025
- Journal of New Music Research
- Satoshi Nishimura + 1 more
Pytakt is a new Python library for the text-based description, algorithmic generation, and real-time processing of symbolic (event-level) music information. The library provides embedded textual description that makes it possible to represent scores containing chords, polyphony, and performance information, such as velocity or control changes, in a compact form. Scores can be concatenated, merged, or repeated with operators, and various score transformations, such as diatonic transposition and pattern replacement, are available. Pytakt also has real-time MIDI input/output functions with a priority queue, which are useful in interactive music applications, together with music-theoretic classes such as scales and chords. In addition, it incorporates basic analytic features, including the retrieval of active notes and controllers at any given time with a novel algorithm and a multi-track visualiser with a playback function. This paper introduces the design and features of Pytakt and presents its usage examples, including procedural music description and data preparation for machine learning. Furthermore, the results of performance comparison with other libraries are shown to confirm the lightweight property of our library.
- Research Article
- 10.1145/3749461
- Sep 3, 2025
- Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
- Sonyun Tao + 6 more
Physical Therapy (PT) is crucial for recovery from acute injuries and plays a vital role in promoting functional ability. However, at-home PT is often associated with boredom, lack of engagement, and difficulties in receiving immediate feedback about exercise performance, all of which contribute to poorer PT outcomes. This paper introduces an approach that leverages the engaging, therapeutic power of music to provide intuitive, real-time feedback and adaptive guidance during PT. Developed through a user-centered, research-through-design process, our "MusicalPT" system tracks and sonifies limb movements using computer vision and music generation techniques. We present a low-cost, lightweight, and robust tracking setup that can be used at home with a webcam and a computing device. Our lab-based evaluation shows that this musical guidance helps people perform exercises correctly and remain engaged, while promoting enjoyment and other dimensions of positive user experience. Based on these findings, we discuss broader design directions for interactive musical technologies in the context of health management.
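As an illustration of movement sonification in this spirit, the sketch below computes an elbow angle from three tracked keypoints and maps it linearly onto a MIDI pitch range. The specific mapping is a hypothetical example, not MusicalPT's actual design.

```python
# Sketch: map a tracked elbow flexion angle onto a two-octave MIDI pitch range.
import numpy as np

def joint_angle(shoulder: np.ndarray, elbow: np.ndarray, wrist: np.ndarray) -> float:
    """Elbow flexion angle in degrees from three 2D keypoints."""
    a, b = shoulder - elbow, wrist - elbow
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def angle_to_midi_pitch(angle_deg: float, lo: int = 48, hi: int = 72) -> int:
    """Map 0-180 degrees onto an assumed C3-C5 MIDI range, so fuller motion sounds higher."""
    t = np.clip(angle_deg / 180.0, 0.0, 1.0)
    return int(round(lo + t * (hi - lo)))
```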
- Research Article
- 10.3390/e27090901
- Aug 25, 2025
- Entropy
- Yang Li
Recently, music generation models based on deep learning have made remarkable progress in the field of symbolic music generation. However, existing methods often violate musical rules, in particular because control over harmonic structure is relatively weak. To address these limitations, this paper proposes a novel Entropy-Regularized Latent Diffusion for Harmony-Constrained generation (ERLD-HC) framework, which combines a variational autoencoder (VAE) and latent diffusion models with an entropy-regularized conditional random field (CRF). The model first encodes symbolic music into latent representations through the VAE, and then introduces the entropy-based CRF module into the cross-attention layer of the UNet during the diffusion process, achieving harmonic conditioning. The proposed model balances two key limitations in symbolic music generation: the lack of theoretical correctness in purely algorithm-driven methods and the lack of flexibility in rule-based methods. In particular, the CRF module learns classic harmony rules through learnable feature functions, significantly improving the harmonic quality of the generated Musical Instrument Digital Interface (MIDI) output. Experiments on the Lakh MIDI dataset show that, compared with the baseline VAE+Diffusion, the harmony-rule violation rates of the ERLD-HC model under self-generated and controlled inputs decreased by 2.35% and 1.4%, respectively. Meanwhile, the MIDI generated by the model maintains a high degree of melodic naturalness. Importantly, the harmonic guidance in ERLD-HC is derived from an internal CRF inference module, which enforces consistency with music-theoretic priors. While this does not yet provide direct external chord conditioning, it introduces a form of learned harmonic controllability that balances flexibility and theoretical rigor.
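As a simplified illustration of entropy regularization (not the paper's full entropy-regularized CRF inference), the sketch below adds an entropy term on a predicted harmony-label distribution to a cross-entropy training loss, one common form in which such a regularizer discourages over-confident labelings.

```python
# Sketch: one common form of entropy regularization on a categorical prediction.
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits: torch.Tensor, targets: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy minus beta times the mean entropy of the predicted distribution."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
    return ce - beta * entropy   # encouraging higher entropy penalizes over-confidence
```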
- Research Article
- 10.1142/s0129156425408149
- Aug 8, 2025
- International Journal of High Speed Electronics and Systems
- Tie Ru
This study proposes a deep learning-based automatic piano note recognition and performance generation system, which aims to enhance the accuracy and efficiency of piano music transcription and synthesis. Traditional methods for piano note recognition often rely on heuristic algorithms and handcrafted features, which struggle with complex polyphonic music and varying acoustic conditions. To address these limitations, we introduce an end-to-end deep learning framework that integrates convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract temporal and spectral features from piano audio recordings. The system is further enhanced with an attention mechanism to improve the differentiation of overlapping notes. A generative model is incorporated to synthesize expressive piano performances based on the recognized notes, ensuring a natural and human-like playing style. The proposed model is trained on large-scale piano performance datasets to enhance generalization across different playing styles and recording conditions. Furthermore, a reinforcement learning-based optimization strategy is introduced to refine the model’s performance in real-time applications. To improve robustness, the system integrates data augmentation techniques and adversarial training to mitigate errors caused by noise and variations in recording environments. Experimental results demonstrate that the proposed system achieves superior note recognition accuracy and generates high-quality piano performances compared to traditional approaches. These findings highlight the potential of deep learning in advancing automatic music transcription and synthesis technologies, paving the way for more interactive and intelligent music applications, such as real-time accompaniment systems, automatic music composition, and digital sheet music generation. This work contributes to bridging the gap between artificial intelligence and musical creativity, offering novel possibilities for both professional musicians and music enthusiasts.
- Research Article
- 10.5965/2525530410012025e0104
- Aug 5, 2025
- Orfeu
- Ivan Simurra + 1 more
In the 1950s, the Americans Hiller and Isaacson pioneered computer-generated music with the “Illiac Suite”. Despite advances in artificial intelligence (AI) systems, current machine music generation still follows the paradigm established by Hiller and Isaacson (STEELS, 2021). Concurrently, emergent research on human-machine co-creation is reshaping the creative industries, enabling computers to contribute to music, art, and cultural production in ways that were previously unimaginable. Computers now create music, art, and culture with potential for consumption (COMITÊ GESTOR DA INTERNET NO BRASIL, 2022). These transformations may alter music in epistemological and even ontological terms, restructuring the role of the composer. Researchers in computer science and other technology fields have sought foundations in the humanities, particularly in art-based research, to underpin their studies (CARAMIAUX; DONNARUMMA, 2021). In this regard, there is an urgent need for research originating in the academic field of music to establish a balanced dialogue with computer science and other technologies. It is worth noting that, in a context where music generation and AI projects are funded by professional software companies, with economic investment driven by “usability” (RUTZ, 2021), it is arduous for the music field to conduct practical research on collaborative music generation between humans and machines, since most current systems are not available for free experimentation. Therefore, this work aims to discuss the possibilities and challenges faced by researchers in the music field when conducting practical research on human-machine collaboration for music generation. This discussion has proven crucial in light of the challenges encountered while investigating whether collaborations between composers and AI music generation systems can preserve Brazilian cultural elements in musical outputs. Moreover, it is vital because it addresses both the methodological barriers and the broader implications of integrating AI with cultural and creative expression. The research aims not only to assess the feasibility of such collaborations but also to explore their potential to expand creative and cultural boundaries.