Abstract
Recent advances in deep learning have propelled the development of AI systems capable of generating music that resonates with human emotions and preferences. However, current music generation models still struggle to align their output with detailed textual descriptions and to maintain consistency, especially over longer compositions. This paper presents an approach that addresses these challenges by integrating genre classification and retrieval-augmented generation (RAG) into the music generation pipeline. We train established CNN architectures, including ResNet-50, GoogLeNet, and VGG16, for accurate genre classification. The classifier is then incorporated into a RAG framework, in which the most relevant pre-classified music piece is retrieved for a given input text query. The retrieved audio and the text description are then fed into the MUSICGEN model to generate a new piece that inherits attributes from both inputs. We evaluate our system through a double-blind human study comparing the outputs of the original MUSICGEN model with those of our RAG-enhanced model. The results show a marked improvement in the RAG-enhanced model's ability to generate music embodying specific stylistic elements, as evidenced by higher average confidence scores from participants. Our work represents a significant step towards more personalized and context-aware AI-generated musical experiences, laying the foundation for future advances in this field.
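A minimal sketch of the retrieval-augmented generation step described above, assuming the melody-conditioned MusicGen checkpoint from Meta's audiocraft library; `classify_genre` and `GENRE_LIBRARY` are hypothetical stand-ins for the trained CNN classifier and the pre-classified audio corpus, not the paper's actual implementation.

```python
# Sketch of the RAG-enhanced pipeline: retrieve a genre-matched reference clip,
# then condition MusicGen on both the text query and the retrieved audio.
# Assumptions: audiocraft's melody-conditioned MusicGen; `classify_genre` and
# `GENRE_LIBRARY` are hypothetical placeholders for the classifier and corpus.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Hypothetical: genre label -> paths of pre-classified reference clips.
GENRE_LIBRARY = {"jazz": ["library/jazz_001.wav"], "rock": ["library/rock_001.wav"]}

def classify_genre(text_query: str) -> str:
    """Hypothetical stand-in for the CNN-based genre classification step."""
    return "jazz" if "jazz" in text_query.lower() else "rock"

def generate_with_rag(text_query: str, duration_s: int = 10, out_stem: str = "rag_sample"):
    # 1. Retrieve the most relevant pre-classified clip for the text query.
    genre = classify_genre(text_query)
    melody, sr = torchaudio.load(GENRE_LIBRARY[genre][0])

    # 2. Condition MusicGen on the text description and the retrieved audio.
    model = MusicGen.get_pretrained("facebook/musicgen-melody")
    model.set_generation_params(duration=duration_s)
    wav = model.generate_with_chroma(
        descriptions=[text_query],
        melody_wavs=melody[None],   # batch of one reference melody
        melody_sample_rate=sr,
    )

    # 3. Write out the generated piece, which inherits attributes from both inputs.
    audio_write(out_stem, wav[0].cpu(), model.sample_rate, strategy="loudness")

generate_with_rag("an upbeat jazz piece with brushed drums and a walking bass line")
```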