Music generation with AI models such as the Transformer is an active research area. Existing methods often suffer from poor coherence and high computational cost. To address these problems, we propose TARREAN, a novel Transformer-based model that incorporates a gated recurrent unit with root mean square norm restriction. The gated recurrent unit (GRU) improves the temporal coherence of generated music by strengthening the model's ability to capture dependencies between sequential elements. In addition, we apply masked multi-head attention to prevent the model from accessing future information during training, preserving the causal structure of music sequences. To reduce computational overhead, we introduce root mean square layer normalization (RMSNorm), which smooths gradients and simplifies the computation, thereby improving training efficiency. Music sequences are encoded with a compound-word method that converts them into discrete symbol-event combinations before they are fed to TARREAN. The proposed method effectively mitigates discontinuities in generated music and improves generation quality. We evaluated the model on the Essen Associative Code and Folk Song Database, which contains 20,000 folk melodies from Germany, Poland, and China. Subjective evaluation shows that our model produces music better aligned with human preferences: TARREAN achieved a satisfaction score of 4.34, significantly higher than the 3.79 of the Transformer-XL + REMI baseline. Objective evaluation likewise showed a 15% improvement in temporal coherence over traditional methods. Together, the objective and subjective results demonstrate that TARREAN significantly improves generation coherence while reducing computational cost.
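For concreteness, the masked multi-head attention mentioned above can be illustrated with a standard upper-triangular causal mask. The sketch below is a minimal, generic PyTorch example with toy dimensions, not the authors' actual TARREAN configuration:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal marks future positions each token may not attend to
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Toy sizes for illustration only; TARREAN's dimensions are not given in the abstract
attn = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 16, 512)                        # (batch, time, features)
out, _ = attn(x, x, x, attn_mask=causal_mask(16))  # each step sees only its past
```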
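Likewise, RMSNorm drops LayerNorm's mean-centering step and rescales activations by their root mean square alone, which removes one reduction per call. The following is a minimal sketch of the published RMSNorm formula (with a placeholder epsilon), not code from the paper:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square layer normalization (Zhang & Sennrich, 2019)."""

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale by the RMS over the feature dimension; no centering, unlike LayerNorm
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.scale
```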