The relation between emotions and music is substantial because music as an art can evoke emotions. Music emotion recognition (MER) studies the emotions that music brings in the effort to map musical features to the affective dimensions. This study conceptualizes the mapping of music and emotion as a multivariate time series regression problem, with the aim of capturing the emotion flow in the Arousal-Valence emotional space. The Efficient Net-Music Informer (ENMI) Network was introduced to address this phenomenon. The ENMI was used to extract Mel-spectrogram features, complementing the time series data. Moreover, the Music Informer model was adopted to train on both time series music features and Mel-spectrogram features to predict emotional sequences. In our regression task, the model achieved a root mean square error (RMSE) of 0.0440 and 0.0352 in the arousal and valence dimensions, respectively, in the DEAM dataset. A comprehensive analysis of the effects of different hyperparameters tuning was conducted. Furthermore, different sequence lengths were predicted for the regression accuracy of the ENMI Network on three different datasets, namely the DEAM dataset, the Emomusic dataset, and the augmented Emomusic dataset. Additionally, a feature ablation on the Mel-spectrogram features and an analysis of the importance of the various musical features in the regression results were performed, establishing the effectiveness of the model presented herein.