In this study, Multivariate Empirical Mode Decomposition (MEMD) is applied to multichannel EEG signals to obtain scale-aligned intrinsic mode functions (IMFs) as input features for emotion detection. The IMFs capture local signal variations related to emotional changes. Among the extracted IMFs, the highly oscillatory ones are found to be the most informative for the intended task. The marginal Hilbert spectrum (MHS) is computed from the selected IMFs. A 3D convolutional neural network (CNN) is adopted to perform emotion detection on spatial-temporal-spectral feature representations, which are constructed by stacking the multichannel MHS over consecutive signal segments. The proposed approach is evaluated on the publicly available DEAP database. On binary classification of valence and arousal levels (high versus low), the attained accuracies are 89.25% and 86.23%, respectively, significantly outperforming previously reported systems based on 2D CNNs and/or conventional temporal and spectral features.
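The MHS-stacking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the IMFs are already available (the paper obtains them via MEMD; here a single placeholder "IMF" per channel stands in), and the channel count, segment count, segment length, sampling rate, and number of frequency bins are all hypothetical values chosen only for the demonstration. The MHS of each segment/channel is computed from the Hilbert transform's instantaneous amplitude and frequency, and the per-channel spectra are stacked over consecutive segments into a 3D tensor of shape (time, space, spectrum).

```python
import numpy as np
from scipy.signal import hilbert

def marginal_hilbert_spectrum(imfs, fs, n_bins=64, f_max=None):
    """Marginal Hilbert spectrum: total amplitude per instantaneous-frequency bin.

    imfs -- array of shape (n_imfs, n_samples)
    fs   -- sampling rate in Hz
    Returns an (n_bins,) nonnegative amplitude vector.
    """
    f_max = f_max if f_max is not None else fs / 2.0
    bin_edges = np.linspace(0.0, f_max, n_bins + 1)
    mhs = np.zeros(n_bins)
    for imf in imfs:
        analytic = hilbert(imf)                      # analytic signal
        amp = np.abs(analytic)                       # instantaneous amplitude
        phase = np.unwrap(np.angle(analytic))        # instantaneous phase
        inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency (Hz)
        # Marginalize over time: accumulate amplitude into frequency bins
        idx = np.digitize(inst_freq, bin_edges) - 1
        valid = (idx >= 0) & (idx < n_bins)
        np.add.at(mhs, idx[valid], amp[:-1][valid])
    return mhs

# Toy stacking: 32 channels, 10 consecutive segments of 256 samples at 128 Hz
fs, n_ch, n_seg, seg_len, n_bins = 128, 32, 10, 256, 64
rng = np.random.default_rng(0)
feature = np.zeros((n_seg, n_ch, n_bins))  # (time, space, spectrum) tensor
for s in range(n_seg):
    for ch in range(n_ch):
        sig = rng.standard_normal(seg_len)  # stand-in for one EEG segment
        imfs = sig[None, :]                 # placeholder: real pipeline uses MEMD IMFs
        feature[s, ch] = marginal_hilbert_spectrum(imfs, fs, n_bins=n_bins)
print(feature.shape)  # (10, 32, 64)
```

The resulting tensor is the kind of spatial-temporal-spectral representation that a 3D CNN can consume directly, with convolutions spanning segments, channels, and frequency bins.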