Automatic estimation of emotional state has attracted great interest, as emotion is an important component of user-oriented interactive technologies. This paper investigates the use of feed-forward convolutional neural networks (CNNs), and of features extracted from such networks, for predicting the dimensions of continuous-level emotional states. In this context, a two-stream CNN architecture, in which video and audio data are learned simultaneously, is proposed. End-to-end mapping of audiovisual data to emotional dimensions reveals that the two-stream network performs better than its single-stream counterpart. The representations learned by the CNNs are refined through minimum-redundancy maximum-relevance (mRMR) statistical feature selection, and support vector regression applied to the selected CNN-based features then estimates the instantaneous values of the emotional dimensions. The proposed method is trained and tested on audiovisual conversations from the well-known RECOLA and SEMAINE databases. It is experimentally verified that regression on the CNN-based features outperforms both traditional audiovisual affective features and the end-to-end CNN mapping. Generalization experiments further show that the learned representations are robust enough to provide acceptable prediction performance even when the settings of the training and testing datasets differ widely.
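To fix ideas, the following is a minimal sketch of the pipeline the abstract describes, not the paper's implementation: all layer sizes, kernel widths, the names `TwoStreamNet` and `mrmr_select`, the toy data, and the choice of a single target dimension (e.g., arousal) are illustrative assumptions. The sketch fuses a small video stream and audio stream by concatenation, applies a greedy mutual-information-based mRMR selection to the fused features, and fits a support vector regressor on the selected subset.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_selection import mutual_info_regression
from sklearn.svm import SVR

class TwoStreamNet(nn.Module):
    """Video and audio streams learned jointly, fused by concatenation."""
    def __init__(self):
        super().__init__()
        self.video = nn.Sequential(            # per-frame face crops
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.audio = nn.Sequential(            # short raw-waveform snippets
            nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, 9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 1)           # end-to-end regression output

    def forward(self, frames, wave):
        feat = torch.cat([self.video(frames), self.audio(wave)], dim=1)
        return self.head(feat), feat           # prediction and fused features

def mrmr_select(X, y, k):
    """Greedy mRMR: maximize relevance to the target and penalize mean
    redundancy with already-selected columns, both measured via mutual
    information."""
    rel = mutual_info_regression(X, y)
    chosen = [int(np.argmax(rel))]
    rest = [j for j in range(X.shape[1]) if j != chosen[0]]
    while len(chosen) < k:
        def score(j):
            red = np.mean([mutual_info_regression(X[:, [s]], X[:, j])[0]
                           for s in chosen])
            return rel[j] - red
        best = max(rest, key=score)
        chosen.append(best)
        rest.remove(best)
    return chosen

# Toy batch standing in for synchronized audiovisual clips and one
# continuously annotated emotional dimension.
frames = torch.randn(128, 3, 48, 48)
wave = torch.randn(128, 1, 1600)
y = np.random.default_rng(0).normal(size=128)

net = TwoStreamNet().eval()                    # training loop omitted
with torch.no_grad():
    _, feats = net(frames, wave)
X = feats.numpy()

cols = mrmr_select(X, y, k=8)                  # refine the representation
svr = SVR(kernel="rbf", C=1.0).fit(X[:, cols], y)  # frame-level regression
print("selected feature indices:", cols)
```

In this reading of the pipeline, the network's penultimate (fused) layer plays the role of the learned representation: it is trained end to end, but the final predictions come from the SVR fitted on the mRMR-selected subset of those features, mirroring the comparison the abstract reports.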