• An end-to-end learning structure for Stereoscopic Video Quality Assessment with three pipelines that extract quality degradation from spatial, temporal, and disparity information.
• A non-homogeneous data augmentation system that splits the stereoscopic video into small cube patches and removes outliers from the created dataset using entropy criteria.
• A robust, depth-quality-aware deep feature extractor that computes motion and depth maps jointly and feeds them synchronously to the deep network to extract quality-discriminative features.
• Validation of the proposed learning architecture on two popular Stereoscopic Video Quality Assessment datasets, achieving state-of-the-art results on both.

Convolutional Neural Networks (CNNs) have achieved great success in computer vision tasks; in particular, 3D CNNs are effective at extracting spatio-temporal features from videos. However, 3D CNNs have not been well examined for Stereoscopic Video Quality Assessment (SVQA). To the best of our knowledge, most state-of-the-art methods rely on traditional hand-crafted feature extraction for SVQA. The few methods that exploit deep learning consider only spatial information, ignoring disparity and motion. In this paper, we propose a No-Reference (NR) deep 3D CNN architecture that jointly exploits spatial, disparity, and temporal information between consecutive frames. The 3-Stream 3D CNN, abbreviated 3S-3DCNN, applies 3D convolutions to extract features from spatial, motion, and depth channels, capturing quality degradations of the stereoscopic video along multiple dimensions. First, the scene flow, i.e., the joint prediction of optical flow and stereo disparity, is calculated. Then, the spatial information, optical flow, and disparity map of a given video are used as inputs to the 3S-3DCNN model. The extracted features are concatenated and passed to fully connected layers that regress the quality score. For data augmentation, we split the input videos into cube patches and remove the cubes that confuse our model from the training and testing sets. Two standard SVQA benchmarks, LFOVIAS3DPh2 and NAMA3DS1-COSPAD1, are used to evaluate our method. Experimental results show that the objective scores of 3S-3DCNN correlate strongly with the subjective quality scores on both datasets. The RMSE on the NAMA3DS1-COSPAD1 dataset is 0.2757, outperforming competing methods by a large margin. The SROCC for the blur distortion of the LFOVIAS3DPh2 dataset exceeds 0.98, indicating that 3S-3DCNN is consistent with human visual perception.
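To make the three-stream design concrete, the following is a minimal PyTorch sketch of a 3-stream 3D CNN that pools per-stream features, concatenates them, and regresses a quality score. The stream depths, kernel counts, channel assignments, patch resolution, and fully connected sizes are illustrative assumptions, not the paper's exact 3S-3DCNN configuration.

```python
# Minimal sketch of a 3-stream 3D CNN for NR-SVQA (illustrative sizes only).
import torch
import torch.nn as nn


def conv3d_block(in_ch, out_ch):
    """One 3D convolution + ReLU + pooling stage (hypothetical layer sizes)."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=2),
    )


class Stream3D(nn.Module):
    """A single 3D CNN stream (spatial, motion, or depth channel)."""

    def __init__(self, in_ch):
        super().__init__()
        self.features = nn.Sequential(
            conv3d_block(in_ch, 16),
            conv3d_block(16, 32),
            conv3d_block(32, 64),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> one 64-d vector
        )

    def forward(self, x):
        return self.features(x).flatten(1)


class ThreeStream3DCNN(nn.Module):
    """Concatenates the three stream features and regresses a quality score."""

    def __init__(self):
        super().__init__()
        self.spatial = Stream3D(in_ch=3)   # RGB frames
        self.motion = Stream3D(in_ch=2)    # optical-flow components (u, v)
        self.depth = Stream3D(in_ch=1)     # disparity map
        self.regressor = nn.Sequential(
            nn.Linear(64 * 3, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 1),             # predicted quality score
        )

    def forward(self, frames, flow, disparity):
        # Each input is (batch, channels, time, height, width).
        feats = torch.cat(
            [self.spatial(frames), self.motion(flow), self.depth(disparity)],
            dim=1,
        )
        return self.regressor(feats)


# Example forward pass on a single cube patch of 16 frames at 64x64 pixels.
model = ThreeStream3DCNN()
frames = torch.randn(1, 3, 16, 64, 64)
flow = torch.randn(1, 2, 16, 64, 64)
disparity = torch.randn(1, 1, 16, 64, 64)
score = model(frames, flow, disparity)  # shape (1, 1)
```

In this sketch the fusion is a simple concatenation of globally pooled stream features before the regression head, matching the late-fusion description in the text; the paper's actual feature dimensions and pooling strategy may differ.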
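The cube-patch augmentation with entropy-based outlier removal can be sketched as below, assuming NumPy. The patch size, the use of mean frame entropy, and the kept-entropy range are hypothetical placeholders for the paper's actual entropy criteria.

```python
# Illustrative cube-patch splitting and entropy-based outlier filtering.
import numpy as np


def frame_entropy(gray_frame, bins=256):
    """Shannon entropy (bits) of an 8-bit grayscale frame or patch."""
    hist, _ = np.histogram(gray_frame, bins=bins, range=(0, 256))
    prob = hist / hist.sum()
    prob = prob[prob > 0]
    return float(-np.sum(prob * np.log2(prob)))


def split_into_cubes(video, cube_t=16, cube_hw=64):
    """Split a (T, H, W) grayscale video into non-overlapping cube patches."""
    t, h, w = video.shape
    cubes = []
    for ti in range(0, t - cube_t + 1, cube_t):
        for yi in range(0, h - cube_hw + 1, cube_hw):
            for xi in range(0, w - cube_hw + 1, cube_hw):
                cubes.append(video[ti:ti + cube_t, yi:yi + cube_hw, xi:xi + cube_hw])
    return cubes


def filter_cubes_by_entropy(cubes, low=1.0, high=7.5):
    """Keep cubes whose mean frame entropy lies in [low, high];
    very flat (low-entropy) or noise-like (high-entropy) cubes are
    discarded as outliers (thresholds are hypothetical)."""
    kept = []
    for cube in cubes:
        mean_entropy = np.mean([frame_entropy(frame) for frame in cube])
        if low <= mean_entropy <= high:
            kept.append(cube)
    return kept


# Example: an 8-bit synthetic video of 64 frames at 256x256 pixels.
video = np.random.randint(0, 256, size=(64, 256, 256), dtype=np.uint8)
cubes = split_into_cubes(video)
kept = filter_cubes_by_entropy(cubes)
print(f"kept {len(kept)} of {len(cubes)} cube patches")
```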