Video segmentation plays a major role in applications such as autonomous vehicles, medical image analysis, video surveillance, and augmented reality. Portrait segmentation, a subset of semantic image segmentation, is a key preprocessing step in numerous applications, including entertainment, security systems, and video conferencing. For a given video, every object exhibiting independent motion in at least one frame is segmented. This is formulated as a learning problem and designed as an Ensemble Soft Voting Algorithm based on Supervised Machine Learning (ESVA-SML). The motion stream requires an ensemble learning method, trained on synthetic videos, to segment independently moving objects in the optical flow field. Spatial–temporal features are extracted from each sub-band using the Gray Level Size Zone Matrix (GLSZM), which closely resembles the characteristics of the human visual system. The method linearly combines the preceding vectors within a stipulated time interval, thereby describing each spatial–temporal feature vector. Based on the prediction errors, the boundaries of every shot are efficiently identified, from which at least one keyframe is extracted for further analysis. The proposed method is extensively evaluated on the SegTrack-v2, DAVIS, and Fusion databases against eight standard methods across various parameters. Across these three databases, SegTrack-v2 yields the best accuracy, precision, and recall, while both DAVIS and Fusion produce lower MAE; Fusion also achieves the lowest RMSE and RAE with minimal response time.
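The soft-voting ensemble at the core of ESVA-SML can be sketched as follows. This is a minimal illustration using scikit-learn's `VotingClassifier`; the synthetic feature matrix, the choice of base learners, and the binary labels are assumptions standing in for the paper's actual motion/GLSZM feature vectors and training setup.

```python
# Hypothetical sketch of soft-voting ensemble classification (not the
# paper's exact pipeline): the feature vectors and base learners are
# illustrative stand-ins for the spatial-temporal GLSZM features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for per-region spatial-temporal feature vectors and
# foreground/background labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft voting averages the class probabilities predicted by each base
# learner and selects the class with the highest mean probability.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```

Soft voting (averaging probabilities) rather than hard voting (majority of labels) lets a confident base learner outweigh uncertain ones, which is why `SVC` must be constructed with `probability=True` here.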