Extracting an object of interest from a single video remains difficult when the object has a variegated appearance, undergoes articulated motion, or is occluded by other objects. In this paper, we present a video cosegmentation method that addresses these challenges. Departing from the objectness attributes and motion coherence used by traditional foreground–background separation and video segmentation methods, we place central importance on the role of “common fate”: the different parts of the object should persist together across all the videos, despite the possible presence of incoherent (e.g., articulated) motions. To realize this idea, we first extract seed superpixels with a motion-based foreground segmentation method. We then formulate a set of initial to-link constraints between these superpixels according to whether they exhibit the characteristics of common fate, and we propose an iterative manifold ranking algorithm to trim away incorrect and accidental linkages. Having discovered the parts that should cohere, we perform clustering to extract the entire object and to handle the case in which multiple objects are present. This clustering operates at two levels, the superpixel level and the object level, and it performs automatic model selection to estimate the number of object classes. Finally, a multiclass labeling Markov random field refines the segmentation. To evaluate our framework, we introduce a new data set in which the videos have complex form and motion that are prone to ambiguous interpretation. Experimental results on this data set show that our method successfully extracts complex objects and outperforms state-of-the-art video segmentation and cosegmentation methods.
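For concreteness, the ranking step can be grounded in the standard manifold-ranking propagation of Zhou et al., on top of which an iterative constraint-pruning variant can be built. The sketch below is a minimal, illustrative implementation over a superpixel affinity graph; the names `W`, `y`, and `alpha` and the fixed iteration count are assumptions for exposition, not the paper's actual implementation, which additionally re-estimates the to-link constraints between ranking rounds.

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.99, n_iter=50):
    """Propagate seed scores over a superpixel affinity graph.

    W : (n, n) symmetric, nonnegative affinity matrix between superpixels
    y : (n,)   seed indicator vector (1 for seed superpixels, 0 otherwise)
    """
    # Symmetrically normalize the affinities: S = D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Iterate f <- alpha * S f + (1 - alpha) * y (Zhou et al.'s update)
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * (S @ f) + (1 - alpha) * y
    return f

# Toy example: rank four superpixels from a single seed.
W = np.array([[0.0, 1.0, 0.2, 0.0],
              [1.0, 0.0, 0.3, 0.0],
              [0.2, 0.3, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])
print(manifold_ranking(W, y))
```

The closed-form solution f = (1 − α)(I − αS)⁻¹y also exists, but the iterative update is usually preferable on large superpixel graphs because it requires only sparse matrix–vector products rather than a matrix inverse.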