Learning Visual Representations Research Articles

In recent years, self-supervised learning has emerged as a powerful approach to learning visual representations without requiring extensive manual annotation. One popular technique involves using rotation transformations of images, which provide a clear visual signal for learning semantic representation. However, in this work, we revisit the pretext task of predicting image rotation in self-supervised learning and discover that it tends to marginalise the perception of features located near the centre of an image. To address this limitation, we propose a new self-supervised learning method, namely FullRot, which spotlights underrated regions by resizing the randomly selected and cropped regions of images. Moreover, FullRot increases the complexity of the rotation pretext task by applying the degree-free rotation to the region cropped into a circle. To encourage models to learn from different general parts of an image, we introduce a new data mixture technique called WRMix, which merges two random intra-image patches. By combining these innovative crop and rotation methods with the data mixture scheme, our approach, FullRot + WRMix, surpasses the state-of-the-art self-supervision methods in classification, segmentation, and object detection tasks on ten benchmark datasets with an improvement of up to +13.98% accuracy on STL-10, +8.56% accuracy on CIFAR-10, +10.20% accuracy on Sports-100, +15.86% accuracy on Mammals-45, +15.15% accuracy on PAD-UFES-20, +32.44% mIoU on VOC 2012, +7.62% mIoU on ISIC 2018, +9.70% mIoU on FloodArea, +25.16% AP50 on VOC 2007, and +58.69% AP50 on UTDAC 2020. The code is available at https://github.com/anthonyweidai/FullRot_WRMix.
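For orientation, the conventional rotation pretext task that this abstract revisits can be sketched as follows. This is a minimal illustration of the standard four-way variant (0, 90, 180 and 270 degrees), not FullRot or WRMix; the function names `rotate90` and `rotation_pretext_samples` are hypothetical, and images are represented as plain nested lists to keep the sketch dependency-free.

```python
def rotate90(grid):
    """Rotate a 2-D list of pixels by 90 degrees counter-clockwise."""
    # zip(*grid) transposes the grid; reversing the row order completes the turn.
    return [list(row) for row in zip(*grid)][::-1]


def rotation_pretext_samples(grid):
    """Build (rotated_image, label) pairs for the four-way rotation task.

    Label k means the image was rotated by k * 90 degrees. A classifier
    trained to predict k from the rotated image needs no manual annotation.
    """
    samples = []
    current = [list(row) for row in grid]  # copy so the input is left untouched
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples
```

Calling `rotation_pretext_samples([[1, 2], [3, 4]])` yields four image/label pairs; the degree-free rotation and circular crop described in the abstract would replace the fixed set of four angles used here.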

The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others.

In this dissertation we are interested in learning different visual representations of the human body that are helpful for the visual analysis of humans in images and video sequences. To that end, we analyze both RGB and depth image modalities and address the problem from three different research lines, at different levels of abstraction; from pixels to gestures: human segmentation, human pose estimation and gesture recognition.

First, we show how binary segmentation (object vs. background) of the human body in image sequences is helpful to remove all the background clutter present in the scene. The presented method, based on Graph cuts optimization, enforces spatio-temporal consistency of the produced segmentation masks among consecutive frames. Secondly, we present a framework for multi-label segmentation for obtaining much more detailed segmentation masks: instead of just obtaining a binary representation separating the human body from the background, finer segmentation masks can be obtained separating the different body parts.

At a higher level of abstraction, we aim for a simpler yet descriptive representation of the human body. Human pose estimation methods usually rely on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, appropriately connected following the kinematic constraints of the human body. In practice, such skeletal models must fulfill some constraints in order to allow for efficient inference, while actually limiting the expressiveness of the model. In order to cope with this, we introduce a top-down approach for predicting the position of the body parts in the model, using a mid-level part representation based on Poselets.

Finally, we propose a framework for gesture recognition based on the bag of visual words framework. We leverage the benefits of RGB and depth image modalities by combining modality-specific visual vocabularies in a late fusion fashion. A new rotation-variant depth descriptor is presented, yielding better results than other state-of-the-art descriptors. Moreover, spatio-temporal pyramids are used to encode rough spatial and temporal structure. In addition, we present a probabilistic reformulation of Dynamic Time Warping for gesture segmentation in video sequences. A Gaussian-based probabilistic model of a gesture is learnt, implicitly encoding possible deformations in both spatial and time domains.
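As background for the Dynamic Time Warping component mentioned above, the classic deterministic algorithm can be sketched as below. This is the textbook formulation on one-dimensional sequences with an absolute-difference cost, offered only as an assumption-laden illustration; it is not the probabilistic, Gaussian-based reformulation the dissertation proposes, and the name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Classic Dynamic Time Warping cost between two numeric sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j]; each step
    may advance in a, in b, or in both, which lets the sequences stretch
    or compress in time relative to one another.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed per-frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

A sequence and a time-stretched copy of it align at zero cost, for example `dtw_distance([1, 2, 3], [1, 2, 2, 3])`, which is the property that makes this family of methods attractive for gestures performed at different speeds.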

Articles published on Learning Visual Representations

Weak Augmentation Guided Relational Self-Supervised Learning.

Any region can be perceived equally and effectively on rotation pretext task using full rotation and weighted-region mixture

Hyperbolic Deep Learning in Computer Vision: A Survey

MixIR: Mixing Input and Representations for Contrastive Learning.

Global semantic enhancement network for video captioning

Cross Modal Video Representations for Weakly Supervised Active Speaker Localization

POPAR: Patch Order Prediction and Appearance Recovery for Self-supervised Medical Image Analysis.

Retaining Diverse Information in Contrastive Learning Through Multiple Projectors

OCEAN: Object-centric arranging network for self-supervised visual representations learning

Learning visual representations with optimum-path forest and its applications to Barrett’s esophagus and adenocarcinoma diagnosis

Learning visual and textual representations for multimodal matching and classification

Webly-Supervised Fine-Grained Visual Categorization via Deep Domain Adaptation.

From pixels to gestures: learning visual representations for human analysis in color and depth data sequences

Uncertainty in scene segmentation: Statistically optimal effects on learning visual representations

Learning visual representations for perception-action systems

Learning visual representations with projection pursuit
