- Research Article
- 10.1186/s41074-020-00066-8
- Aug 31, 2020
- IPSJ Transactions on Computer Vision and Applications
- Teppei Miura + 1 more
We address 3D human pose estimation for equirectangular images taken by a wearable omnidirectional camera. The equirectangular image is distorted because the omnidirectional camera is attached close to the front of a person’s neck. Furthermore, some body parts are disconnected in the image; for instance, when a hand goes out at one edge of the image, it comes in from another edge. This distortion and disconnection make 3D pose estimation challenging. To overcome this difficulty, we introduce the location-maps method proposed by Mehta et al.; however, that method had previously been used to estimate 3D human poses only for regular images without distortion or disconnection. We focus on a characteristic of location-maps: they can extend 2D joint locations to 3D positions while preserving 2D-3D consistency, without requiring kinematic model restrictions or optical properties. In addition, we collect a new dataset composed of equirectangular images and synchronized 3D joint positions for training and evaluation. We validate the capability of location-maps to estimate 3D human poses for distorted and disconnected images. We then propose a new location-maps-based model by replacing the backbone network with a state-of-the-art 2D human pose estimation model (HRNet). Our model has a simpler architecture than the reference model proposed by Mehta et al., yet it achieves better performance in both accuracy and computational complexity. Finally, we analyze the location-maps method from two perspectives: the map variance and the map scale. This analysis reveals two characteristics of location-maps: (1) the map variance affects robustness against 2D estimation error when extending 2D joint locations to 3D positions, and (2) the 3D position accuracy is related to the accuracy of the 2D locations relative to the map scale.
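The readout step of the location-maps method can be sketched as follows: for each joint, the network predicts a 2D heatmap plus three location maps (X, Y, Z), and the joint's 3D position is read off the location maps at the heatmap maximum. This is a minimal illustration of that readout, not the authors' full pipeline; the array shapes and values are hypothetical.

```python
import numpy as np

def read_location_maps(heatmap, loc_x, loc_y, loc_z):
    """Read a joint's 3D position from its location maps at the
    2D heatmap maximum (the readout step of the location-maps method)."""
    # 2D joint location = argmax of the heatmap
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # 3D position = values of the X/Y/Z location maps at that pixel
    return np.array([loc_x[v, u], loc_y[v, u], loc_z[v, u]])

# toy example: a single joint on an 8x8 map
H = np.zeros((8, 8)); H[3, 5] = 1.0
X = np.full((8, 8), 0.2); Y = np.full((8, 8), -0.1); Z = np.full((8, 8), 1.5)
pos = read_location_maps(H, X, Y, Z)
```

Because the readout only indexes maps at a 2D location, it applies unchanged to distorted or disconnected equirectangular images, which is the property the paper exploits.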
- Research Article
- 10.1186/s41074-020-00065-9
- Aug 31, 2020
- IPSJ Transactions on Computer Vision and Applications
- Takumi Nakane + 5 more
Evolutionary algorithms (EAs) and swarm algorithms (SAs) have shown their usefulness in solving combinatorial and NP-hard optimization problems in various research fields. However, in the field of computer vision, related surveys have not been updated in the last decade. In this study, motivated by the recent development of deep neural networks in computer vision, which embody large-scale optimization problems, we conduct a literature survey to fill this gap. Specifically, the survey mainly covers applications of the genetic algorithm and differential evolution from EAs, as well as particle swarm optimization and ant colony optimization from SAs, together with their variants.
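As a concrete instance of one surveyed algorithm, here is a minimal differential evolution (DE/rand/1/bin) sketch minimizing a toy objective. The population size, mutation factor F, and crossover rate CR are illustrative defaults, not values from the survey.

```python
import numpy as np

def differential_evolution(f, bounds, pop=20, F=0.5, CR=0.9, iters=200, seed=0):
    """Minimal DE/rand/1/bin: mutate with a scaled difference vector,
    apply binomial crossover, then greedy selection."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(pop, len(lo)))
    fit = np.array([f(x) for x in X])
    for _ in range(iters):
        for i in range(pop):
            # three distinct individuals other than i
            a, b, c = X[rng.choice([j for j in range(pop) if j != i], 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(len(lo)) < CR
            cross[rng.integers(len(lo))] = True  # ensure at least one gene crosses
            trial = np.where(cross, mutant, X[i])
            ft = f(trial)
            if ft < fit[i]:                      # greedy selection
                X[i], fit[i] = trial, ft
    return X[np.argmin(fit)], fit.min()

best, val = differential_evolution(lambda x: np.sum(x**2), [(-5, 5)] * 2)
```

In vision applications the objective `f` would typically score a candidate solution such as a segmentation, a set of hyperparameters, or a feature subset.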
- Research Article
- 10.1186/s41074-020-00064-w
- Jul 2, 2020
- IPSJ Transactions on Computer Vision and Applications
- Yasuhiro Yao + 4 more
Manually labelling point cloud scenes for use as training data in machine learning applications is a time- and labour-intensive task. In this paper, we aim to reduce the effort associated with learning semantic segmentation tasks by introducing a semi-supervised method that operates on scenes with only a small number of labelled points. For this task, we advocate the use of pseudo-labelling in combination with PointNet, a neural network architecture for point cloud classification and segmentation. We also introduce a method for incorporating information derived from spatial relationships to aid in the pseudo-labelling process. This approach has practical advantages over current methods by working directly on point clouds and not being reliant on predefined features. Moreover, we demonstrate competitive performance on scenes from three publicly available datasets and provide studies on parameter sensitivity.
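The pseudo-labelling idea with a spatial cue can be sketched as follows: a point receives a pseudo-label only when the classifier is confident and its nearest spatial neighbours agree. This is a simplified stand-in (brute-force neighbours, a plain confidence threshold) for the paper's method; the threshold and `k` are hypothetical.

```python
import numpy as np

def pseudo_label(points, probs, conf_thresh=0.9, k=5):
    """Assign pseudo-labels to points whose predicted class probability is
    high AND agrees with the majority of their k nearest neighbours
    (a simple stand-in for the spatial-relationship cue)."""
    pred = probs.argmax(1)
    conf = probs.max(1)
    labels = np.full(len(points), -1)              # -1 = still unlabelled
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    nn = d.argsort(1)[:, 1:k + 1]                  # k nearest neighbours (skip self)
    for i in range(len(points)):
        vals, counts = np.unique(pred[nn[i]], return_counts=True)
        majority = vals[counts.argmax()]
        if conf[i] >= conf_thresh and pred[i] == majority:
            labels[i] = pred[i]
    return labels

# toy scene: two well-separated clusters with confident predictions
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(5, 0.1, (10, 3))])
probs = np.zeros((20, 2))
probs[:10, 0] = 0.95; probs[:10, 1] = 0.05
probs[10:, 1] = 0.95; probs[10:, 0] = 0.05
labels = pseudo_label(pts, probs)
```

In the full method these pseudo-labels would then be added to the training set for the next round of PointNet training.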
- Research Article
- 10.1186/s41074-020-00063-x
- Apr 7, 2020
- IPSJ Transactions on Computer Vision and Applications
- Takahiro Kushida + 4 more
Phase ambiguity is a major problem in depth measurement, whether based on time-of-flight or phase shifting. Resolving the ambiguity with a low-frequency pattern sacrifices depth resolution, while using multiple frequencies requires a large number of observations. In this paper, we propose a phase disambiguation method that combines temporal and spatial modulation so that high depth resolution is preserved while the number of observations is kept small. A key observation is that the phase ambiguities of the temporal and spatial domains appear differently with respect to depth. Using this difference, the phase can be disambiguated over a wider range of interest. We develop a prototype and show the effectiveness of our method through real-world experiments.
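The key observation admits a simple illustration: two phase measurements that wrap with different periods single out one depth when their candidate sets are intersected, in the spirit of the Chinese remainder theorem. This sketch uses hypothetical wrapping periods and a brute-force candidate search; the paper's actual temporal/spatial modulation model is more involved.

```python
import numpy as np

def disambiguate(phase_t, wl_t, phase_s, wl_s, d_max):
    """Resolve phase ambiguity from two measurements with different
    wrapping periods: enumerate candidate depths for each phase and
    pick the pair that agrees best."""
    cand_t = [phase_t / (2 * np.pi) * wl_t + n * wl_t
              for n in range(int(d_max // wl_t) + 1)]
    cand_s = [phase_s / (2 * np.pi) * wl_s + m * wl_s
              for m in range(int(d_max // wl_s) + 1)]
    # best-agreeing pair; return the averaged depth estimate
    return min(((abs(a - b), (a + b) / 2) for a in cand_t for b in cand_s))[1]

# a true depth of 3.7 wraps differently under periods 1.0 and 1.3
true_d = 3.7
d = disambiguate((true_d % 1.0) * 2 * np.pi / 1.0, 1.0,
                 (true_d % 1.3) * 2 * np.pi / 1.3, 1.3, d_max=5.0)
```

Each phase alone is ambiguous (six and four candidates respectively), but only one candidate from each set coincides, recovering the true depth.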
- Research Article
- 10.1186/s41074-019-0062-2
- Nov 29, 2019
- IPSJ Transactions on Computer Vision and Applications
- Yang Yu + 2 more
We address pedestrian segmentation in video in a spatio-temporally consistent way. For this purpose, given a bounding box sequence for each pedestrian obtained by a conventional pedestrian detector and tracker, we construct a spatio-temporal graph on the video and segment each pedestrian on the basis of a well-established graph-cut segmentation framework. More specifically, we consider three terms in the energy function for the graph-cut segmentation: (1) a data term, (2) a spatial pairwise term, and (3) a temporal pairwise term. To maintain better temporal consistency of segmentation even under relatively large motions, we introduce a transportation-minimization framework that provides temporal correspondences. Moreover, we introduce the edge-sticky superpixel to maintain the spatial consistency of object boundaries. In experiments, we demonstrate that the proposed method improves segmentation accuracy indices, such as the average and weighted intersection over union, on the TUD datasets and the PETS2009 dataset at both the instance level and the semantic level.
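The three-term energy can be written down directly. The following sketch evaluates such an energy for a binary labelling on a small graph with Potts-style pairwise penalties; the weights and graph structure are hypothetical, and graph-cut inference itself (min-cut) is omitted.

```python
def segmentation_energy(labels, unary, spatial_edges, temporal_edges,
                        lam_s=1.0, lam_t=1.0):
    """Energy of a binary labelling: data term + spatial pairwise term
    + temporal pairwise term (penalties on disagreeing neighbours)."""
    data = sum(unary[i][labels[i]] for i in range(len(labels)))
    spatial = sum(w for (i, j, w) in spatial_edges if labels[i] != labels[j])
    temporal = sum(w for (i, j, w) in temporal_edges if labels[i] != labels[j])
    return data + lam_s * spatial + lam_t * temporal

# toy graph: 3 nodes, chain of spatial edges, no temporal edges
unary = [[0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]   # cost of label 0 / label 1 per node
energy = segmentation_energy([0, 1, 0], unary,
                             spatial_edges=[(0, 1, 0.5), (1, 2, 0.5)],
                             temporal_edges=[])
```

In the paper, the temporal pairwise term connects corresponding superpixels across frames, with correspondences supplied by the transportation-minimization step.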
- Research Article
- 10.1186/s41074-019-0061-3
- Nov 20, 2019
- IPSJ Transactions on Computer Vision and Applications
- Md Zasim Uddin + 4 more
Gait-based features provide the potential for a subject to be recognized even from a low-resolution image sequence, and they can be captured at a distance without the subject’s cooperation. Person recognition using gait-based features (gait recognition) is therefore a promising real-life application. However, several body parts of a subject are often occluded by beams, pillars, cars, trees, or another walking person. Consequently, approaches that require an unoccluded gait image sequence are not applicable in such cases. Occlusion handling is a challenging but important issue for gait recognition. In this paper, we propose silhouette sequence reconstruction from an occluded sequence (sVideo) based on a conditional deep generative adversarial network (GAN). From the reconstructed sequence, we estimate the gait cycle and extract the gait features from a single gait cycle image sequence. To regularize the training of the proposed generative network, we use an adversarial loss based on a triplet hinge loss incorporated into a Wasserstein GAN (WGAN-hinge). To the best of our knowledge, WGAN-hinge is the first adversarial loss that supervises the generator network during training by incorporating pairwise similarity ranking information. The proposed approach was evaluated on multiple challenging occlusion patterns. The experimental results demonstrate that the proposed approach outperforms existing state-of-the-art benchmarks.
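The hinge-based adversarial objective can be illustrated in simplified form: the critic is trained to keep real scores above fake scores by at least a margin, and the loss is zero once the gap is large enough. This is a generic hinge on the critic score gap, not the paper's exact WGAN-hinge formulation, which additionally incorporates pairwise similarity ranking; the margin value is hypothetical.

```python
import numpy as np

def hinge_critic_loss(d_real, d_fake, margin=1.0):
    """Hinge-based critic loss: penalize pairs whose real-vs-fake score
    gap is below `margin`; saturated (zero) once the gap exceeds it."""
    return np.maximum(0.0, margin - (d_real - d_fake)).mean()

# gaps of 2.0 and 1.3 both exceed the margin, so the loss vanishes
loss_ok = hinge_critic_loss(np.array([2.0, 1.5]), np.array([0.0, 0.2]))
# a gap of only 0.5 leaves a residual penalty of 0.5
loss_bad = hinge_critic_loss(np.array([0.5]), np.array([0.0]))
```

The saturation is what distinguishes hinge losses from the plain Wasserstein objective, which keeps pushing the score gap without bound.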
- Research Article
- 10.1186/s41074-019-0060-4
- Nov 4, 2019
- IPSJ Transactions on Computer Vision and Applications
- Masaki Kaga + 5 more
This paper presents a non-line-of-sight technique to estimate the position and temperature of an occluded object from a camera via reflection on a wall. Because heated objects emit far-infrared light according to their temperature, both position and temperature can be estimated from reflections on a wall. A key idea is that the light paths from a hidden object to the camera depend on the position of the hidden object. The position of the object is recovered from the angular distribution of the specular and diffuse reflection components, and the temperature of the heat source is recovered from the estimated position and the intensity of the reflection. The effectiveness of our method is evaluated through real-world experiments, showing that the position and temperature of a hidden object can be recovered from its reflection on a wall using a conventional thermal camera.
- Research Article
- 10.1186/s41074-019-0059-x
- Jul 24, 2019
- IPSJ Transactions on Computer Vision and Applications
- Eren Unlu + 3 more
The commercial unmanned aerial vehicle (UAV) industry, popularly known as the drone industry, has grown tremendously in the last few years, making these devices highly accessible to the public. This has immediately raised security concerns, since these devices can intentionally or unintentionally cause serious hazards. To protect critical locations, academia and industry have proposed several solutions in recent years. Compared with other proposed solutions such as RADAR, acoustics, and RF signal analysis, computer vision is extensively used to detect drones autonomously thanks to its robustness, and among computer-vision-based approaches, deep learning algorithms are preferred for their effectiveness. In this paper, we present an autonomous drone detection and tracking system that uses a static wide-angle camera and a lower-angle camera mounted on a rotating turret. To use memory and time efficiently, we propose a combined multi-frame deep learning detection technique in which the frame from the zoomed camera on the turret is overlaid on the frame from the static wide-angle camera. With this approach, we build an efficient pipeline in which the initial detection of small aerial intruders on the main image plane and their detection on the zoomed image plane are performed simultaneously, minimizing the cost of the resource-intensive detection algorithm. In addition, we present the complete system, including the tracking algorithms, deep learning classification architectures, and protocols.
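The multi-frame trick amounts to compositing the two feeds into one image so a single detector pass covers both image planes. A minimal sketch, assuming the zoomed frame is pasted into the top-left corner (the actual placement and blending in the paper may differ):

```python
import numpy as np

def overlay_frames(wide, zoomed):
    """Combine the two camera feeds into one frame so a single detector
    pass covers both image planes; the zoomed view is pasted into the
    top-left corner (hypothetical placement)."""
    out = wide.copy()                 # keep the wide-angle frame intact
    h, w = zoomed.shape[:2]
    out[:h, :w] = zoomed              # paste the zoomed view into the corner
    return out

wide = np.zeros((100, 100, 3))        # toy wide-angle frame
zoomed = np.ones((20, 30, 3))         # toy zoomed frame from the turret camera
combined = overlay_frames(wide, zoomed)
```

Detections falling inside the pasted region are then attributed to the turret camera, so one forward pass of the expensive network serves both detection tasks.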
- Research Article
- 10.1186/s41074-019-0058-y
- Jul 17, 2019
- IPSJ Transactions on Computer Vision and Applications
- Yuki Shiba + 4 more
The combination of a pattern projector and a camera is widely used for 3D measurement. To recover shape from a captured image, various kinds of depth cues are extracted from the projected patterns in the image, such as disparities for active stereo or blurriness for depth from defocus. Recently, several techniques have been proposed to improve 3D quality using multiple depth cues, by installing coded apertures in projectors or by increasing the number of projectors. However, the superposition of projected patterns forms a complicated light field in 3D space, which makes analyzing the captured images challenging. In this paper, we propose a learning-based technique to extract depth information from such a light field with multiple depth cues. In the learning phase, prior to the 3D measurement of unknown scenes, projected patterns as they appear at various depths are prepared not only from actual images but also from images generated virtually using computer graphics and geometric calibration results. Then, we use principal component analysis (PCA) to extract features of small patches. In the 3D measurement (reconstruction) phase, the same features are extracted from patches of a captured image of the target scene and compared with the learned data. Thanks to the dimensionality reduction provided by feature extraction, an efficient search algorithm, such as approximate nearest neighbor (ANN) search, can be used for the matching process. Another important advantage of our learning-based approach is that we can use most known projection patterns without changing the algorithm.
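The learn-then-match pipeline can be sketched in three steps: fit a PCA basis to training patches, project patches into the low-dimensional feature space, and match a query patch to the nearest learned feature to look up its depth. Here brute-force nearest neighbour stands in for the ANN index, and patch sizes and component counts are hypothetical.

```python
import numpy as np

def pca_fit(patches, n_comp=8):
    """Learn a PCA basis from flattened training patches."""
    X = patches.reshape(len(patches), -1).astype(float)
    mean = X.mean(0)
    # principal axes from the SVD of the centred data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_comp]

def pca_features(patches, mean, basis):
    """Project patches into the low-dimensional PCA feature space."""
    return (patches.reshape(len(patches), -1) - mean) @ basis.T

def match_depth(query_feat, learned_feats, learned_depths):
    """Brute-force nearest neighbour in feature space (an ANN index such
    as a k-d tree would replace this in practice)."""
    i = np.linalg.norm(learned_feats - query_feat, axis=1).argmin()
    return learned_depths[i]

# toy training set: 50 random 5x5 patches, each tagged with a depth label
rng = np.random.default_rng(0)
patches = rng.random((50, 5, 5))
depths = np.arange(50)
mean, basis = pca_fit(patches)
feats = pca_features(patches, mean, basis)
```

Querying with a patch seen during learning recovers its own depth label exactly, since its feature distance to itself is zero.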
- Research Article
- 10.1186/s41074-019-0057-z
- Jun 25, 2019
- IPSJ Transactions on Computer Vision and Applications
- Pramod Murthy + 4 more
Realistic estimation and synthesis of articulated human motion must satisfy anatomical constraints on joint angles. We use a data-driven approach to learn human joint limits from 3D motion capture datasets. We represent joint constraints with a new formulation, (s1, s2, τ), based on the swing-twist representation in exponential-map form. Our parameterization is applied to the Human3.6M dataset to create a lookup map for each joint. These maps enable us to generate ‘synthetic’ datasets covering the entire rotation space of a given joint. A set of neural network discriminators is then trained on the synthetic datasets to learn valid/invalid joint rotations. The discriminators achieve accuracies of 94.4–99.4% for different joints. We validate the precision-accuracy trade-off of the discriminators and qualitatively evaluate classified poses with an interactive tool. The learned discriminators can be used as ‘priors’ for human pose estimation and motion synthesis.
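The swing-twist idea underlying the (s1, s2, τ) parameterization can be sketched with quaternions: any rotation factors into a twist about a chosen axis and a swing perpendicular to it. This is the standard decomposition, shown here as an illustration of the representation's basis rather than the paper's exact exponential-map formulation.

```python
import numpy as np

def swing_twist(q, axis):
    """Decompose a unit quaternion q = swing * twist, where twist is the
    rotation about `axis` and swing is the remaining rotation."""
    w, v = q[0], q[1:]
    axis = axis / np.linalg.norm(axis)
    proj = np.dot(v, axis) * axis            # vector part along the twist axis
    twist = np.array([w, *proj])
    twist /= np.linalg.norm(twist)
    # swing = q * conjugate(twist), by quaternion multiplication
    tw, tv = twist[0], -twist[1:]
    swing = np.array([w * tw - v @ tv, *(w * tv + tw * v + np.cross(v, tv))])
    return swing, twist

# a pure 90-degree rotation about z decomposes into identity swing + itself as twist
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
swing, twist = swing_twist(q, np.array([0.0, 0.0, 1.0]))
```

In the paper's setting, the swing would be encoded by two parameters (s1, s2) and the twist by an angle τ, giving a compact space over which per-joint validity lookup maps can be built.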