Professor Ouk Choi from Incheon National University, Korea, talks to Electronics Letters about the paper ‘Robust alternating optimization for extrinsic calibration of RGB-D cameras’, page 992.

My research field is computer vision, and I am currently interested in the acquisition, processing, and understanding of 3D data. During my Ph.D. studies, my research focused on highly ambiguous correspondence problems, such as matching windows in image pairs of buildings. After receiving my Ph.D. degree, I joined the Samsung Advanced Institute of Technology, where I conducted research projects on time-of-flight cameras. I was intrigued by the fact that time-of-flight cameras can acquire 3D data without solving difficult correspondence problems, and since then my research has been directed toward 3D computer vision. Nowadays, I am building systems that can acquire multi-view depth images of subjects, so that the acquired datasets can be used for machine learning and other research purposes.

Early time-of-flight cameras had many problems, such as depth noise, low resolution, and phase wrapping. Many of these problems seem to have been solved in commercially available RGB-D cameras such as the Kinect v2. In addition, such cameras are factory calibrated, so their intrinsic parameters are available without any calibration effort. However, if we want to use multiple RGB-D cameras with their 3D measurements represented in a common reference coordinate system, the extrinsic parameters, that is, the rotation and translation between the cameras, still have to be found manually. In my Letter, we describe a robust and efficient bundle adjustment algorithm that is indispensable in realising a fully automatic, easy-to-use, and online extrinsic calibration method.

In a previous work referenced in my Letter, Su et al. proposed an efficient bundle adjustment algorithm, which alternates between solving for better extrinsic parameters and solving for better calibration target locations. In the presence of outliers, however, that algorithm tends to fail, a problem not addressed in their work. On the other hand, in our previous work, Kwon et al. proposed a robust bundle adjustment algorithm that works in the presence of outliers, but it is highly inefficient. In my Letter, my colleagues and I report a bundle adjustment algorithm that is as efficient as Su et al.’s algorithm and as robust as Kwon et al.’s (a rough sketch of the idea is given below).

The most attractive point of our work is that we can perform online bundle adjustment with the proposed algorithm. Although execution is not real-time on a single processor, the computational complexity of the algorithm is linear in the number of RGB-D cameras and the number of calibration images. In addition, the algorithm is highly parallelisable, which means that it can be implemented to run in real time given sufficient computing resources. This will enable online extrinsic calibration of multiple RGB-D cameras: once we compute an initial solution, we can add images of the spherical calibration target and apply our bundle adjustment algorithm frame by frame until a certain level of accuracy is attained.

We plan to capture 3D data of human subjects, as such data can be used to build a learning-based generative model of 3D human shape and pose, as well as to establish a parametric generative model. In addition, such a multi-view capture system can be used for analysing the interaction between a person and an object, or for analysing social behaviour among people.
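A minimal sketch of such a robust alternating scheme, written in Python with numpy, is given below. It is an illustration under stated assumptions rather than the Letter's exact formulation: every camera is assumed to detect the sphere centre in every frame, the first camera is fixed as the reference coordinate system, and a Geman-McClure-style weight (with an assumed residual scale sigma) stands in for the Letter's robust loss; the helper names weighted_kabsch and calibrate are illustrative.

```python
import numpy as np

def weighted_kabsch(src, dst, w):
    """Rigid (R, t) minimising sum_j w_j * ||R @ src_j + t - dst_j||^2."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)           # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                              # proper rotation
    return R, mu_d - R @ mu_s

def calibrate(P, iters=50, sigma=0.02):
    """P: (N, M, 3) sphere centres seen by N cameras over M frames.

    Rs/ts start at identity here for brevity; in practice they would be
    seeded with the initial solution mentioned in the interview.
    """
    N, M, _ = P.shape
    Rs = [np.eye(3) for _ in range(N)]   # camera 0 is the reference frame
    ts = [np.zeros(3) for _ in range(N)]
    W = np.ones((N, M))                  # robust per-observation weights
    for _ in range(iters):
        # Step 1: target locations = robust weighted mean of per-camera estimates.
        est = np.stack([P[i] @ Rs[i].T + ts[i] for i in range(N)])  # (N, M, 3)
        X = (W[:, :, None] * est).sum(axis=0) / W.sum(axis=0)[:, None]
        # Step 2: extrinsics = weighted Kabsch per camera (reference kept fixed).
        for i in range(1, N):
            Rs[i], ts[i] = weighted_kabsch(P[i], X, W[i])
        # Step 3: reweight from residuals so outliers lose influence
        # (Geman-McClure-style weight; an assumption, not the Letter's loss).
        est = np.stack([P[i] @ Rs[i].T + ts[i] for i in range(N)])
        r2 = ((est - X[None]) ** 2).sum(axis=2)
        W = sigma**2 / (sigma**2 + r2)
    return Rs, ts, X
```

Each sweep touches every observation once, so the cost is linear in the number of cameras and calibration frames, consistent with the complexity noted above, and both inner steps are independent across cameras and frames, so they parallelise naturally. For online use, newly detected sphere centres can be appended to P and a few further sweeps run frame by frame.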
The collected data and developed algorithms will be used to realise a virtual agent that appears to act like a human. In the future, it will be possible to generate a movie from a scenario alone, using such virtual actors and their human-like behaviours.

Traditionally, 3D computer vision has been a combination of multiple-view geometry theory and optimisation-based parameter estimation algorithms. Many research projects have contributed to building 3D models of our surrounding world, and it is now easy to access the online 3D maps provided by Google or Apple. Optimisation-based estimation algorithms require good initial solutions, which are obtained from these theories and from feature correspondences across multiple views. Recently, machine learning algorithms such as random forests and deep neural networks have been applied directly to raw data, successfully replacing the expert-knowledge-based methods for finding initial solutions. In addition, research interest is shifting from static scenes to dynamic objects. Deep learning-based methods are now applied to reconstructing and analysing human performance captures, from which we can obtain a generative model that can change its shape, pose, and facial expression. As the generated models become more accurate and realistic, we will be able to see virtual humanoids in the 3D maps of our surrounding world. I guess that it will take no more than ten years to see such human-like dynamic agents.