Toward human-centric deep video understanding

Wenjun Zeng

doi:10.1017/atsip.2019.26

Abstract

People are the very heart of our daily work and life. As we strive to leverage artificial intelligence to empower every person on the planet to achieve more, we need to understand people far better than we can today. Human–computer interaction plays a significant role in human-machine hybrid intelligence, and human understanding becomes a critical step in addressing the tremendous challenges of video understanding. In this paper, we share our views on why and how to use a human-centric approach to address challenging video understanding problems. We discuss human-centric vision tasks and their status, highlighting the challenges and how our understanding of human brain functions can be leveraged to address some of them effectively. We show that semantic models, view-invariant models, and spatial-temporal visual attention mechanisms are important building blocks. We also discuss future perspectives of video understanding.

Highlights

Artificial intelligence (AI) is the buzzword in the technology world today.
The first foundation is the availability of big data, e.g., thousands of hours of annotated speech and tens of millions of labeled images.
The second foundation is the availability of huge computing resources, such as GPU cards and cloud server clusters.

Summary

INTRODUCTION

In the past few years, machines have beaten humans in many ways: facial recognition, image recognition, IQ tests, gaming, conversational speech recognition, reading comprehension, and language translation, to name a few. All these breakthroughs are attributed to three pillars of technological innovation. The first is the availability of big data, e.g., thousands of hours of annotated speech and tens of millions of labeled images. The second is the availability of huge computing resources, such as GPU cards and cloud server clusters. On top of these two, we have witnessed significant progress in advanced machine learning, such as deep learning and reinforcement learning. What amazing progress in just four years, from a system that was far from practical to one that beats human performance. This demonstrates the power of deep learning. These are all challenges that video understanding faces, making it very difficult to put video analytics technologies into practice. Given the challenges in making video understanding technologies practical, we believe it is important to take a human-centric approach and focus on the features that are most critical for bringing the technologies to market.

HUMAN-CENTRIC

HUMAN-CENTRIC VISION TASKS

FUTURE PERSPECTIVES