Mobile content-based multimedia analysis has attracted much attention with the growing popularity of high-end mobile devices. Most previous systems focus on mobile visual search, i.e., searching for images that contain visually duplicate or near-duplicate objects (e.g., products and landmarks). There remains a strong need for effective mobile video classification solutions, where videos that are not visually duplicate or near-duplicate but belong to similar high-level semantic categories can be identified. In this work, we develop a mobile video classification system based on multi-modal analysis. On the mobile side, both visual and audio features are extracted from the input video and compressed into compact hash bits for efficient transmission. On the server side, the received hash bits are used to compute the audio and visual Bag-of-Words representations for multi-modal concept classification. We propose a novel method in which hash functions are learned from the multi-modal information carried by the visual and audio codewords. Compared with the traditional approach of computing visual and audio hash functions separately from raw visual and audio local features, our method exploits the co-occurrences of audio and visual codewords as augmenting information and significantly improves classification performance. The cost budget of our system for mobile data storage, computation, and transmission is comparable to that of state-of-the-art mobile visual search systems. Extensive experiments over 10,000 YouTube videos show that our system achieves classification accuracy comparable to conventional server-based video classification systems that use uncompressed raw descriptors.
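To make the client/server split concrete, the following is a minimal sketch of the pipeline the abstract describes: local descriptors are hashed to compact bits on the mobile side, and the server assigns the received bits to codewords to form a Bag-of-Words histogram for classification. The sign-of-random-projection hash used here is only a generic stand-in for the learned multi-modal hash functions proposed in the paper, and all names, dimensions, and the Hamming-distance codeword assignment are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's method): mobile-side hashing of local
# descriptors into compact bits, server-side Bag-of-Words from the bits.
import numpy as np

rng = np.random.default_rng(0)

NUM_BITS = 32          # assumed hash length per local descriptor
DESC_DIM = 128         # e.g., a SIFT-like visual descriptor dimension (assumed)
NUM_CODEWORDS = 1000   # assumed Bag-of-Words vocabulary size

# Stand-in projections; the paper instead learns hash functions from
# audio-visual codeword co-occurrences.
projections = rng.standard_normal((NUM_BITS, DESC_DIM))
codebook_bits = rng.integers(0, 2, size=(NUM_CODEWORDS, NUM_BITS))  # hashed codewords

def hash_descriptor(desc: np.ndarray) -> np.ndarray:
    """Mobile side: compress one local descriptor into NUM_BITS hash bits."""
    return (projections @ desc > 0).astype(np.uint8)

def bits_to_bow(bit_vectors: np.ndarray) -> np.ndarray:
    """Server side: assign each received bit vector to the nearest hashed
    codeword (Hamming distance) and accumulate a normalized BoW histogram."""
    hist = np.zeros(NUM_CODEWORDS)
    for bits in bit_vectors:
        dists = np.count_nonzero(codebook_bits != bits, axis=1)
        hist[np.argmin(dists)] += 1
    return hist / max(hist.sum(), 1)

# Toy usage: hash 50 synthetic "descriptors" and build the histogram that
# would be fed to a downstream concept classifier.
descriptors = rng.standard_normal((50, DESC_DIM))
bits = np.array([hash_descriptor(d) for d in descriptors])
bow = bits_to_bow(bits)
print(bow.shape, bow.sum())  # (1000,) 1.0
```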