Abstract

In this paper, we examine a user modeling task based on processing the gallery of photos and videos on a mobile device. We propose a novel engine for preference prediction based on scene recognition, object detection, and facial analysis. First, all faces in a gallery are clustered, and all private photos and videos containing faces from large clusters are processed on the embedded system in offline mode. The remaining photos may be sent to a remote server to be analyzed by deeper, more sophisticated neural networks. The visual features of each photo are obtained from scene recognition and object detection models and then aggregated into a single descriptor by a neural attention unit. The proposed pipeline is implemented in a mobile Android application. Experimental results on the Photo Event Collection, the Web Image Dataset for Event Recognition, and Amazon Fashion data demonstrate that images can be processed efficiently without significant accuracy degradation.
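The attention-based aggregation step can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it assumes per-photo feature vectors (e.g. concatenated scene and object scores) and a single learned attention query vector, and shows how softmax attention weights would pool them into one gallery-level descriptor:

```python
import numpy as np

def attention_aggregate(features: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Pool per-photo features (n, d) into one descriptor (d,).

    `query` is a hypothetical learned attention vector of size d;
    in the paper this role is played by a trained neural attention unit.
    """
    scores = features @ query                  # (n,) relevance score per photo
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ features                  # weighted average descriptor (d,)

# Toy usage: 5 photos, 8-dimensional visual features
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))
q = rng.standard_normal(8)
descriptor = attention_aggregate(feats, q)
print(descriptor.shape)  # (8,)
```

With uniform scores this reduces to mean pooling; the attention weights let informative photos dominate the gallery descriptor.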
