Abstract

With the growth of businesses that distribute movies, software, music, and other content, a very large volume of content has accumulated, and recommendation systems that anticipate user demand for content have become correspondingly important. In distribution businesses, accurate content recommendation is required to attract and retain users, and a highly accurate recommendation system in turn requires that the recommended content be accurately classified. The classification methods used are mainly techniques such as naive Bayes, SGD (stochastic gradient descent), and SVM (support vector machine). If all of the information on the recommended items is applied in the classification process, high accuracy can be expected, but at the cost of heavy computation, long service times, and poor scalability. Given this inefficiency, efficient classification based on content metadata is required. Metadata are expressed in terms of domain concepts, relations, types, and attributes so that the complicated relations among multimodal data (text, images, and video) can be processed efficiently. Most classification systems use single-modal data to express one piece of knowledge about an item in a domain. Single-modal data are of limited use for improving classification accuracy because they do not include the useful information provided by other knowledge types. Therefore, in this paper, we propose MMCNet, a deep learning–based multimodal classification model that uses dynamic knowledge. The proposed method consists of a classification model that applies a CNN (convolutional neural network), based on human learning principles, to multimodal data combining text and image knowledge. Using a Web robot agent, multimodal data are collected from the TMDb (The Movie Database) data set, which includes a variety of single-modal data. In preprocessing, knowledge integration, knowledge conversion, and knowledge reduction are performed to create a quantified knowledge base. For text data, sentences are refined through morphological analysis and converted to numerical vectors by word embedding. Image data are likewise converted to numerical feature vectors using a vector-conversion library. The resulting feature vectors are combined into multimodal learning data, on which the classification model is trained. To address the problem of memory and computational resources, vector model–based meta-knowledge is expanded through expression, conversion, alignment, inference, and deep learning. To evaluate its performance, the proposed model was compared with conventional classification methods in terms of accuracy, recall, and F1-score, and it achieved higher accuracy, recall, and F1-scores than the conventional methods. In addition, the proposed model was implemented as a deep learning–based multimodal classification system with a graphical user interface that lets users give feedback on the classification results by adjusting classification parameters. Through the convergence of the knowledge bases of various domains with multimodal deep learning, the dynamic knowledge that influences user preference is inferred.
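
To make the architecture concrete, the following is a minimal sketch, assuming PyTorch, of the kind of text-image fusion classifier the abstract describes: a word-embedding branch followed by a 1-D convolution for text, a projection of precomputed image feature vectors, and a classifier over the concatenated modality vectors. This is not the authors' implementation; aside from the model name MMCNet, all dimensions, layer choices, and hyperparameters here are illustrative assumptions.

```python
# Illustrative sketch only: layer sizes and hyperparameters are assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn

class MMCNet(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128,
                 img_feat_dim=512, num_classes=10):
        super().__init__()
        # Text branch: word embedding plus a 1-D convolution, mirroring the
        # "morphological analysis + word embedding" vectorization step.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_conv = nn.Sequential(
            nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Image branch: projects precomputed image feature vectors,
        # standing in for the image-to-vector conversion step.
        self.img_fc = nn.Sequential(nn.Linear(img_feat_dim, 128), nn.ReLU())
        # Fusion: concatenate both modality vectors, then classify.
        self.classifier = nn.Linear(128 + 128, num_classes)

    def forward(self, token_ids, img_features):
        t = self.embed(token_ids).transpose(1, 2)  # (B, embed_dim, seq_len)
        t = self.text_conv(t).squeeze(-1)          # (B, 128)
        v = self.img_fc(img_features)              # (B, 128)
        return self.classifier(torch.cat([t, v], dim=1))

# Example forward pass on dummy data.
model = MMCNet()
tokens = torch.randint(0, 20000, (4, 50))  # batch of 4 token sequences
imgs = torch.randn(4, 512)                 # batch of 4 image feature vectors
logits = model(tokens, imgs)               # (4, 10) class scores
```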
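The reported metrics can be computed on any set of predictions with scikit-learn, for example as below; the macro averaging shown is an assumption, since the abstract does not state which averaging scheme was used.

```python
# Minimal evaluation sketch, assuming scikit-learn. y_true and y_pred are
# illustrative placeholders for test labels and model predictions.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
# Macro averaging weights all classes equally, a common multi-class choice;
# the paper's averaging scheme may differ.
print("Recall:  ", recall_score(y_true, y_pred, average="macro"))
print("F1-score:", f1_score(y_true, y_pred, average="macro"))
```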
