Abstract

The major goals of texture research in computer vision are to understand, model, and process texture and, ultimately, to simulate human visual information processing using computer technologies. The field has witnessed remarkable advances in material recognition using deep convolutional neural networks (DCNNs), which have enabled applications such as self-driving cars, facial and gesture recognition, and automatic number plate recognition. However, it remains difficult for computer vision to “express” texture the way humans do, because texture description is inherently ambiguous and has no single correct answer. In this paper, we develop a DCNN-based computer vision method that expresses the texture of materials. To achieve this goal, we focus on Japanese “sound-symbolic” words, which can describe differences in texture sensation at a fine resolution and are known to have strong and systematic sensory-sound associations. Because the phonemes of Japanese sound-symbolic words characterize categories of texture sensations, our method generates the phonemes and structure of sound-symbolic words that probabilistically correspond to the input images. In our evaluation, the sound-symbolic words output by the system achieved an accuracy of about 80%.
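
As a rough illustration of this idea (a minimal sketch, not the authors' implementation), the code below pairs a small CNN image encoder with a GRU decoder that emits one phoneme per step, so each output position is a probability distribution over a phoneme vocabulary. The phoneme list, the name SoundSymbolicNet, and all layer sizes are hypothetical choices for this example.

    # A minimal sketch, assuming a CNN encoder and a recurrent phoneme decoder.
    # The phoneme vocabulary below is a hypothetical subset for illustration.
    import torch
    import torch.nn as nn

    PHONEMES = ["<s>", "</s>", "sa", "ra", "zu", "fu", "wa", "tsu"]  # hypothetical

    class SoundSymbolicNet(nn.Module):  # hypothetical name
        def __init__(self, n_phonemes=len(PHONEMES), dim=256):
            super().__init__()
            # Small CNN encoder; a pretrained backbone could replace it.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
            )
            self.embed = nn.Embedding(n_phonemes, dim)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, n_phonemes)

        def forward(self, images, prev_phonemes):
            # The image feature initializes the decoder's hidden state.
            h0 = self.encoder(images).unsqueeze(0)        # (1, B, dim)
            out, _ = self.decoder(self.embed(prev_phonemes), h0)
            return self.head(out)                         # (B, T, n_phonemes) logits

    model = SoundSymbolicNet()
    images = torch.randn(2, 3, 64, 64)                    # two texture images
    prev = torch.tensor([[0, 2, 3], [0, 4, 5]])           # teacher-forced phoneme ids
    probs = model(images, prev).softmax(dim=-1)           # per-step phoneme probabilities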

Highlights

  • Recent years have witnessed remarkable advances in machine learning

  • We developed a deep convolutional neural network (DCNN)-based computer vision system that expresses the texture of materials using sound-symbolic words (SSWs)

  • The remainder of this paper is organized as follows: related work on material datasets is reviewed in section “Material Datasets”; the new material image dataset and the DCNN learning model are described in section “Materials and Methods”; the results are presented in section “Results”; and the model is validated by the accuracy rate of SSWs output by the system for images in section “Accuracy Evaluation”


Introduction

Recent years have witnessed remarkable advances in machine learning. One important breakthrough is “deep learning,” a family of machine learning algorithms that automatically extract high-level features from data through deep architectures composed of multiple non-linear transformations. Convolutional neural networks (CNNs) combined with large-scale datasets such as ImageNet (Russakovsky et al., 2014) have made great progress in object and material recognition as well as scene classification. When CNNs are used, the effective features of an image are extracted automatically and quantitatively during the learning process (Krizhevsky et al., 2012; Girshick et al., 2014; Sermanet et al., 2014; Simonyan and Zisserman, 2014; Zeiler and Fergus, 2014; Szegedy et al., 2015). Google obtained state-of-the-art results (an error rate of 6.6%) in object category recognition in the 2014 ImageNet Large Scale Visual Recognition Challenge. Effective training methods such as dropout have also been reported (Srivastava et al., 2014). Cimpoi et al. (2016) employed very deep CNNs for material recognition and achieved recognition rates of 82.2% on the Flickr Material Database (FMD) and 75.5% on the Describable Texture Dataset (DTD).
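
For concreteness, the snippet below illustrates the dropout regularization cited above (Srivastava et al., 2014) in a classifier head over CNN features; the layer sizes and the class count are arbitrary values chosen only for this example.

    # Dropout randomly zeroes activations during training, discouraging
    # co-adaptation of features; at evaluation time it is disabled.
    import torch
    import torch.nn as nn

    classifier = nn.Sequential(
        nn.Linear(4096, 1024), nn.ReLU(),
        nn.Dropout(p=0.5),     # zero 50% of units at random while training
        nn.Linear(1024, 10),   # e.g., 10 material categories (arbitrary)
    )

    features = torch.randn(8, 4096)      # a batch of CNN features
    classifier.train()                   # dropout active
    train_logits = classifier(features)
    classifier.eval()                    # dropout off (PyTorch rescales at train time)
    test_logits = classifier(features)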
