Abstract

Vegetable and fruit recognition can be considered a fine-grained visual categorization (FGVC) task, which is challenging due to the large intraclass variances and small interclass variances. A mainstream direction to address the challenge is to exploit fine-grained local/global features to enhance feature extraction and representation in the learning pipeline. However, unlike the human visual system, most existing FGVC methods only extract features from individual images during training. In contrast, human beings can learn discriminative features by comparing two different images. Inspired by this intuition, a recent FGVC method, named Attentive Pairwise Interaction Network (API-Net), takes an image pair as input for pairwise feature interaction and demonstrates superior performance on several open FGVC data sets. However, the accuracy of API-Net on VegFru, a domain-specific FGVC data set, is lower than expected, potentially due to the lack of spatialwise attention. Following this direction, we propose an FGVC framework named Attention-aware Interactive Features Network (AIF-Net) that refines API-Net by integrating an attentive feature extractor into the backbone network. Specifically, we employ a region proposal network (RPN) to generate a collection of informative regions and apply a biattention module to learn global and local attentive feature maps, which are fused and fed into an interactive feature learning subnetwork. The novel neural structure is verified through extensive experiments and shows consistent performance improvement over the state of the art (SOTA) on the VegFru data set, demonstrating its superiority in fine-grained vegetable and fruit recognition. We also discover that a concatenation fusion operation applied in the feature extractor, along with three top-scoring regions suggested by the RPN, can effectively boost performance.
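
For intuition, the extraction-and-fusion pipeline described above might be sketched as follows in PyTorch. This is a minimal sketch, not the authors' implementation: the backbone, RPN head, and bi-attention module are assumed interfaces, and the pooling and fusion widths are illustrative.

```python
import torch
import torch.nn as nn

class AIFExtractorSketch(nn.Module):
    """Illustrative sketch of an attention-aware feature extractor:
    a backbone produces a global feature map, an RPN-style head proposes
    informative regions, a bi-attention module attends over both, and the
    resulting global/local features are fused by concatenation.
    All module interfaces here are hypothetical, not the paper's code."""

    def __init__(self, backbone, rpn, bi_attention, feat_dim, num_regions=3):
        super().__init__()
        self.backbone = backbone          # e.g. a ResNet-50 trunk
        self.rpn = rpn                    # proposes scored candidate regions
        self.bi_attention = bi_attention  # attention over feature maps
        self.num_regions = num_regions    # top-scoring regions to keep
        # project the concatenated (global + local) features back to feat_dim;
        # assumes the backbone channel width equals feat_dim
        self.fuse = nn.Linear(feat_dim * (1 + num_regions), feat_dim)

    def forward(self, images):
        fmap = self.backbone(images)                       # (B, C, H, W)
        regions = self.rpn(fmap, top_k=self.num_regions)   # k local feature maps
        global_feat = self.bi_attention(fmap).mean(dim=(2, 3))          # (B, C)
        local_feats = [self.bi_attention(r).mean(dim=(2, 3)) for r in regions]
        fused = torch.cat([global_feat] + local_feats, dim=1)  # concatenation fusion
        return self.fuse(fused)                            # (B, feat_dim)
```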

Highlights

  • Despite the consistent improvement in the application of convolutional neural networks (CNNs) to various computer vision tasks, fine-grained visual categorization (FGVC) is still a challenging task due to the large intraclass variance, small interclass variance, and the difficulties in obtaining part annotations [1,2]

  • We propose an Attention-aware Interactive Features Network (AIF-Net) that refines the Attentive Pairwise Interaction Network (API-Net) by integrating an attentive feature extractor into the backbone network

  • We discover that a concatenation fusion operation applied in the feature extractor, along with three top-scoring regions suggested by a region proposal network (RPN), can effectively boost the performance (see the sketch after this list)
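
As a rough illustration of the last highlight, selecting the three top-scoring, mutually non-overlapping RPN proposals can be done with standard non-maximum suppression. The sketch below uses torchvision's nms; the IoU threshold is an assumption, not a value from the paper.

```python
import torch
from torchvision.ops import nms

def select_top_regions(boxes, scores, k=3, iou_threshold=0.5):
    """Keep the k highest-scoring, mutually non-overlapping proposals.

    boxes:  (N, 4) tensor of candidate regions in (x1, y1, x2, y2) format
    scores: (N,)   tensor of RPN informativeness scores
    Returns the (<= k, 4) boxes to crop local features from.
    """
    keep = nms(boxes, scores, iou_threshold)  # indices sorted by decreasing score
    return boxes[keep[:k]]                    # top-k surviving regions
```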

Summary

Introduction

Despite the consistent improvement in the application of convolutional neural networks (CNNs) to various computer vision tasks, fine-grained visual categorization (FGVC) remains challenging due to the large intraclass variance, small interclass variance, and the difficulties in obtaining part annotations. A common goal of existing methods is to enhance a model’s capability to exploit distinguishable fine-grained features from local or global regions for performance boosting. Their main difference is that the former focus on certain informative regions of an image, while the latter aim to find critical patterns from the whole image. Humans often recognize fine-grained objects by comparing image pairs to extract subtle visual differences that can be used as distinguishable features. Inspired by this intuition, recent efforts have explored ways to learn interactive features from image pairs. The proposed AIF-Net consists of three components: (1) an attentive feature extractor that allows the network to identify and learn from critical areas in an image where distinguishable patterns may reside; (2) an interactive feature learning subnetwork that compares the features of an image pair to capture subtle contrastive cues; and (3) a softmax classifier with individual and pair regularization terms.
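
As a sketch of the pairwise interaction idea that AIF-Net inherits from API-Net: a mutual vector summarizes a feature pair, and sigmoid gates derived from it highlight contrastive channels in each feature, yielding "self"- and "other"-activated variants of each image's feature. The MLP shape and feature dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Sketch of API-Net-style pairwise feature interaction (assumed shapes)."""

    def __init__(self, feat_dim):
        super().__init__()
        # maps the concatenated pair to a mutual vector of the same width
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * 2, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x1, x2):
        xm = self.mlp(torch.cat([x1, x2], dim=1))   # mutual vector for the pair
        g1 = torch.sigmoid(xm * x1)                 # channel gate for image 1
        g2 = torch.sigmoid(xm * x2)                 # channel gate for image 2
        # each image gets a self-activated and an other-activated feature,
        # which are then fed to the classifier with regularization terms
        x1_self, x1_other = x1 + x1 * g1, x1 + x1 * g2
        x2_self, x2_other = x2 + x2 * g2, x2 + x2 * g1
        return x1_self, x1_other, x2_self, x2_other
```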

Methods Based on Localization–Classification Subnetworks
Methods Based on End-to-End Feature Encoding
Methods Using Data Augmentation
The Attention-Aware Interactive Features Network
Attention Modules in AIF-Net
Region Proposal Network
Interactive Feature Learning
Softmax Classifier with Individual and Pair Regularization Terms
Data Set
Implementation Details
Performance Metric
Benchmarks
Fusion Operation for Global and Local Feature Maps
Number of Local Informative Regions
Overall Performance Comparison
Findings
Discussion