Abstract

Visual food analysis has recently received increasing attention in the computer vision community due to its wide range of application scenarios, e.g., dietary nutrition management, smart restaurants, and personalized diet recommendation. Because food images are unstructured, with complex and non-fixed visual patterns, mining food-related semantic-aware regions is crucial. Furthermore, the ingredients in food images are semantically related to each other through cooking habits and bear significant semantic relationships to food categories under the hierarchical food classification ontology. Modeling the long-range semantic relationships among ingredients and the category-ingredient semantic interactions is therefore beneficial for ingredient recognition and food analysis. Taking these factors into consideration, we propose a multi-task learning framework for food category and ingredient recognition. The framework consists of a food-oriented Transformer named Convolution-Enhanced Bi-Branch Adaptive Transformer (CBiAFormer) and a multi-task category-ingredient recognition network called Structural Learning and Cross-Task Interaction (SLCI). To capture the complex and non-fixed fine-grained patterns of food images, we propose a query-aware, data-adaptive attention mechanism in CBiAFormer called Bi-Branch Adaptive Attention (BiA-Attention), which comprises a local fine-grained branch and a global coarse-grained branch that mine local and global semantic-aware regions for different input images through an adaptive assignment of candidate key/value sets to each query. Additionally, a convolutional patch embedding module is proposed to extract the fine-grained features that are often neglected by Transformers. To fully utilize ingredient information, we propose SLCI, which consists of cross-layer attention to model the semantic relationships between ingredients and two cross-task interaction modules to mine the semantic interactions between categories and ingredients. Extensive experiments show that our method achieves competitive performance on three mainstream food datasets (ETH Food-101, Vireo Food-172, and ISIA Food-200). Visualization analyses of CBiAFormer and SLCI on both tasks confirm the effectiveness of our method. Code and models are available at https://github.com/Liuyuxinict/CBiAFormer.
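To make the bi-branch idea concrete, below is a minimal, illustrative PyTorch sketch of the general mechanism the abstract describes, not the authors' implementation: each query attends both to a local fine-grained candidate set (full-resolution keys/values) and to a global coarse-grained candidate set (pooled keys/values), and a learned per-query gate fuses the two branches. All class and parameter names here (BiBranchAttentionSketch, pool_size, gate) are hypothetical; the actual adaptive candidate key/value assignment is defined in the official repository linked above.

```python
import torch
import torch.nn as nn

class BiBranchAttentionSketch(nn.Module):
    """Illustrative two-branch attention: local fine-grained + global coarse-grained."""

    def __init__(self, dim: int, num_heads: int = 4, pool_size: int = 7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # builds coarse-grained candidates
        self.gate = nn.Linear(dim, 2)                # per-query weights for the two branches
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map
        B, H, W, C = x.shape
        q, k, v = self.qkv(x.reshape(B, H * W, C)).chunk(3, dim=-1)

        # Local fine-grained branch: attend over full-resolution keys/values.
        local = self._attend(q, k, v)

        # Global coarse-grained branch: attend over spatially pooled keys/values.
        pooled = self.pool(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)  # (B, P*P, C)
        pk, pv = self.qkv(pooled).chunk(3, dim=-1)[1:]
        global_ = self._attend(q, pk, pv)

        # Query-aware adaptive fusion: each query weights the two branches.
        w = self.gate(q).softmax(dim=-1)  # (B, N, 2)
        out = w[..., :1] * local + w[..., 1:] * global_
        return self.proj(out).reshape(B, H, W, C)

    def _attend(self, q, k, v):
        # Standard multi-head scaled dot-product attention.
        B, N, C = q.shape
        h = self.num_heads
        q = q.reshape(B, N, h, C // h).transpose(1, 2)
        k = k.reshape(B, -1, h, C // h).transpose(1, 2)
        v = v.reshape(B, -1, h, C // h).transpose(1, 2)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, N, C)
```

As a sanity check, `BiBranchAttentionSketch(dim=64)(torch.randn(2, 14, 14, 64))` returns a tensor of the same (2, 14, 14, 64) shape; the sketch only conveys the two-branch, query-gated structure, and should not be read as the paper's exact candidate-assignment scheme.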
