Deep Spectral Siamese Network For Heterogeneous Object Verification In Amazon Robotic Warehouse
Automatic object verification is an important task in Amazon fulfillment centers, where millions of shelves are filled with a wide assortment of packages of various sizes. In the current unstructured Amazon warehouse environment it is still cheaper to hire humans (even in the U.S.) than to develop customized robotic solutions. New advances in computer vision and deep learning could revolutionize Amazon robotic warehouses through new algorithms that let robots learn to find objects. The success of many deep learning algorithms depends on large amounts of correctly annotated data and high-quality images. The images captured by Amazon robots are partially or completely occluded by plastic tape, and the contents of some bins may not match the recorded inventory of those bins. Moreover, since most objects appear only once or twice in the entire dataset, which spans an extremely large number of categories, it is difficult to train each class to a satisfactory level. To address these issues, we developed a novel deep spectral Siamese architecture for efficient verification with improved accuracy. Our proposed spectral Siamese network can accurately learn correlations between objects that have only a few training examples in noisy, blurry images. Experimental results on the Amazon Bin Image dataset demonstrate the effectiveness of the proposed framework both qualitatively and quantitatively.
- Conference Article
- 10.1109/icus50048.2020.9274829
- Nov 27, 2020
Siamese networks have obtained widespread attention in the field of visual tracking. In this paper, we propose a high-performance model based on a deep Siamese network (SiamFC-R22) for real-time visual tracking. In response to the problem that most existing Siamese trackers cannot take advantage of the richer feature representations provided by deep networks, we construct a deep backbone architecture with a reasonable receptive field and stride by stacking redesigned residual modules. Furthermore, we propose a multi-layer aggregation module (MLA) to effectively fuse features from different layers. MLA consists of an RAC branch and an IL branch: RAC boosts the ability to learn representations of high-level semantic features, while IL captures a better expression of the low-level features that contain more detailed information. Comprehensive experiments on the OTB2015 benchmark show that our proposed SiamFC-R22 achieves an AUC of 0.667. Meanwhile, it runs at over 60 frames per second, exceeding state-of-the-art competitors by significant margins.
- Research Article
2
- 10.1117/1.jei.29.4.043024
- Aug 25, 2020
- Journal of Electronic Imaging
Multiple region proposal networks (RPNs) have recently been combined with Siamese networks built on deeper backbones for tracking, showing excellent accuracy with high efficiency. Although the destruction of strict translation invariance caused by network padding in the original ResNet-50 is mitigated by a custom sampling strategy, its impact is not eliminated from the network structure itself, and multilayer feature fusion remains insufficient. To this end, we propose an object tracking framework based on SiamRPN with deeper backbone networks and a cascaded RPN (D-CRPN). First, we exploit cropping-inside residual units to reform ResNet-50, breaking the spatial-invariance restriction, and train robust backbone networks for visual tracking. Then, feature transfer blocks are proposed to achieve effective integration of the outputs of multiple blocks in a specific network stage. Finally, to improve the robustness of our tracker, we present a quality measure for the synthetic response maps of the RPN modules and use it to calculate adaptive weights for the linear weighting method. Extensive evaluation on the OTB100, VOT2016 and VOT2018 benchmarks demonstrates that the proposed D-CRPN tracker outperforms most state-of-the-art approaches while maintaining real-time tracking speed.
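The quality-weighted linear fusion of response maps described in this abstract can be sketched generically. The quality measure below (peak-to-mean ratio of a response map) is an illustrative stand-in, not the paper's own definition:

```python
def quality(response):
    # Illustrative quality proxy for a 2-D response map: peak-to-mean ratio.
    # (The paper defines its own quality measure; this is a stand-in.)
    flat = [v for row in response for v in row]
    return max(flat) / (sum(flat) / len(flat) + 1e-9)

def fuse_responses(responses):
    # Linearly combine response maps with weights proportional to their quality.
    qs = [quality(r) for r in responses]
    ws = [q / sum(qs) for q in qs]
    rows, cols = len(responses[0]), len(responses[0][0])
    return [[sum(w * r[i][j] for w, r in zip(ws, responses))
             for j in range(cols)] for i in range(rows)]

# A sharply peaked map gets more weight than a flat, uninformative one.
peaked = [[1.0, 0.0], [0.0, 0.0]]
flat = [[2.0, 2.0], [2.0, 2.0]]
fused = fuse_responses([peaked, flat])
```

With these toy maps, `quality(peaked)` is about 4 and `quality(flat)` about 1, so the peaked map contributes roughly 80% of the fused response.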
- Book Chapter
5
- 10.1007/978-3-319-51811-4_11
- Dec 31, 2016
Image recognition using deep network models has achieved remarkable progress in recent years. However, fine-grained recognition remains a big challenge due to the lack of large-scale, well-labeled datasets for training. In this paper, we study a deep-network-based method for fine-grained image recognition that utilizes click-through logs from search engines. We use both click counts and probability values to filter out the noise in the click-through logs. Furthermore, we propose a deep Siamese network model to fine-tune the classifier, emphasizing the subtle differences between classes while tolerating variation within the same class. Our method is evaluated by training on the Bing Clickture-Dog dataset and testing on a well-labeled dog-breed dataset. The results demonstrate a great improvement over naive training.
- Conference Article
3
- 10.1117/12.2268703
- Mar 17, 2017
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
Object-to-features vectorisation is a hard problem for objects that are difficult to distinguish. Siamese and triplet neural networks are among the more recent tools for this task. However, most networks used are very deep and prove hard to compute in an Internet of Things setting. In this paper, a computationally efficient neural network is proposed for real-time object-to-features vectorisation into a Euclidean metric space. We use the L<sub>2</sub> distance to reflect feature-vector similarity during both training and testing. In this way, the feature vectors we produce can easily be classified with a K-Nearest Neighbours classifier. This approach can be used to train networks to vectorise "problematic" objects such as images of human faces and keypoint image patches, for example keypoints on Arctic maps and the surrounding marine areas.
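The pipeline of embedding into a Euclidean metric space and classifying with K-Nearest Neighbours can be sketched independently of any particular network; the two-dimensional "embeddings" below are toy stand-ins for the vectors a trained Siamese network would produce:

```python
import math
from collections import Counter

def l2(a, b):
    # Euclidean (L2) distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, gallery, k=3):
    # gallery: (feature_vector, label) pairs produced by the embedding network
    nearest = sorted(gallery, key=lambda item: l2(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy gallery: two well-separated clusters standing in for learned embeddings.
gallery = [([0.1, 0.2], "face_A"), ([0.0, 0.3], "face_A"),
           ([0.9, 0.8], "face_B"), ([1.0, 0.7], "face_B")]
```

A query embedded near the first cluster, e.g. `knn_classify([0.05, 0.25], gallery)`, is voted into `"face_A"` by its nearest neighbours.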
- Research Article
2
- 10.1109/access.2022.3143847
- Jan 1, 2022
- IEEE Access
A critical aim of pansharpening is to fuse coherent spatial and spectral features from panchromatic and multispectral images, respectively. This study proposes a deep-Siamese-network-based pansharpening model as a two-stage framework in a multiscale setting. In the first stage, a Siamese network learns a common feature space between panchromatic and multispectral bands. The second stage fuses the output feature maps of the Siamese network. The parameters of these two stages are shared across scales in order to add spatial information consistently across scales, while spectral information is preserved by skip connections from the input multispectral image. This multi-level parameter-sharing mechanism in the pyramidal reconstruction of the pansharpened image better preserves spatial and spectral details simultaneously. Experiments with the deep Siamese network in the multiscale setting (to obtain inter-band similarity among different sensor data) show that it outperforms several recent pansharpening methods.
- Research Article
10
- 10.1016/j.asoc.2022.109683
- Oct 17, 2022
- Applied soft computing
McS-Net: Multi-class Siamese network for severity of COVID-19 infection classification from lung CT scan slices
- Research Article
8
- 10.1007/s00521-022-08115-2
- Dec 16, 2022
- Neural Computing and Applications
Transfer learning schemes based on deep networks trained on huge image corpora offer state-of-the-art technology in computer vision. Here, supervised and semi-supervised approaches constitute efficient technologies that work well with comparably small data sets. Yet such applications are currently restricted to domains where suitable deep network models are readily available. In this contribution, we address an important application area in biotechnology, the automatic analysis of CHO-K1 suspension growth in microfluidic single-cell cultivation, where the data characteristics are very dissimilar to existing domains and trained deep networks cannot easily be adapted by classical transfer learning. We propose a novel transfer learning scheme that expands the recently introduced Twin-VAE architecture, which is trained on realistic and synthetic data, and we adapt its specialized training procedure to the transfer learning domain. In this specific domain, often few or no labels exist and annotations are costly. We investigate a novel transfer learning strategy that incorporates simultaneous retraining on natural and synthetic data, using an invariant shared representation as well as suitable target variables, while learning to handle unseen data from a different microscopy technology. We show the superiority of this variation of our Twin-VAE architecture over the state-of-the-art transfer learning methodology in image processing as well as over classical image processing technologies; the advantage persists even with strongly shortened training times and leads to satisfactory results in this domain. The source code is available at https://github.com/dstallmann/transfer_learning_twinvae, works cross-platform, and is open-source, free (MIT-licensed) software. We make the data sets available at https://pub.uni-bielefeld.de/record/2960030.
- Conference Article
- 10.1109/icaica52286.2021.9498064
- Jun 28, 2021
Siamese-network-based object tracking methods usually remove the padding in the backbone network to ensure translation invariance of the convolutions. As a result, the trained network cannot extract boundary information well. To solve this problem, this paper proposes a boundary expansion and redundancy cropping method to improve the boundary-information extraction ability of object tracking Siamese networks. First, we use the proposed method to modify SiamFC (Fully-Convolutional Siamese Networks) with AlexNet and VGGNet as backbones, respectively. Experimental results on the expanded GOT-10k dataset show that the proposed boundary expansion and redundancy cropping method clearly improves tracking performance. We then use residual networks of different depths as backbones to further test the effect of the proposed method on deep networks. The improved backbone using boundary expansion and redundancy cropping suppresses performance degradation better than the original backbone without padding, and achieves higher accuracy at multiple depths. Results on OTB, TColor-128 and other datasets also show that the proposed method improves the success rate and precision rate while maintaining real-time performance.
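The pad-then-crop idea can be sketched on a plain 2-D array: expand the input boundary before convolution so border content is seen, then crop the padding-contaminated border of the resulting feature map. This is an illustrative sketch under those assumptions, not the authors' implementation:

```python
def expand_boundary(img, pad):
    # Zero-pad the border so that convolutions can respond to boundary content.
    h, w = len(img), len(img[0])
    out = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            out[i + pad][j + pad] = img[i][j]
    return out

def crop_redundancy(fmap, crop):
    # Discard the border of a feature map, where responses are padding-contaminated.
    return [row[crop:len(row) - crop] for row in fmap[crop:len(fmap) - crop]]

patch = [[1, 2], [3, 4]]
expanded = expand_boundary(patch, 1)     # 4x4, zeros surrounding the original
restored = crop_redundancy(expanded, 1)  # with no conv in between, cropping undoes the expansion
```

In a real tracker a (no-padding) convolution would sit between the two steps, so only the redundant border responses are discarded, not the content.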
- Research Article
20
- 10.1145/3184745
- Apr 25, 2018
- ACM Transactions on Multimedia Computing, Communications, and Applications
Features extracted by deep networks have been popular in many visual search tasks. This article studies deep network structures and training schemes for mobile visual search. The goal is to learn an effective yet portable feature representation that is suitable for bridging the domain gap between mobile user photos and (mostly) professionally taken product images while keeping the computational cost acceptable for mobile-based applications. The technical contributions are twofold. First, we propose an alternative of the contrastive loss popularly used for training deep Siamese networks, namely robust contrastive loss, where we relax the penalty on some positive and negative pairs to alleviate overfitting. Second, a simple multitask fine-tuning scheme is leveraged to train the network, which not only utilizes knowledge from the provided training photo pairs but also harnesses additional information from the large ImageNet dataset to regularize the fine-tuning process. Extensive experiments on challenging real-world datasets demonstrate that both the robust contrastive loss and the multitask fine-tuning scheme are effective, leading to very promising results with a time cost suitable for mobile product search scenarios.
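A minimal sketch of how a contrastive loss can be "relaxed" to penalize easy pairs less, in the spirit of the robust contrastive loss described above. The `pos_slack` formulation is an assumption for illustration, not the paper's exact form:

```python
def contrastive_loss(d, is_positive, margin=1.0):
    # Standard contrastive loss on a pair distance d: pull positive pairs
    # toward 0, push negative pairs beyond `margin`.
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

def robust_contrastive_loss(d, is_positive, margin=1.0, pos_slack=0.2):
    # Relaxed variant: positive pairs already closer than `pos_slack` incur
    # no penalty, so easy pairs stop driving the embedding and overfitting
    # to them is reduced. (`pos_slack` is an illustrative assumption.)
    if is_positive:
        return max(0.0, d - pos_slack) ** 2
    return max(0.0, margin - d) ** 2
```

For an easy positive pair at distance 0.1, the standard loss is still nonzero while the relaxed loss is exactly zero; negatives beyond the margin are free in both.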
- Research Article
1
- 10.11591/ijai.v13.i1.pp1112-1118
- Mar 1, 2024
- IAES International Journal of Artificial Intelligence (IJ-AI)
Optical character recognition (OCR) is a technology that allows computers to recognize and extract text from images or scanned documents. It is commonly used to convert printed or handwritten text into a machine-readable format. This study presents an OCR system for Kannada characters based on a Siamese neural network (SNN). The SNN, a deep neural network comprising two identical convolutional neural networks (CNNs), compares scripts and ranks them by dissimilarity: when a low dissimilarity score is found, the pair is predicted as a character match. In this work the authors use five classes of Kannada characters, which are preprocessed by grayscale conversion and converted to PGM format. These images are fed directly into the deep convolutional network, which learns from matching and non-matching image pairs using a contrastive loss function in the Siamese architecture. The proposed OCR system requires much less time and gives more accurate results than a regular CNN. The model can become a powerful tool for identification, particularly where there is a high degree of variation in writing styles or limited training data.
- Conference Article
5
- 10.1109/ecti-con51831.2021.9454677
- May 19, 2021
This paper presents a new method for fingerspelling recognition in highly dynamic video sequences. Sign language videos are labeled only at the video-sequence level. A deep learning network extracts spatial features of video frames with AlexNet and uses them to derive a language model with a Long Short-Term Memory (LSTM) network. The outputs of this network are the predicted fingerspelling gestures at the frame level. Testing video sequences recognized with 100 percent accuracy in this first pass are used to improve the spatial features of video frames: we construct a Siamese network, with ResNet-50 as the deployed backbone, from the first-pass recognition results and employ it to derive an efficient representation of each fingerspelling gesture. The derived per-frame features are fed to the LSTM network to predict fingerspelling gestures. In our experiments, the proposed method outperforms state-of-the-art fingerspelling recognition algorithms by almost four percent in recognition accuracy.
- Research Article
1
- 10.14738/tmlai.81.8020
- Apr 30, 2020
- Transactions on Machine Learning and Artificial Intelligence
Siamese networks have drawn increasing interest in the field of visual object tracking due to their balance of precision and efficiency. However, Siamese trackers use relatively shallow backbone networks, such as AlexNet, and therefore do not take full advantage of the capabilities of modern deep convolutional neural networks (CNNs). Moreover, the feature representations of the target object in a Siamese tracker are extracted through the last layer of the CNN and mainly capture semantic information, which makes the tracker's precision relatively low and causes it to drift easily in the presence of similar distractors. In this paper, a new nonpadding residual unit (NPRU) is designed and used to stack a 22-layer deep ResNet, referred to as ResNet22. Utilizing ResNet22 as the backbone network, we can build a deep Siamese network that greatly enhances tracking performance. Considering that different levels of the CNN's feature maps represent different aspects of the target object, we aggregate different deep convolutional layers to make use of ResNet22's multilevel feature maps, which form hyperfeature representations of targets. The resulting deep hyper Siamese network is named DHSiam. Experimental results show that DHSiam achieves significant improvement on multiple benchmark datasets.
- Research Article
37
- 10.1093/bib/bbab534
- Jan 18, 2022
- Briefings in Bioinformatics
Predicting the response of cancer patients to a particular treatment is a major goal of modern oncology and an important step toward personalized treatment. In clinical practice, clinicians prefer to obtain the most-suited drugs for a particular patient rather than the exact values of drug sensitivity. Instead of predicting the exact value of drug response, we propose a deep learning-based method, named the Siamese Response Deep Factorization Machines (SRDFM) Network, for personalized anti-cancer drug recommendation, which directly ranks the drugs and provides the most effective ones. A Siamese network (SN), a type of deep learning network composed of identical subnetworks that share the same architecture, parameters and weights, is used to measure the relative position (RP) between drugs for each cell line. By minimizing the difference between the real RP and the predicted RP, an optimal SN model is established that ranks all candidate drugs. Specifically, the subnetwork on each side of the SN consists of a feature-generation level and a predictor-construction level. On the feature-generation level, both drug properties and gene expression are used to build a concatenated feature vector, which even enables recommendations for newly designed drugs of which only the chemical properties are known. In particular, we developed a response unit that generates a weighted genetic feature vector to simulate the biological interaction mechanism between a specific drug and the genes. On the predictor-construction level, we integrate a factorization machine (FM) component with a deep neural network component. The FM handles the discrete chemical information well, and both low-order and high-order feature interactions can be sufficiently learned. Impressively, SRDFM works well for both single-drug recommendation and synergistic drug combinations.
Experimental results on both single-drug and synergistic-drug datasets have shown the efficiency of SRDFM. The Python implementation of the proposed SRDFM is available at https://github.com/RanSuLab/SRDFM. Contact: ran.su@tju.edu.cn, gbx@mju.edu.cn and weileyi@sdu.edu.cn.
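Once a pairwise (Siamese) model can score drugs against a cell line, consuming it for recommendation reduces to sorting candidates by score. A toy sketch, where `toy_scores` and `predict` are made-up stand-ins, not SRDFM's actual predictor:

```python
def rank_drugs(predict, drugs, cell_line):
    # Rank candidate drugs for one cell line by predicted sensitivity score,
    # which is how a pairwise ranking model is typically consumed downstream.
    return sorted(drugs, key=lambda d: predict(cell_line, d), reverse=True)

# Toy stand-in for a trained predictor (higher score = more effective).
toy_scores = {"drugA": 0.9, "drugB": 0.4, "drugC": 0.7}
predict = lambda cell_line, drug: toy_scores[drug]

ranking = rank_drugs(predict, sorted(toy_scores), "cell_line_1")
```

Here the top of `ranking` is the recommended drug, matching the abstract's goal of returning the most effective drugs rather than raw sensitivity values.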
- Research Article
576
- 10.1109/lgrs.2017.2738149
- Oct 1, 2017
- IEEE Geoscience and Remote Sensing Letters
In this letter, we propose a novel supervised change detection method based on a deep Siamese convolutional network for optical aerial images. We train the Siamese convolutional network using a weighted contrastive loss. The novelty of the method is that the Siamese network learns to extract features directly from the image pairs. Compared with the hand-crafted features used by conventional change detection methods, the extracted features are more abstract and robust. Furthermore, thanks to the weighted contrastive loss function, the features have a useful property: the feature vectors of a changed pixel pair are far from each other, while those of an unchanged pixel pair are close. We therefore use the distance between feature vectors to detect changes between the image pair. Even simple threshold segmentation of the distance map obtains good performance; for further improvement, we use a k-nearest-neighbor approach to update the initial result. Experimental results show that the proposed method produces results comparable to, and even better than, two state-of-the-art methods in terms of F-measure.
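The distance-map thresholding step can be sketched directly: compute the per-pixel L2 distance between the two branches' feature vectors and mark pixels whose distance exceeds a threshold as changed. The feature values below are toy stand-ins for the learned per-pixel features:

```python
import math

def pixel_distance(f1, f2):
    # L2 distance between the per-pixel feature vectors from the two branches.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def change_map(feats_t1, feats_t2, threshold):
    # Threshold the distance map: far-apart feature vectors mark changed pixels (1),
    # close ones unchanged (0).
    return [[1 if pixel_distance(a, b) > threshold else 0
             for a, b in zip(row1, row2)]
            for row1, row2 in zip(feats_t1, feats_t2)]

# Toy 1x2 "images" of 2-D per-pixel features: first pixel changed, second not.
t1 = [[(0.0, 0.0), (0.5, 0.5)]]
t2 = [[(2.0, 2.0), (0.5, 0.6)]]
mask = change_map(t1, t2, threshold=1.0)
```

The letter's k-nearest-neighbor refinement would then update this initial binary mask.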
- Research Article
93
- 10.1016/j.media.2019.101618
- Nov 21, 2019
- Medical Image Analysis
Regularized siamese neural network for unsupervised outlier detection on brain multiparametric magnetic resonance imaging: Application to epilepsy lesion screening.