Fish Motion Estimation Using ML-based Relative Depth Estimation and Multi-Object Tracking
Fish motion is an important indicator of the health of fish swarms in the fish farming industry. Many researchers have analyzed fish motion using special sensors or computer vision, but their results were either limited to a few robotic fish (for ground-truth reasons) or restricted to 2D space. There is therefore still a lack of methods that can accurately estimate the motion of a real fish swarm in 3D space. Here we present our Fish Motion Estimation (FME) algorithm, which combines multi-object tracking, monocular depth estimation, and a novel post-processing approach to estimate fish motion in the world coordinate system. Our results show that the estimated fish motion closely approximates the ground truth, and the achieved accuracy of 81.0% is sufficient for the use case of fish monitoring in fish farms.
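The post-processing step above lifts tracked 2D detections plus estimated depth into 3D coordinates. A minimal sketch of the underlying pinhole back-projection follows; the intrinsics and the helper name are illustrative assumptions, not the paper's actual calibration or implementation.

```python
# Hypothetical sketch: back-project a tracked fish's pixel position (u, v)
# plus an estimated metric depth into 3D camera coordinates with a pinhole
# model. fx, fy, cx, cy are assumed camera intrinsics.
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point (X, Y, Z)."""
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return (X, Y, depth)

# A detection at the principal point with estimated depth 2.0 m,
# under assumed intrinsics of a 1920x1080 camera:
point = backproject(960.0, 540.0, 2.0, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(point)  # (0.0, 0.0, 2.0)
```

Fish displacement between frames, and hence motion, can then be measured on such 3D points rather than in the 2D image plane.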
- Research Article
5
- 10.1016/j.neucom.2021.11.071
- Dec 6, 2021
- Neurocomputing
Self-supervised learning of monocular depth using quantized networks
- Conference Article
3
- 10.1109/icvrv51359.2020.00046
- Nov 1, 2020
At present, unsupervised monocular depth estimation methods suffer from low accuracy and unclear object outlines. To address this problem, we propose a jointly unsupervised learning framework for monocular depth and camera motion estimation from video sequences. Specifically, we introduce an Atrous Spatial Pyramid Pooling (ASPP) module and an attention model. The former encodes multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter enables the network to maintain object shapes and enhance edges in the depth map. Experiments on the KITTI and Cityscapes datasets show that our method effectively improves the accuracy of monocular depth estimation, mitigates boundary blur, and preserves the details of the depth map.
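The ASPP idea above can be illustrated with a toy 1D dilated (atrous) convolution: the same filter taps the signal at several dilation rates, enlarging the effective field-of-view without adding parameters. A conceptual sketch only, not the paper's network code.

```python
# Toy 1D atrous filtering: the 3-tap kernel samples the signal at
# positions spaced `rate` apart, so a larger rate sees a wider context.
def dilated_conv1d(signal, kernel, rate):
    span = (len(kernel) - 1) * rate          # effective receptive field - 1
    out = []
    for i in range(len(signal) - span):
        out.append(sum(k * signal[i + j * rate] for j, k in enumerate(kernel)))
    return out

x = [0, 1, 2, 3, 4, 5, 6, 7]
k = [1, 1, 1]                                # simple summing taps
print(dilated_conv1d(x, k, rate=1))          # neighbours 1 apart
print(dilated_conv1d(x, k, rate=2))          # neighbours 2 apart: wider view
```

An ASPP module runs several such rates in parallel and concatenates the results, which is how it captures multi-scale context.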
- Research Article
42
- 10.1109/jsen.2021.3120753
- Dec 1, 2021
- IEEE Sensors Journal
Depth estimation using monocular sensors is an important and fundamental task in computer vision. It has a wide range of applications in robot navigation, autonomous driving, etc., and has received extensive attention in recent years. Monocular depth estimation has long been based on convolutional neural networks, whose inherent convolution operation is limited in modeling long-range dependencies. Replacing convolutional neural networks with Transformers is a promising direction, but standard Transformers suffer from excessive computational complexity and parameter counts. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that uses hierarchical representation learning with linear complexity in the image size. In addition, an attention module based on multi-scale fusion strengthens the network's ability to capture global information. Our method effectively reduces the excessive parameter count of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art results on challenging indoor and outdoor datasets.
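The claimed linear complexity can be seen in the standard FLOPs estimates for global multi-head self-attention versus windowed attention (the formulas come from the Swin Transformer design; the token, channel, and window numbers below are illustrative, not Swin-Depth's actual configuration):

```python
# Global self-attention scales quadratically in the token count N, while
# window attention is linear in N for a fixed window side M.
def global_attn_cost(N, C):
    return 4 * N * C * C + 2 * N * N * C      # MSA FLOPs estimate

def window_attn_cost(N, C, M):
    return 4 * N * C * C + 2 * M * M * N * C  # W-MSA FLOPs estimate

N, C, M = 56 * 56, 96, 7                      # tokens, channels, window side
print(global_attn_cost(N, C) / window_attn_cost(N, C, M))  # > 10x cheaper
```

For high-resolution depth maps, where N grows with image area, this gap is what makes hierarchical windowed attention practical.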
- Book Chapter
- 10.1007/978-981-99-1435-7_34
- Jan 1, 2023
Deploying depth estimation models in computer vision applications such as obstacle avoidance, scene reconstruction, and camera pose estimation has become a fundamental task. Most depth images generated by existing monocular depth estimation models are blurry approximations of depth and resolution, especially in low-textured regions, while the models that produce better depth predictions generally require multiple images of the same scene as input. This paper presents a transfer learning approach with densely connected convolutional neural networks that takes only an RGB image as input for deeper, high-quality depth prediction. The proposed solution leverages an encoder-decoder architecture that extracts features from an RGB image and generates the corresponding depth map. The densely connected convolutional network with 161 layers (DenseNet-161) serves as the encoder, and the decoder consists of five upsampling blocks and one transposed convolutional layer. Evaluation results after training on the benchmark NYU V2 depth dataset (Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor Segmentation and Support Inference from RGBD Images. In: Proceedings of the European Conference on Computer Vision, pp. 746-760 (2012)) show that the proposed approach, despite an encoder-decoder architecture of lower complexity trained with fewer parameters and iterations, outperforms the existing state-of-the-art techniques: the average Root Mean Square Error (RMSE) between predicted and ground-truth depths is 0.505, lower than the RMSE values of all the compared monocular depth estimation methods.
Furthermore, the depth maps generated by the proposed model have good resolution and are minimally affected by surrounding conditions such as wall texture and illumination effects.
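The reported metric can be reproduced in form (not in value) with a small sketch of the average RMSE between predicted and ground-truth depths; the depth values here are toy numbers for illustration.

```python
import math

# Root-mean-square error between predicted and ground-truth depth values,
# the evaluation metric the chapter reports (0.505 on NYU V2).
def rmse(pred, gt):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))

pred = [1.0, 2.0, 3.0]   # predicted depths (metres), toy values
gt   = [1.0, 2.5, 2.5]   # ground-truth depths, toy values
print(round(rmse(pred, gt), 3))  # 0.408
```

In practice the average is taken per image over all valid depth pixels, then averaged across the test set.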
- Research Article
1
- 10.2478/acss-2025-0003
- Jan 1, 2025
- Applied Computer Systems
Monocular depth estimation is one of the essential tasks in computer vision, as it provides depth information from 2D images and is extremely beneficial for applications such as autonomous driving and robot navigation. Monocular depth estimation has improved significantly over the past couple of years, and deep learning-based methods have surpassed traditional and machine learning-based methods; they have been further enhanced with Transformer and hybrid approaches. This paper first discusses the sensors used for depth estimation and their limitations, then briefly traces the evolution of depth estimation, and then examines deep learning methods, including Transformer and CNN-Transformer hybrid methods, along with their limitations. We also discuss several methods addressing challenging weather conditions. Finally, we discuss the current trends, challenges, and future directions of Transformer and hybrid methods.
- Research Article
27
- 10.1016/j.eswa.2023.122194
- Oct 20, 2023
- Expert Systems with Applications
FishTrack: Multi-object tracking method for fish using spatiotemporal information fusion
- Book Chapter
- 10.1007/978-3-030-60633-6_29
- Jan 1, 2020
Monocular depth estimation methods based on deep learning have recently shown very promising results, most of which exploit deep convolutional neural networks (CNNs) with scene geometric constraints. However, the depth maps estimated by most existing methods still suffer from unclear object contours and unsmooth depth gradients. In this paper, we propose a novel encoder-decoder network, named Monocular Depth estimation with Spatio-Temporal features (MD-ST), based on recurrent convolutional neural networks for monocular video depth estimation with spatio-temporal correlation features. Specifically, we put forward a novel encoder with a convolutional long short-term memory (Conv-LSTM) structure that not only captures the spatial features of the scene but also collects temporal features from video sequences. In the decoder, we produce depth maps at four scales for multi-scale estimation to refine the outputs. Additionally, to enhance and maintain spatio-temporal consistency, we constrain our network with a flow consistency loss that penalizes the errors between estimated and ground-truth maps by learning residual flow vectors. Experiments on the KITTI dataset demonstrate that MD-ST effectively estimates scene depth maps, especially in dynamic scenes, and is superior to existing monocular depth estimation methods.
- Research Article
3
- 10.3390/rs14122906
- Jun 17, 2022
- Remote Sensing
Monocular depth estimation is a fundamental yet challenging task in computer vision, as depth information is lost when 3D scenes are mapped to 2D images. Although deep learning-based methods have led to considerable improvements for this task on single images, most existing approaches still fail to overcome this limitation. Supervised learning methods model depth estimation as a regression problem and therefore require large amounts of ground-truth depth data for training in real scenarios. Unsupervised learning methods treat depth estimation as the synthesis of a new disparity map, which means that rectified stereo image pairs are needed as the training dataset. To address these problems, we present an encoder-decoder based framework that infers depth maps from monocular video snippets in an unsupervised manner. First, we design an unsupervised learning scheme for monocular depth estimation based on the basic principles of structure from motion (SfM), which uses only adjacent video clips, rather than paired training data, as supervision. Second, our method predicts two confidence masks that improve the robustness of the depth estimation model against occlusions. Finally, we leverage a largest-scale, minimum-depth loss instead of a multi-scale average loss to improve the accuracy of depth estimation. Experimental results on the benchmark KITTI dataset show that our method outperforms competing unsupervised methods.
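A per-pixel minimum over source frames, in the spirit of the "minimum" loss above, can be sketched as follows: for each pixel, keep only the smallest photometric error over the adjacent source frames, which suppresses pixels occluded in one neighbour but visible in the other. The error values here are invented toy numbers, not real photometric residuals, and the exact loss the paper uses may differ.

```python
# Hedged sketch of a per-pixel minimum reprojection loss. Each inner list
# holds one source frame's per-pixel photometric errors against the target.
def min_reprojection_loss(errors_per_source):
    n_pix = len(errors_per_source[0])
    per_pixel_min = [min(e[i] for e in errors_per_source) for i in range(n_pix)]
    return sum(per_pixel_min) / n_pix

prev_frame_err = [0.1, 0.9, 0.2]   # pixel 1 occluded w.r.t. previous frame
next_frame_err = [0.1, 0.2, 0.8]   # pixel 2 occluded w.r.t. next frame
print(min_reprojection_loss([prev_frame_err, next_frame_err]))
```

Averaging instead of taking the minimum would let the occluded pixels' large errors dominate, which is exactly what such a loss avoids.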
- Book Chapter
95
- 10.1007/978-3-030-20893-6_19
- Jan 1, 2019
Depth estimation from a single image is a very exciting challenge in computer vision. While other image-based depth sensing techniques leverage the geometry between different viewpoints (e.g., stereo or structure from motion), the lack of these cues within a single image renders the monocular depth estimation task ill-posed. For inference, state-of-the-art encoder-decoder architectures for monocular depth estimation rely on effective feature representations learned at training time. For unsupervised training of these models, geometry has been effectively exploited through image warping losses computed from views acquired by a stereo rig or a moving camera. In this paper, we take a further step, showing that learning semantic information from images also effectively improves monocular depth estimation. In particular, by leveraging semantically labeled images together with unsupervised signals gained from geometry through an image warping loss, we propose a deep learning approach for joint semantic segmentation and depth estimation. Our overall learning framework is semi-supervised, as we deploy ground-truth data only in the semantic domain. At training time, our network learns a common feature representation for both tasks, and a novel cross-task loss function is proposed. The experimental findings show how jointly tackling depth prediction and semantic segmentation improves depth estimation accuracy. In particular, on the KITTI dataset, our network outperforms state-of-the-art methods for monocular depth estimation.
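The image warping loss above can be illustrated in 1D: sample the source "image" at positions shifted by the predicted disparity and penalise the photometric difference to the target. Nearest-neighbour sampling keeps the sketch short; real systems use differentiable bilinear sampling, and this toy is not the paper's actual loss.

```python
# 1D toy of an image-warping (photometric) loss. A correct disparity
# reconstructs the target from the source exactly, giving zero loss.
def warp_loss(target, source, disparity):
    total = 0.0
    for i, d in enumerate(disparity):
        j = min(max(i + d, 0), len(source) - 1)  # shifted, clamped index
        total += abs(target[i] - source[j])
    return total / len(target)

target = [3, 4, 5, 6]
source = [2, 3, 4, 5, 6]          # the target shifted right by one pixel
print(warp_loss(target, source, disparity=[1, 1, 1, 1]))  # 0.0
```

Minimising this loss over predicted disparities is the unsupervised geometric signal; the semi-supervised framework above adds semantic labels on top of it.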
- Conference Article
2
- 10.1117/12.2262439
- May 1, 2017
One of the challenges in evaluating multi-object video detection, tracking, and classification systems is having publicly available data sets with which to compare different systems. However, the measures of performance for tracking and classification differ: data sets suitable for evaluating tracking systems may not be appropriate for classification. Tracking video data sets typically have only ground-truth track IDs, while classification video data sets have only ground-truth class-label IDs; the former identify the same object over multiple frames, while the latter identify the type of object in individual frames. This paper describes an extension of the ground-truth meta-data for the DARPA Neovision2 Tower data set that allows the evaluation of both tracking and classification. The ground-truth data sets presented in this paper contain unique object IDs across 5 classes of object (Car, Bus, Truck, Person, Cyclist) for 24 videos of 871 image frames each. In addition to the object IDs and class labels, the ground-truth data also contain the original bounding-box coordinates together with new bounding boxes where un-annotated objects were present. The unique IDs are maintained during occlusions between multiple objects and when objects re-enter the field of view. This provides a solid foundation for evaluating the performance of multi-object tracking of different types of objects, a straightforward comparison of tracking-system performance using the standard Multi-Object Tracking (MOT) framework, and classification performance using the Neovision2 metrics. These data have been hosted publicly.
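The standard MOT-framework accuracy metric, MOTA, can be computed from per-sequence totals of false negatives, false positives, identity switches, and ground-truth objects; the counts below are toy numbers for illustration.

```python
# Multi-Object Tracking Accuracy: 1 minus the rate of the three error
# types over the total number of ground-truth objects across all frames.
def mota(fn, fp, id_switches, num_gt):
    return 1.0 - (fn + fp + id_switches) / num_gt

print(round(mota(fn=120, fp=80, id_switches=10, num_gt=2000), 3))  # 0.895
```

Because identity switches enter the numerator, a data set with persistent unique IDs through occlusions, like the one described above, is exactly what this metric requires.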
- Book Chapter
8
- 10.1007/978-3-319-98776-7_38
- Nov 5, 2018
Video multi-object tracking is an important research topic in computer vision, with wide applications in military and civilian areas. While single-object tracking algorithms are now fairly mature, research on multi-object tracking is still ongoing. This paper focuses on four important stages of the multi-object tracking process: feature extraction, detection, data association, and tracking. The feature extraction part introduces the current feature extraction methods, together with the merits and drawbacks of each; the detection part describes how object appearance models perform in specific applications, and then analyzes multi-object tracking algorithms based on detection-and-tracking as well as on deep learning; the tracking part introduces the construction of object motion models and hybrid algorithms that combine different trackers; the data association part covers multi-object tracking based on energy minimization and the commonly used data association algorithms. The paper then introduces the current mainstream datasets and evaluation methods. Finally, the future development of multi-object tracking is discussed and forecast.
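The data association stage discussed above can be sketched as greedy matching of detections to existing tracks by bounding-box IoU. Real trackers typically use the Hungarian algorithm and appearance features; this greedy variant is only an illustration of the idea, with boxes as (x1, y1, x2, y2) tuples.

```python
# Intersection-over-union of two axis-aligned boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Greedily assign each track the best unused detection above a threshold.
def associate(tracks, detections, thresh=0.3):
    pairs, used = [], set()
    for ti, t in enumerate(tracks):
        best, best_iou = None, thresh
        for di, d in enumerate(detections):
            if di not in used and iou(t, d) > best_iou:
                best, best_iou = di, iou(t, d)
        if best is not None:
            pairs.append((ti, best))
            used.add(best)
    return pairs

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets   = [(21, 19, 31, 29), (1, 1, 11, 11)]
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```

Unmatched detections would typically spawn new tracks, and tracks unmatched for several frames would be terminated.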
- Book Chapter
- 10.1007/978-981-99-2969-6_9
- Jan 1, 2023
Monocular Depth Estimation (MDE) from a single image is a challenging problem in computer vision that has been intensively investigated over the last decade using deep learning approaches. It is essential for developing cutting-edge applications such as self-driving cars and augmented reality, and the ability to perceive depth underpins various tasks, including navigation and perception. Monocular depth estimation has attracted much attention; its popularity is driven by ease of use, lower cost, ubiquity, and denser imaging compared to alternatives such as LiDAR scanners. Traditional MDE approaches rely heavily on depth cues and are subject to strict constraints, such as shape-from-focus and defocus algorithms, which require scenes and images with a low depth of field. Without particular environmental assumptions, MDE is an ill-posed problem due to the ambiguity of the mapping between depth and color-intensity measurements. Recently, Convolutional Neural Network (CNN) approaches have demonstrated encouraging results in addressing this challenge, as a CNN can learn an implicit relationship between color pixels and depth. However, the mechanism and process behind a CNN's depth inference from a single image remain relatively unknown, and in many applications interpretability is very important. To address this problem, this paper visualizes the inference of a lightweight CNN (Fast-depth) in monocular depth estimation. The proposed method is based on [1], with some modifications and additional analyses of the results on outdoor scenes. It detects the smallest set of image pixels (a mask) critical to inferring depth from a single image through an optimization problem. This small subset of image pixels can be used to find patterns and features that help us better characterize the behavior of the CNN for future monocular depth estimation tasks.
- Research Article
325
- 10.1016/j.neucom.2020.12.089
- Jan 5, 2021
- Neurocomputing
Deep learning for monocular depth estimation: A review
- Conference Article
13
- 10.1109/ivs.2018.8500683
- Jun 1, 2018
Depth estimation provides essential information for autonomous driving and driver assistance. Monocular depth estimation is especially interesting from a practical point of view, since a single camera is cheaper than many other options and avoids the continuous calibration required by stereo-vision approaches. State-of-the-art methods for monocular depth estimation are based on Convolutional Neural Networks (CNNs). A promising line of work introduces additional semantic information about the traffic scene when training CNNs for depth estimation. In practice, this means that the depth data used for CNN training is complemented with images having pixel-wise semantic labels, which are usually difficult to annotate (e.g., crowded urban images). Moreover, it has so far been common practice to assume that the same raw training data is associated with both types of ground truth, i.e., depth and semantic labels. The main contribution of this paper is to show that this hard constraint can be circumvented: we can train CNNs for depth estimation by leveraging depth and semantic information coming from heterogeneous datasets. To illustrate the benefits of our approach, we combine the KITTI depth and Cityscapes semantic segmentation datasets, outperforming state-of-the-art results on monocular depth estimation.
- Conference Article
6
- 10.1109/iros47612.2022.9982154
- Oct 23, 2022
Advances in deep learning have resulted in steady progress in computer vision with improved accuracy on tasks such as object detection and semantic segmentation. Nevertheless, deep neural networks are vulnerable to adversarial attacks, thus presenting a challenge to reliable deployment. Two of the prominent tasks in 3D scene understanding for robotics and advanced driver assistance systems are monocular depth and pose estimation, often learned together in an unsupervised manner. While studies evaluating the impact of adversarial attacks on monocular depth estimation exist, a systematic demonstration and analysis of adversarial perturbations against pose estimation are lacking. We show how additive imperceptible perturbations can not only change predictions to increase the trajectory drift but also catastrophically alter its geometry. We also study the relation between adversarial perturbations targeting monocular depth and pose estimation networks, as well as the transferability of perturbations to other networks with different architectures and losses. Our experiments show how the generated perturbations lead to notable errors in relative rotation and translation predictions and elucidate vulnerabilities of the networks.
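A generic additive perturbation of the kind studied above can be sketched with an FGSM-style step: move the input in the sign of the loss gradient by an imperceptibly small epsilon. The "gradient" below is a toy list, not a real network's gradient, and the paper's actual attack may be optimized differently.

```python
# FGSM-style additive perturbation sketch: x_adv = x + eps * sign(grad).
def fgsm_step(x, grad, eps):
    sign = lambda g: (g > 0) - (g < 0)   # -1, 0, or +1
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

x    = [0.50, 0.40, 0.60]   # toy input pixel values
grad = [0.9, -0.3, 0.0]     # toy loss gradient w.r.t. the input
adv  = fgsm_step(x, grad, eps=0.01)
print([round(v, 2) for v in adv])  # [0.51, 0.39, 0.6]
```

Fed through a depth or pose network, such a small shift can already measurably change the predicted trajectory, which is the vulnerability the paper analyzes.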