Generalized Binary Search Network for Highly-Efficient Multi-View Stereo

Abstract

Multi-view Stereo (MVS) with known camera parameters is essentially a 1D search problem within a valid depth range. Recent deep learning-based MVS methods typically sample depth hypotheses densely in the depth range and then construct prohibitively memory-consuming 3D cost volumes for depth prediction. Although coarse-to-fine sampling strategies alleviate this overhead to a certain extent, the efficiency of MVS is still an open challenge. In this work, we propose a novel method for highly efficient MVS that remarkably decreases the memory footprint while clearly advancing state-of-the-art depth prediction performance. We investigate which search strategy is reasonably optimal for MVS, taking into account both efficiency and effectiveness. We first formulate MVS as a binary search problem and accordingly propose a generalized binary search network for MVS. Specifically, in each step, the depth range is split into two bins, with one extra error-tolerance bin on each side. A classification is performed to identify which bin contains the true depth. We also design three mechanisms to handle classification errors, deal with out-of-range samples, and decrease the training memory, respectively. With the new formulation, our method samples only a very small number of depth hypotheses in each step, which is highly memory efficient and also greatly facilitates quick training convergence. Experiments on competitive benchmarks show that our method achieves state-of-the-art accuracy with much less memory. In particular, our method obtains an overall score of 0.289 on the DTU dataset and ranks first on the challenging Tanks and Temples advanced dataset among all learning-based methods. Our code will be released at https://github.com/MiZhenxing/GBi-Net.
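
The search step described in the abstract can be illustrated with a toy, single-pixel sketch. This is not the authors' implementation: the names `oracle_classifier` and `generalized_binary_search` are hypothetical, the learned classifier is replaced by an oracle that knows the true depth, and the depth values are illustrative. It assumes that each step splits the current range into two bins, adds one tolerance bin of the same width on each side, and takes the selected bin as the next range.

```python
def oracle_classifier(depth, edges):
    """Hypothetical stand-in for the network: index of the bin containing depth."""
    for k in range(len(edges) - 1):
        if edges[k] <= depth < edges[k + 1]:
            return k
    # Depth lies outside all four bins: clamp to the nearest outer bin.
    return 0 if depth < edges[0] else len(edges) - 2


def generalized_binary_search(true_depth, d_min, d_max, steps,
                              classify=oracle_classifier):
    """Toy 1D search over depth for a single pixel.

    Each step splits the current range [lo, hi] into 2 bins and adds
    1 error-tolerance bin on each side, giving 4 candidate bins.
    """
    lo, hi = float(d_min), float(d_max)
    for _ in range(steps):
        width = (hi - lo) / 2.0
        # Five edges delimit four bins: tolerance, left, right, tolerance.
        edges = [lo + (i - 1) * width for i in range(5)]
        k = classify(true_depth, edges)
        # The selected bin becomes the range for the next step.
        lo, hi = edges[k], edges[k + 1]
    return 0.5 * (lo + hi), (lo, hi)


if __name__ == "__main__":
    # Illustrative depth range and target, not values from the abstract.
    estimate, final_range = generalized_binary_search(612.5, 425.0, 935.0, steps=8)
    print(estimate, final_range)
```

In this sketch only four bins are evaluated per step, and the range width halves each time, so the resolution after `steps` iterations is the initial range divided by 2 to the power of `steps`. The tolerance bins also let the search move outside the current range, for example when the true depth lies slightly below `d_min`.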

Similar Papers
  • Research Article
  • Cited by 2
  • 10.1109/tpami.2026.3654665
Learning-Based Multi-View Stereo: A Survey.
  • Jan 1, 2026
  • IEEE transactions on pattern analysis and machine intelligence
  • Fangjinhua Wang + 7 more

3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.

  • Research Article
  • Cited by 14
  • 10.1016/j.cag.2021.04.016
Adaptive depth estimation for pyramid multi-view stereo
  • Apr 24, 2021
  • Computers & Graphics
  • Jie Liao + 4 more

  • Research Article
  • Cited by 28
  • 10.1016/j.engappai.2023.107800
SA-MVSNet: Self-attention-based multi-view stereo network for 3D reconstruction of images with weak texture
  • Jan 1, 2024
  • Engineering Applications of Artificial Intelligence
  • Ronghao Yang + 5 more

  • Conference Article
  • Cited by 369
  • 10.1109/cvpr42600.2020.00260
Deep Stereo Using Adaptive Thin Volume Representation With Uncertainty Awareness
  • Jun 1, 2020
  • Shuo Cheng + 6 more

We present Uncertainty-aware Cascaded Stereo Network (UCS-Net) for 3D reconstruction from multiple RGB images. Multi-view stereo (MVS) aims to reconstruct fine-grained scene geometry from multi-view images. Previous learning-based MVS methods estimate per-view depth using plane sweep volumes (PSVs) with a fixed depth hypothesis at each plane; this requires densely sampled planes for high accuracy, which is impractical for high-resolution depth because of limited memory. In contrast, we propose adaptive thin volumes (ATVs); in an ATV, the depth hypothesis of each plane is spatially varying, which adapts to the uncertainties of previous per-pixel depth predictions. Our UCS-Net has three stages: the first stage processes a small PSV to predict low-resolution depth; two ATVs are then used in the following stages to refine the depth with higher resolution and higher accuracy. Our ATV consists of only a small number of planes with low memory and computation costs; yet, it efficiently partitions local depth ranges within learned small uncertainty intervals. We propose to use variance-based uncertainty estimates to adaptively construct ATVs; this differentiable process leads to reasonable and fine-grained spatial partitioning. Our multi-stage framework progressively sub-divides the vast scene space with increasing depth resolution and precision, which enables reconstruction with high completeness and accuracy in a coarse-to-fine fashion. We demonstrate that our method achieves superior performance compared with other learning-based MVS methods on various challenging datasets.

  • Research Article
  • 10.61356/j.nois.2025.8646
Image Classification Using Deep Learning: A Systematic Review
  • Dec 24, 2025
  • Neutrosophic Optimization and Intelligent Systems
  • Maher Khalaf Hussein + 2 more

Image classification is a fundamental problem in the field of computer vision and involves assigning a label to an image based on its content. In this paper, we survey both traditional machine learning-based and deep learning-based image classification methods. Specifically, we review two deep learning-based image classification methods: a convolutional neural network (CNN) method and a pre-trained CNN-based transfer learning method. We first briefly review traditional machine learning-based image classification methods and then deep learning-based image classification methods, discussing both the feature extraction methods and the classification methods used in the deep learning-based methods. We discuss several deep neural network architectures for image classification, including LeNet, AlexNet, VGGNet, GoogLeNet, ResNet, and DenseNet. We then discuss the applications of image classification, compare the different methods on various factors both qualitatively and quantitatively, and present our experimental results, while also summarizing various methods of data augmentation, batch normalization, and regularization by dropout. We introduce techniques for practical training of deep networks and discuss fine-tuning, pruning, and model quantization for efficient inference of trained neural networks. Finally, we introduce and discuss the runtime engines for deploying trained neural networks for efficient inference.

  • Research Article
  • Cited by 34
  • 10.1109/tip.2023.3272170
NR-MVSNet: Learning Multi-view Stereo based on Normal Consistency and Depth Refinement.
  • Jan 1, 2023
  • IEEE Transactions on Image Processing
  • Jingliang Li + 4 more

Multi-view Stereo (MVS) aims to reconstruct a 3D point cloud model from multiple views. In recent years, learning-based MVS methods have received a lot of attention and achieved excellent performance compared with traditional methods. However, these methods still have apparent shortcomings, such as the accumulative error in the coarse-to-fine strategy and the inaccurate depth hypotheses based on the uniform sampling strategy. In this paper, we propose the NR-MVSNet, a coarse-to-fine structure with the depth hypotheses based on the normal consistency (DHNC) module, and the depth refinement with reliable attention (DRRA) module. Specifically, we design the DHNC module to generate more effective depth hypotheses, which collects the depth hypotheses from neighboring pixels with the same normals. As a result, the predicted depth can be smoother and more accurate, especially in texture-less and repetitive-texture regions. On the other hand, we update the initial depth map in the coarse stage by the DRRA module, which can combine attentional reference features and cost volume features to improve the depth estimation accuracy in the coarse stage and address the accumulative error problem. Finally, we conduct a series of experiments on the DTU, BlendedMVS, Tanks & Temples, and ETH3D datasets. The experimental results demonstrate the efficiency and robustness of our NR-MVSNet compared with the state-of-the-art methods. Our implementation is available at https://github.com/wdkyh/NR-MVSNet.

  • Research Article
  • Cited by 5
  • 10.1016/j.cag.2023.08.014
AdaptMVSNet: Efficient Multi-View Stereo with adaptive convolution and attention fusion
  • Aug 9, 2023
  • Computers & Graphics
  • Pengfei Jiang + 4 more

  • Conference Article
  • Cited by 862
  • 10.1109/cvpr42600.2020.00257
Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching
  • Jun 1, 2020
  • Xiaodong Gu + 5 more

Deep multi-view stereo (MVS) and stereo matching approaches generally construct 3D cost volumes to regularize and regress the output depth or disparity. These methods are limited when high-resolution outputs are needed, since the memory and time costs grow cubically as the volume resolution increases. In this paper, we propose a cost volume formulation that is both memory- and time-efficient and is complementary to existing multi-view stereo and stereo matching approaches based on 3D cost volumes. First, the proposed cost volume is built upon a standard feature pyramid encoding geometry and context at gradually finer scales. Then, we narrow the depth (or disparity) range of each stage using the depth (or disparity) map from the previous stage. With gradually higher cost volume resolution and adaptive adjustment of depth (or disparity) intervals, the output is recovered in a coarse-to-fine manner. We apply the cascade cost volume to the representative MVS-Net and obtain a 35.6% improvement on the DTU benchmark (1st place), with 50.6% and 59.3% reductions in GPU memory and run-time, respectively. It is also the state-of-the-art learning-based method on the Tanks and Temples benchmark. The statistics of accuracy, run-time and GPU memory on other representative stereo CNNs also validate the effectiveness of our proposed method. Our source code is available at https://github.com/alibaba/cascade-stereo.

  • Conference Article
  • Cited by 66
  • 10.1109/icmla.2019.00105
A Comparative Analysis of Traditional and Deep Learning-Based Anomaly Detection Methods for Streaming Data
  • Dec 1, 2019
  • Mohsin Munir + 3 more

With the Internet of Things (IoT) devices becoming an integral part of human life, the need for robust anomaly detection in streaming data has also been elevated. Dozens of distance-based, density-based, kernel-based, and cluster-based algorithms have been proposed in the area of anomaly detection. Recently, because of the robustness of the deep neural networks (DNN), different deep learning-based anomaly detection methods have also been proposed. With all these rapid developments, there exists a small number of comparative studies for anomaly detection methods. Even in those studies, the comparison is done only in typical anomaly detection settings without taking the streaming data into consideration. The presence of intrinsic time-series characteristics like trend, seasonality, and change-point makes it important to study the behavior of commonly used anomaly detection methods on streaming data. Moreover, the comparison of traditional methods with deep learning-based methods also brings exciting insights about the data which are generally overlooked by traditional methods. In this study, we compare 13 anomaly detection methods on two commonly used streaming data sets. We used four different evaluation metrics to evaluate the methods from different perspectives. Our analysis reveals that the deep learning-based anomaly detection methods are superior to traditional anomaly detection methods.

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-19-6068-0_34
A Review of the Detection of Pulmonary Embolism from Computed Tomography Images Using Deep Learning Methods
  • Nov 23, 2022
  • Manas Pratim Das + 1 more

Medical imaging has been evolving at a steady pace, generating enormous amounts of health data, and the use of deep learning (DL) has helped a great deal in processing the detailed data. Deep learning-based methods are used in different medical imaging tasks to detect and diagnose diseases. For example, medical imaging is used to diagnose pulmonary embolism (PE), a commonly occurring cardiovascular disease with high mortality and prevalence and a low diagnosis rate. According to medical experts, PE has resulted in many deaths because of missed diagnoses of the condition. Another critical aspect of the disease is the possibility of permanent lung damage if left untreated. The use of deep learning methods in medical imaging is attributed to their ability to process enormous amounts of data. However, there are some unique challenges in the detection of PE. PE is not specific in its clinical presentation and is easily overlooked, making it difficult to diagnose. Compared with manual diagnosis, deep learning-based detection methods help a great deal in detecting the disease in miniature sub-branches of the alveoli and in images with noisy artifacts.

  • Research Article
  • Cited by 5
  • 10.3390/rs14081854
Exploiting Graph and Geodesic Distance Constraint for Deep Learning-Based Visual Odometry
  • Apr 12, 2022
  • Remote Sensing
  • Xu Fang + 4 more

Visual odometry is the task of estimating the trajectory of a moving agent from consecutive images. It is a hot research topic in both the robotics and computer vision communities and facilitates many applications, such as autonomous driving and virtual reality. Conventional odometry methods predict the trajectory by utilizing the multiple view geometry between consecutive overlapping images. However, these methods need to be carefully designed and fine-tuned to work well in different environments. Deep learning has been explored to alleviate this challenge by directly predicting the relative pose from paired images. Deep learning-based methods usually focus on consecutive images, which makes them prone to propagating error over time. In this paper, graph loss and geodesic rotation loss are proposed to enhance deep learning-based visual odometry methods based on graph constraints and geodesic distance, respectively. The graph loss considers not only the relative pose loss of consecutive images, but also the relative pose of non-consecutive images. The relative pose of non-consecutive images is not directly predicted but computed from the relative poses of consecutive ones. The geodesic rotation loss is constructed from the geodesic distance, and the model regresses an element of the Lie algebra so(3) (a 3D vector). This allows robust and stable convergence. To increase efficiency, a random strategy is adopted to select the edges of the graph instead of using all of the edges. This strategy provides additional regularization for training the networks. Extensive experiments are conducted on visual odometry benchmarks, and the obtained results demonstrate that the proposed method has comparable performance to other supervised learning-based methods, as well as monocular camera-based methods. The source code and the weights are made publicly available.

  • Conference Article
  • Cited by 42
  • 10.1109/cvpr52688.2022.00847
PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo
  • Jun 1, 2022
  • Jiachen Liu + 6 more

We present a novel framework named PlaneMVS for 3D plane reconstruction from multiple input views with known camera poses. Most previous learning-based plane reconstruction methods reconstruct 3D planes from single images; they rely heavily on single-view regression and suffer from depth scale ambiguity. In contrast, we reconstruct 3D planes with a multi-view-stereo (MVS) pipeline that takes advantage of multi-view geometry. We decouple plane reconstruction into a semantic plane detection branch and a plane MVS branch. The semantic plane detection branch is based on a single-view plane detection framework but with differences. The plane MVS branch replaces conventional depth hypotheses with a set of slanted plane hypotheses to perform plane sweeping, and finally learns pixel-level plane parameters and the corresponding planar depth map. We present how the two branches are learned in a balanced way, and propose a soft-pooling loss to associate the outputs of the two branches and make them benefit from each other. Extensive experiments on various indoor datasets show that PlaneMVS significantly outperforms state-of-the-art (SOTA) single-view plane reconstruction methods on both plane detection and 3D geometry metrics. Our method even outperforms a set of SOTA learning-based MVS methods thanks to the learned plane priors. To the best of our knowledge, this is the first work on 3D plane reconstruction within an end-to-end MVS framework.

  • Research Article
  • Cited by 29
  • 10.1016/j.patcog.2023.109885
ARAI-MVSNet: A multi-view stereo depth estimation network with adaptive depth range and depth interval
  • Aug 14, 2023
  • Pattern Recognition
  • Song Zhang + 5 more

  • Research Article
  • Cited by 2
  • 10.1016/j.ins.2023.120056
Using outlier elimination to assess learning-based correspondence matching methods
  • Jan 2, 2024
  • Information Sciences
  • Xintao Ding + 4 more

Recently, deep learning (DL) technology has been widely used in correspondence matching. Learning-based models are usually trained on benign image pairs with partial overlaps. Since DL models are usually data-dependent, non-overlapping images may be used as poison samples to fool the model and produce false registrations. In this study, we propose an outlier elimination-based assessment method (OEAM) to assess the registrations of learning-based correspondence matching methods on partially overlapping and non-overlapping images. OEAM first eliminates outliers based on spatial paradox. Then OEAM implements registration assessment in two streams using the obtained core correspondence set. If the cardinality of the core set is sufficiently small, the input registration is assessed as a low-quality registration. Otherwise, it is assessed to be of high quality, and OEAM improves its registration performance using the core set. OEAM is a post-processing technique applied to learning-based methods. The comparison experiments are implemented on outdoor (YFCC100M) and indoor (SUN3D) datasets using four deep learning-based methods. The experimental results on registrations of partially overlapping images show that OEAM can reliably infer low-quality registrations and improve performance on high-quality registrations. The experiments on registrations of non-overlapping images demonstrate that learning-based methods are vulnerable to poisoning attacks launched by non-overlapping images, and that OEAM is robust against poisoning attacks crafted by non-overlapping images.

  • Research Article
  • Cited by 279
  • 10.1039/d3sc04185a
PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences.
  • Jan 1, 2024
  • Chemical Science
  • Martin Buttenschoen + 2 more

The last few years have seen the development of numerous deep learning-based protein-ligand docking methods. They offer huge promise in terms of speed and accuracy. However, despite claims of state-of-the-art performance in terms of crystallographic root-mean-square deviation (RMSD), upon closer inspection, it has become apparent that they often produce physically implausible molecular structures. It is therefore not sufficient to evaluate these methods solely by RMSD to a native binding mode. It is vital, particularly for deep learning-based methods, that they are also evaluated on steric and energetic criteria. We present PoseBusters, a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. The PoseBusters test suite validates chemical and geometric consistency of a ligand including its stereochemistry, and the physical plausibility of intra- and intermolecular measurements such as the planarity of aromatic rings, standard bond lengths, and protein-ligand clashes. Only methods that both pass these checks and predict native-like binding modes should be classed as having "state-of-the-art" performance. We use PoseBusters to compare five deep learning-based docking methods (DeepDock, DiffDock, EquiBind, TankBind, and Uni-Mol) and two well-established standard docking methods (AutoDock Vina and CCDC Gold) with and without an additional post-prediction energy minimisation step using a molecular mechanics force field. We show that both in terms of physical plausibility and the ability to generalise to examples that are distinct from the training data, no deep learning-based method yet outperforms classical docking tools. In addition, we find that molecular mechanics force fields contain docking-relevant physics missing from deep-learning methods. 
PoseBusters allows practitioners to assess docking and molecular generation methods and may inspire new inductive biases still required to improve deep learning-based methods, which will help drive the development of more accurate and more realistic predictions.
