First Encounter with ReLU Networks
This chapter starts by introducing the key concepts attached to neural networks, such as architecture, weights, biases, and activation functions. It proceeds with the specific choice of the rectified linear unit (ReLU) as the activation function. In this case, neural networks generate continuous piecewise linear (CPwL) functions. It is then shown that, in the univariate setting, any CPwL function can be generated by a shallow ReLU network (see the sketch below). This is no longer true in the multivariate setting, for which it is nonetheless shown that any CPwL function can be generated by a deep ReLU network.
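As an illustration of the univariate claim, here is a minimal sketch (assuming NumPy; the function names and breakpoint parameterization are ours, not the chapter's) building a one-hidden-layer ReLU network that reproduces a given CPwL function exactly, via f(x) = f(t_1) + s_0 (x - t_1) + Σ_i c_i ReLU(x - t_i), where the t_i are breakpoints, s_0 the leftmost slope, and c_i the slope change at t_i.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_relu_from_cpwl(breakpoints, slopes, value_at_first_breakpoint):
    """Return a one-hidden-layer ReLU network reproducing the CPwL function
    with breakpoints t_1 < ... < t_m and the m+1 per-piece slopes."""
    t = np.asarray(breakpoints, dtype=float)
    s = np.asarray(slopes, dtype=float)
    c = np.diff(s)                               # slope change at each breakpoint
    w0 = s[0]                                    # slope of the leftmost piece
    b = value_at_first_breakpoint - w0 * t[0]    # affine part matches f left of t_1
    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        hidden = relu(x[:, None] - t[None, :])   # one hidden ReLU unit per breakpoint
        return b + w0 * x + hidden @ c
    return f

# Example: the "hat" function (flat, then slope 1, then slope -1, then flat again).
hat = shallow_relu_from_cpwl(breakpoints=[0.0, 1.0, 2.0],
                             slopes=[0.0, 1.0, -1.0, 0.0],
                             value_at_first_breakpoint=0.0)
print(hat([-1.0, 0.5, 1.0, 1.5, 3.0]))           # [0.  0.5 1.  0.5 0. ]
```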
- Research Article
136
- 10.1016/j.compchemeng.2019.106580
- Sep 23, 2019
- Computers & Chemical Engineering
ReLU networks as surrogate models in mixed-integer linear programs
- Research Article
24
- 10.1162/neco_a_01316
- Sep 18, 2020
- Neural Computation
This letter proves that a ReLU network can approximate any continuous function with arbitrary precision by means of piecewise linear or constant approximations. For a univariate function, a composite of ReLUs is used to produce a line segment; all of the subnetworks of line segments together comprise a ReLU network, which is a piecewise linear approximation of the target function. For a multivariate function, ReLU networks are constructed to approximate a piecewise linear function derived from triangulation methods approximating the target. A neural unit called TRLU is designed from a ReLU network; piecewise constant approximations, such as Haar wavelets, are implemented by rectifying the linear output of a ReLU network via TRLUs. New interpretations of deep layers, as well as some other results, are also presented.
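One way to see how composing ReLUs yields a bounded line segment (a hedged NumPy illustration, not necessarily the letter's exact construction): clip(x, 0, c) = c - ReLU(c - ReLU(x)) rises linearly on [0, c] and is constant outside it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def segment(x, c):
    # clip(x, 0, c): slope 1 on [0, c], constant outside - a single "segment".
    return c - relu(c - relu(x))

xs = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(segment(xs, 1.0))                          # [0.  0.  0.5 1.  1. ]
```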
- Research Article
5
- 10.1109/tit.2023.3240360
- Jun 1, 2023
- IEEE Transactions on Information Theory
We deal with two complementary questions about approximation properties of ReLU networks. First, we study how the uniform quantization of ReLU networks with real-valued weights impacts their approximation properties. We establish an upper bound on the minimal number of bits per coordinate needed for uniformly quantized ReLU networks to keep the same polynomial asymptotic approximation speeds as unquantized ones. We also characterize the error of nearest-neighbour uniform quantization of ReLU networks. This is achieved using a new lower bound on the Lipschitz constant of the map that associates the parameters of ReLU networks to their realization, and an upper bound generalizing classical results. Second, we investigate when ReLU networks can be expected, or not, to have better approximation properties than other classical approximation families. Indeed, several approximation families share the following common limitation: their polynomial asymptotic approximation speed of any set is bounded from above by the encoding speed of this set. We introduce a new abstract property of approximation families, called ∞-encodability, which implies this upper bound. Many classical approximation families, defined with dictionaries or ReLU networks, are shown to be ∞-encodable. This unifies and generalizes several situations where this upper bound is known.
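For concreteness, a minimal sketch (NumPy; the grid range and bit budget are illustrative choices, not the paper's) of the nearest-neighbour uniform quantization analysed in this abstract: each real-valued parameter is rounded to the nearest point of a uniform grid with 2^bits levels.

```python
import numpy as np

def uniform_quantize(weights, bits, max_abs=None):
    """Round each parameter to the nearest point of a uniform grid with
    2**bits levels spanning [-max_abs, max_abs]."""
    w = np.asarray(weights, dtype=float)
    if max_abs is None:
        max_abs = np.max(np.abs(w))
    step = 2.0 * max_abs / (2 ** bits - 1)       # grid spacing
    return np.clip(np.round(w / step) * step, -max_abs, max_abs)

w = np.array([0.73, -1.20, 0.05])
print(uniform_quantize(w, bits=4))               # nearest grid points, 4 bits per coordinate
```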
- Research Article
76
- 10.1016/j.matpur.2021.07.009
- Jul 16, 2021
- Journal de mathématiques pures et appliquées
Optimal approximation rate of ReLU networks in terms of width and depth
- Research Article
33
- 10.1016/j.neucom.2021.01.007
- Jan 12, 2021
- Neurocomputing
Optimal function approximation with ReLU neural networks
- Conference Article
4
- 10.1109/sampta45681.2019.9030992
- Jul 1, 2019
We discuss the expressive power of neural networks which use the non-smooth ReLU activation function ϱ(x) = max{0, x} by analyzing the approximation theoretic properties of such networks. The existing results mainly fall into two categories: approximation using ReLU networks with a fixed depth, or using ReLU networks whose depth increases with the approximation accuracy. After reviewing these findings, we show that the results concerning networks with fixed depth—which up to now only consider approximation in Lp(λ) for the Lebesgue measure λ—can be generalized to approximation in Lp(µ), for any finite Borel measure µ. In particular, the generalized results apply in the usual setting of statistical learning theory, where one is interested in approximation in L2(ℙ), with the probability measure ℙ describing the distribution of the data.
- Research Article
37
- 10.1016/j.neucom.2023.127174
- Dec 28, 2023
- Neurocomputing
Convergence of deep ReLU networks
- Research Article
2
- 10.1016/j.patrec.2020.06.025
- Jun 28, 2020
- Pattern Recognition Letters
If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks
- Conference Article
11
- 10.4230/lipics.itcs.2021.63
- Jan 1, 2021
- DROPS (Schloss Dagstuhl – Leibniz Center for Informatics)
The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size n), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [Zhang et al., 2019; Cai et al., 2019], yielding an O(mn²)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width m. We show how to speed up the algorithm of [Cai et al., 2019], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an ℓ₂-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of M, allowing a sufficiently good approximate solution to be found via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra - which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) - can be carried over to the realm of deep learning as well.
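A hedged numerical sketch (NumPy only; dimensions, the dense Gaussian sketch, and the function names are illustrative and not the authors' algorithm) of the core idea: treat the Gauss-Newton step min_d ||J d - r||₂ as an ℓ₂-regression, precondition it with a random sketch of J, and solve by conjugate gradient.

```python
import numpy as np

def sketched_gauss_newton_step(J, r, sketch_rows=None, cg_iters=50, seed=0):
    """Approximately solve min_d ||J d - r||_2 (the Gauss-Newton step) by
    sketch-and-precondition least squares plus conjugate gradient."""
    n, m = J.shape
    rng = np.random.default_rng(seed)
    k = sketch_rows if sketch_rows is not None else min(n, 4 * m)
    S = rng.standard_normal((k, n)) / np.sqrt(k)    # dense JL-style row sketch
    _, R = np.linalg.qr(S @ J)                      # R from the sketch preconditions J
    def apply_A(y):                                 # A = R^-T J^T J R^-1 (SPD, well conditioned)
        z = np.linalg.solve(R, y)
        return np.linalg.solve(R.T, J.T @ (J @ z))
    b = np.linalg.solve(R.T, J.T @ r)
    y = np.zeros(m)
    res = b - apply_A(y)
    p = res.copy()
    for _ in range(cg_iters):                       # plain conjugate gradient
        Ap = apply_A(p)
        alpha = (res @ res) / (p @ Ap)
        y = y + alpha * p
        new_res = res - alpha * Ap
        if np.linalg.norm(new_res) < 1e-10:
            break
        beta = (new_res @ new_res) / (res @ res)
        p = new_res + beta * p
        res = new_res
    return np.linalg.solve(R, y)                    # Gauss-Newton direction d

# Tiny check against the direct least-squares solution.
rng = np.random.default_rng(1)
J = rng.standard_normal((200, 10))
r = rng.standard_normal(200)
d = sketched_gauss_newton_step(J, r)
print(np.allclose(d, np.linalg.lstsq(J, r, rcond=None)[0], atol=1e-6))   # True
```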
- Conference Article
6
- 10.1109/spcom50965.2020.9179594
- Jul 1, 2020
Recently, it has been shown that for compressive sensing, significantly fewer measurements may be required if the sparsity assumption is replaced by the assumption that the unknown vector lies near the range of a suitably-chosen generative model. In particular, in (Bora et al., 2017) it was shown that roughly O(k log L) random Gaussian measurements suffice for accurate recovery when the generative model is an L-Lipschitz function with bounded k-dimensional inputs, and O(kd log w) measurements suffice when the generative model is a k-input ReLU network with depth d and width w. In this paper, we establish corresponding algorithm-independent lower bounds on the sample complexity using tools from minimax statistical analysis. In accordance with the above upper bounds, our results are summarized as follows: (i) We construct an L-Lipschitz generative model capable of generating group-sparse signals, and show that the resulting necessary number of measurements is Ω(k log L); (ii) Using similar ideas, we construct ReLU networks with high depth and/or high width for which the necessary number of measurements scales as Ω(kd log w / log n) (with output dimension n), and in some cases Ω(kd log w). As a result, we establish that the scaling laws derived in (Bora et al., 2017) are optimal or near-optimal in the absence of further assumptions.
- Book Chapter
- 10.1007/978-3-031-06773-0_16
- Jan 1, 2022
Deep neural networks often lack the safety and robustness guarantees needed to be deployed in safety critical systems. Formal verification techniques can be used to prove input-output safety properties of networks, but when properties are difficult to specify, we rely on the solution to various optimization problems. In this work, we present an algorithm called ZoPE that solves optimization problems over the output of feedforward ReLU networks with low-dimensional inputs. The algorithm eagerly splits the input space, bounding the objective using zonotope propagation at each step, and improves computational efficiency compared to existing mixed-integer programming approaches. We demonstrate how to formulate and solve three types of optimization problems: (i) minimization of any convex function over the output space, (ii) minimization of a convex function over the output of two networks in series with an adversarial perturbation in the layer between them, and (iii) maximization of the difference in output between two networks. Using ZoPE, we observe a 25× speedup on property 1 of the ACAS Xu neural network verification benchmark compared to several state-of-the-art verifiers, and an 85× speedup on a set of linear optimization problems compared to a mixed-integer programming baseline. We demonstrate the versatility of the optimizer in analyzing networks by projecting onto the range of a generative adversarial network and visualizing the differences between a compressed and uncompressed network.
Keywords: Neural network verification, Global optimization, Convex optimization, Safety critical systems
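ZoPE's zonotope propagation is more precise than what follows; as a simplified, hedged stand-in (NumPy; the two-layer network is made up for illustration), the sketch below propagates plain interval bounds through affine and ReLU layers, which already yields sound, if looser, output bounds for a box of inputs.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)    # split by sign for soundness
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def propagate(lo, hi, layers):
    for W, b in layers[:-1]:
        lo, hi = relu_bounds(*affine_bounds(lo, hi, W, b))
    return affine_bounds(lo, hi, *layers[-1])           # last layer is affine only

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 2)), rng.standard_normal(8)),
          (rng.standard_normal((1, 8)), rng.standard_normal(1))]
lo, hi = propagate(np.array([-0.1, -0.1]), np.array([0.1, 0.1]), layers)
print(lo, hi)                                           # sound bounds on the network output
```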
- Book Chapter
6
- 10.1007/978-3-031-90643-5_17
- Jan 1, 2025
Branch-and-bound (BaB) is among the most effective techniques for neural network (NN) verification. However, existing works on BaB for NN verification have mostly focused on NNs with piecewise linear activations, especially ReLU networks. In this paper, we develop a general framework, named GenBaB, to conduct BaB on general nonlinearities to verify NNs with general architectures, based on linear bound propagation for NN verification. To decide which neuron to branch, we design a new branching heuristic which leverages linear bounds as shortcuts to efficiently estimate the potential improvement after branching. To decide nontrivial branching points for general nonlinear functions, we propose to pre-optimize branching points, which can be efficiently leveraged during verification with a lookup table. We demonstrate the effectiveness of our GenBaB on verifying a wide range of NNs, including NNs with activation functions such as Sigmoid, Tanh, Sine and GeLU, as well as NNs involving multi-dimensional nonlinear operations such as multiplications in LSTMs and Vision Transformers. Our framework also allows the verification of general nonlinear computation graphs and enables verification applications beyond simple NNs, particularly for AC Optimal Power Flow (ACOPF). GenBaB is part of the latest α,β-CROWN (https://github.com/Verified-Intelligence/alpha-beta-CROWN), the winner of the 4th and the 5th International Verification of Neural Networks Competition (VNN-COMP 2023 and 2024). Code for reproducing the experiments is available at https://github.com/shizhouxing/GenBaB. Appendices can be found at http://arxiv.org/abs/2405.21063.
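As a much-simplified, hedged sketch of branch-and-bound bounding in this spirit (GenBaB itself uses linear bound propagation and pre-optimized branching points; here plain interval bounds and midpoint splits stand in, and the tiny one-input sigmoid network is made up), the code below certifies a lower bound on a network's scalar output over an input interval.

```python
import heapq
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W1, b1 = np.array([2.0, -3.0]), np.array([0.5, 1.0])    # toy 1-input, 2-unit network
w2, b2 = np.array([1.5, -2.0]), 0.3

def net(x):                                              # scalar input, scalar output
    return w2 @ sigmoid(W1 * x + b1) + b2

def output_bounds(lo, hi):                               # sound interval propagation
    pre_lo = np.minimum(W1 * lo, W1 * hi) + b1
    pre_hi = np.maximum(W1 * lo, W1 * hi) + b1
    h_lo, h_hi = sigmoid(pre_lo), sigmoid(pre_hi)        # sigmoid is monotone
    w2p, w2n = np.maximum(w2, 0.0), np.minimum(w2, 0.0)
    return w2p @ h_lo + w2n @ h_hi + b2, w2p @ h_hi + w2n @ h_lo + b2

def bab_min(lo, hi, tol=1e-4):
    best_upper = min(net(lo), net(hi))                   # concrete upper bound on the minimum
    certified = np.inf                                    # lowest bound among pruned subdomains
    heap = [(output_bounds(lo, hi)[0], lo, hi)]
    while heap and heap[0][0] < best_upper - tol:
        _, a, b = heapq.heappop(heap)
        mid = 0.5 * (a + b)
        best_upper = min(best_upper, net(mid))
        for aa, bb in ((a, mid), (mid, b)):              # branch on the input interval
            sub_lb = output_bounds(aa, bb)[0]
            if sub_lb < best_upper - tol:
                heapq.heappush(heap, (sub_lb, aa, bb))   # keep refining this subdomain
            else:
                certified = min(certified, sub_lb)       # subdomain is already verified
    return min([lb for lb, _, _ in heap] + [certified])  # certified lower bound on the minimum

print(bab_min(-1.0, 1.0))
```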
- Research Article
506
- 10.1016/j.neunet.2018.08.019
- Sep 7, 2018
- Neural Networks
Optimal approximation of piecewise smooth functions using deep ReLU neural networks
- Research Article
9
- 10.3389/frai.2021.642374
- Dec 23, 2021
- Frontiers in Artificial Intelligence
The ability of deep neural networks to form powerful emergent representations of complex statistical patterns in data is as remarkable as it is imperfectly understood. For deep ReLU networks, these are encoded in the mixed discrete–continuous structure of linear weight matrices and non-linear binary activations. Our article develops a new technique for instrumenting such networks to efficiently record activation statistics, such as information content (entropy) and similarity of patterns, in real-world training runs. We then study the evolution of activation patterns during training for networks of different architecture using different training and initialization strategies. As a result, we observe both general and architecture-related behavioral patterns: in particular, most architectures form structure bottom-up, with the exception of highly tuned state-of-the-art architectures and methods (PyramidNet and FixUp), where layers appear to converge more simultaneously. We also observe intermediate dips in entropy in conventional CNNs that are not visible in residual networks. A reference implementation is provided under a free license.
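A small sketch (NumPy; the toy layer, data, and helper names are ours) of the core bookkeeping such instrumentation needs: record the binary ReLU activation pattern of a layer over a batch of inputs and estimate the empirical entropy of the observed patterns. The actual study tracks these statistics per layer across training steps.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((6, 4)), rng.standard_normal(6)   # one toy ReLU layer

def activation_patterns(X):
    return (X @ W.T + b > 0).astype(np.uint8)           # one binary pattern per input

def pattern_entropy(patterns):
    counts = Counter(map(bytes, patterns))               # hashable key per observed pattern
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())                # empirical entropy in bits

X = rng.standard_normal((1000, 4))
print(pattern_entropy(activation_patterns(X)))           # at most 6 bits for 6 units
```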
- Research Article
8
- 10.1016/j.cam.2023.115551
- Sep 20, 2023
- Journal of Computational and Applied Mathematics
SignReLU neural network and its approximation ability