Stochastic Gradient Descent (SGD) is a fundamental optimization technique in machine learning because of its efficiency on large-scale data. Unlike typical treatments of SGD, which rely on stochastic approximations, this work studies the convergence properties of SGD from a deterministic perspective. We address the choice of learning rate, a common obstacle to good SGD performance, particularly in complex environments. Traditional methods often state convergence results in terms of statistical expectations, which rest on assumptions that are usually not justified in practice; our approach instead introduces universally applicable learning rates. These rates ensure that a model trained with SGD asymptotically matches the performance of the best linear filter, for any data sequence length and without statistical assumptions about the data. By establishing learning rates that scale as μ = O(1/t), we remove the need for prior knowledge of the data, a prevalent limitation in real-world applications. In this way, we provide a robust framework for applying SGD across varied settings, with convergence guarantees that hold in both deterministic and stochastic scenarios without statistical assumptions on the data.
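As a rough illustration of the setting described above, the following sketch runs online SGD on squared error with a step size μ_t = c/t and compares its average loss with that of the best fixed linear filter chosen in hindsight. The squared-error loss, the constant `c`, and the function names are assumptions made for illustration only; this is not the authors' algorithm or their exact rate.

```python
import numpy as np


def sgd_linear_filter(X, y, c=1.0):
    """Online SGD for a linear filter with step size mu_t = c / t.

    Assumptions (not from the source): squared-error loss, zero
    initial weights, and a hypothetical constant c in the schedule.
    Returns the final weights and the per-step squared errors.
    """
    n, d = X.shape
    w = np.zeros(d)
    losses = np.empty(n)
    for t in range(1, n + 1):
        x_t, y_t = X[t - 1], y[t - 1]
        err = y_t - w @ x_t        # prediction error before the update
        losses[t - 1] = err ** 2
        mu_t = c / t               # learning rate decaying as O(1/t)
        w = w + mu_t * err * x_t   # gradient step on 0.5 * err**2
    return w, losses


def best_linear_filter_loss(X, y):
    """Mean squared error of the best fixed linear filter in hindsight."""
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ w_star) ** 2))
```

Comparing `losses.mean()` with `best_linear_filter_loss(X, y)` on a bounded sequence gives an empirical view of the gap that the abstract says vanishes asymptotically. How quickly it shrinks in this sketch depends on the assumed constant `c` and on the data.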