Abstract
The learning process of machine learning consists of finding the values of unknown weights in a cost function by minimizing that cost function over the learning data. However, since the cost function is not convex, it is difficult to find its minimum value. Existing methods for finding the minimum usually rely on the first derivative of the cost function. When a local minimum (but not the global minimum) is reached, the first derivative of the cost function becomes zero, so these methods return the local minimum value and the desired global minimum cannot be found. To overcome this problem, in this paper we modify one of the existing schemes—the adaptive momentum estimation (Adam) scheme—by adding a new term that prevents the new optimizer from remaining at a local minimum. The convergence condition and convergence value of the proposed scheme are also analyzed, and further illustrated through several numerical experiments whose cost functions are non-convex.
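To illustrate the difficulty described above, consider a minimal sketch (not taken from the paper; the cost function, starting point, and step size are chosen here only for illustration) in which plain gradient descent stalls at a local minimum of a one-dimensional non-convex cost, because the first derivative vanishes there:

def f(x):
    return x**4 - 3 * x**2 + x      # non-convex: local minimum near x ≈ 1.13, global minimum near x ≈ -1.30

def grad_f(x):
    return 4 * x**3 - 6 * x + 1     # first derivative of the cost

x = 2.0                              # initial guess lies in the basin of the local minimum
lr = 0.01                            # learning rate (step size)
for _ in range(2000):
    x -= lr * grad_f(x)              # gradient descent step; stops moving once grad_f(x) ≈ 0

print(x, f(x))                       # ends near the local minimum x ≈ 1.13, not the global minimum x ≈ -1.30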
Highlights
Deep learning is a part of a broader family of machine learning methods [1–10] based on learning data representations, as opposed to task-specific algorithms
We introduce an enhanced optimization scheme, based on the popular adaptive momentum estimation (Adam) scheme, for non-convex problems arising from the machine learning process
Most existing optimizers may become stuck at a local minimum of a non-convex problem when they reach it before reaching a global minimum
Summary
Deep learning is a part of a broader family of machine learning methods [1–10] based on learning data representations, as opposed to task-specific algorithms. A machine finds appropriate weight values from the data by introducing a cost function. There are several optimization schemes [11–25] that can be used to find the weights by minimizing the cost function, such as the gradient descent (GD) method [26]. The adaptive momentum estimation (Adam) scheme [27,28] is the most popular scheme based on GD. Adam computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. The Adam method has been widely used; it is easy to implement, computationally efficient, and works quite well in most cases.
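For reference, a minimal sketch of the standard Adam update described above, built from estimates of the first and second moments of the gradients (hyperparameter values follow the common defaults; the function and variable names are ours, not the paper's):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # exponential moving averages of the gradient (first moment)
    # and of the squared gradient (second moment)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # bias correction for the zero initialization of m and v (t starts at 1)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # per-parameter adaptive step
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

The proposed scheme adds an extra term to this update so that the iterates do not remain at a local minimum; its exact form and convergence analysis are given in the paper.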