On the Momentum-based Methods for Training and Designing Deep Neural Networks
Tan Nguyen
Dr. Tan Nguyen is currently a postdoctoral scholar in the Department of Mathematics at the University of California, Los Angeles, working with Dr. Stanley J. Osher. Tan obtained his Ph.D. in Machine Learning from Rice University, where he was advised by Dr. Richard G. Baraniuk. His research focuses on the intersection of Deep Learning, Probabilistic Modeling, Optimization, and ODEs/PDEs. Tan gave an invited talk at the Deep Learning Theory Workshop at NeurIPS 2018 and organized the 1st Workshop on Integration of Deep Neural Models and Differential Equations at ICLR 2020. He also completed two long internships with Amazon AI and NVIDIA Research, during which he worked with Dr. Anima Anandkumar. Tan is the recipient of the prestigious Computing Innovation Postdoctoral Fellowship (CIFellows) from the Computing Research Association (CRA), the NSF Graduate Research Fellowship, and the IGERT Neuroengineering Traineeship. Tan received his MSEE and BSEE from Rice in May 2018 and May 2014, respectively.
Training and designing deep neural networks (DNNs) is an art that often involves an expensive search over candidate optimization algorithms and network architectures. We develop novel momentum-based methods to speed up the training of DNNs and to facilitate the process of designing them.
For training DNNs, we propose Scheduled Restart Stochastic Gradient Descent (SRSGD), a new Nesterov accelerated gradient (NAG)-style scheme. SRSGD replaces the constant momentum in SGD with the increasing momentum of NAG, but stabilizes the iterations by resetting the momentum to zero according to a schedule. We demonstrate, both theoretically and empirically, that SRSGD significantly improves convergence and generalization in training DNNs. Furthermore, SRSGD reaches similar or even better error rates with significantly fewer training epochs than the SGD baseline.
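To make the update concrete, the sketch below implements a NAG-style iteration whose momentum coefficient grows like (t - 1)/(t + 2) and is reset to zero every few iterations. This is a minimal Python/NumPy illustration, not the authors' implementation: the function name, the fixed restart frequency, and the hyperparameter values are illustrative assumptions (in the paper the restart frequency itself follows a schedule over the course of training).

```python
import numpy as np

def srsgd_sketch(grad_fn, x0, lr=0.1, restart_freq=40, n_iters=200):
    """Illustrative SRSGD-style loop: NAG iterations with scheduled momentum restarts.

    grad_fn(x) should return a (possibly stochastic) gradient at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    v_prev = x.copy()
    t = 1                               # iteration counter within the current restart cycle
    for _ in range(n_iters):
        v = x - lr * grad_fn(x)         # gradient step
        mu = (t - 1.0) / (t + 2.0)      # increasing Nesterov momentum coefficient
        x = v + mu * (v - v_prev)       # momentum (look-ahead) step
        v_prev = v
        t += 1
        if t > restart_freq:            # scheduled restart: momentum is reset to zero
            t = 1
    return x

# Example use on a simple quadratic, f(x) = 0.5 * x^T A x:
A = np.diag([1.0, 10.0])
x_star = srsgd_sketch(lambda x: A @ x, x0=np.array([5.0, 5.0]))
```

The restart is what keeps the otherwise ever-growing NAG momentum from destabilizing the stochastic iterations.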
For designing DNNs, we focus on recurrent neural networks (RNNs) and establish a connection between the hidden-state dynamics of an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called MomentumRNNs. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing-gradient issue in training RNNs. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in state-of-the-art orthogonal RNNs. Finally, we show that other advanced momentum-based optimization methods, such as Adam and NAG with restarts, can easily be incorporated into the MomentumRNN framework to design new recurrent cells with even better performance.
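As a concrete illustration of such a momentum cell, the PyTorch sketch below adds a heavy-ball-style velocity to the input-driven term of a vanilla tanh recurrent cell: a velocity v_t accumulates W x_t with momentum mu and step size s, and the hidden state is updated using this velocity instead of W x_t directly. This is a minimal sketch under that assumed formulation, not the released MomentumRNN code; the class name and the default values of mu and s are illustrative.

```python
import torch
import torch.nn as nn

class MomentumRNNCellSketch(nn.Module):
    """Illustrative momentum recurrent cell:
        v_t = mu * v_{t-1} + s * (W x_t)
        h_t = tanh(U h_{t-1} + v_t + b)
    """
    def __init__(self, input_size, hidden_size, mu=0.6, s=0.6):
        super().__init__()
        self.Wx = nn.Linear(input_size, hidden_size, bias=False)  # input map W
        self.Uh = nn.Linear(hidden_size, hidden_size)             # recurrent map U (with bias b)
        self.mu, self.s = mu, s

    def forward(self, x_t, state):
        h_prev, v_prev = state
        v_t = self.mu * v_prev + self.s * self.Wx(x_t)   # momentum on the input-driven term
        h_t = torch.tanh(self.Uh(h_prev) + v_t)          # standard recurrent nonlinearity
        return h_t, (h_t, v_t)

# Example: unroll the cell over a random sequence (length 20, batch 4, input dim 10).
cell = MomentumRNNCellSketch(input_size=10, hidden_size=32)
h = v = torch.zeros(4, 32)
for x_t in torch.randn(20, 4, 10):
    out, (h, v) = cell(x_t, (h, v))
```

The same velocity-based recurrence can, in principle, be attached to other cells (e.g., LSTM or orthogonal-RNN updates), which is how the framework extends to the recurrent architectures mentioned above.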
References:
Wang, B., Nguyen, T. M. (co-first author), Bertozzi, A. L., Baraniuk, R. G., & Osher, S. J. (2020). Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent. arXiv preprint arXiv:2002.10583. (Accepted at DeepMath 2020)
Nguyen, T. M., Baraniuk, R. G., Bertozzi, A. L., Osher, S. J., & Wang, B. (2020). MomentumRNN: Integrating Momentum into Recurrent Neural Networks. arXiv preprint arXiv:2006.06919. (Accepted at NeurIPS 2020)