School of Mathematical Science, Zhejiang Normal University, Jinhua 321004, China
Abstract
We consider a fundamental problem in machine learning, structural risk minimization, in which the objective is the average of a large number of smooth component functions plus a simple convex (but possibly non-smooth) function. In this paper, we propose a novel proximal variance-reducing stochastic method building on Point-SAGA. Our method performs two proximal operator calculations per iteration by incorporating fast Douglas–Rachford splitting, and follows the scheme of the FISTA algorithm in the choice of momentum factors. We show that the objective function value converges sublinearly when each loss function is convex and smooth. In addition, we prove that our method achieves a linear convergence rate for strongly convex and smooth loss functions. Experiments demonstrate the effectiveness of the proposed algorithm, with particularly good acceleration when the loss function is ill-conditioned.
References
[1] Z. Yuan, Y. Lu, and Y. Xue, DroidDetector: Android malware characterization and detection using deep learning, Tsinghua Science and Technology, vol. 21, no. 1, pp. 114–123, 2016.
[2] Y. Sun, Z. Dou, Y. Li, and S. Wang, Improving semantic part features for person re-identification with supervised non-local similarity, Tsinghua Science and Technology, vol. 25, no. 5, pp. 636–646, 2020.
[3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY, USA: Springer, 2009.
[4] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Proc. 26th Int. Conf. Neural Information Processing Systems, Lake Tahoe, NV, USA, 2013, pp. 315–323.
[5] A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 1646–1654.
[6] O. Shamir and T. Zhang, Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes, in Proc. 30th Int. Conf. Machine Learning, Atlanta, GA, USA, 2013, pp. 71–79.
[7] S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in Proc. 31st Int. Conf. Machine Learning, Beijing, China, 2014, pp. I-64–I-72.
[8] A. Defazio, A simple practical accelerated method for finite sums, in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 676–684.
[9] H. Lin, J. Mairal, and Z. Harchaoui, Catalyst acceleration for first-order convex optimization: From theory to practice, J. Mach. Learn. Res., vol. 18, no. 1, pp. 7854–7907, 2017.
[10] K. Zhou, F. Shang, and J. Cheng, A simple stochastic variance reduced algorithm with fast convergence rates, in Proc. 35th Int. Conf. Machine Learning, Stockholm, Sweden, 2018, pp. 5980–5989.
[11] P. Patrinos, L. Stella, and A. Bemporad, Douglas–Rachford splitting: Complexity estimates and accelerated variants, in Proc. 53rd IEEE Conf. Decision and Control, Los Angeles, CA, USA, 2014, pp. 4234–4239.
[12] C. Lemaréchal and C. Sagastizábal, Practical aspects of the Moreau–Yosida regularization: Theoretical preliminaries, SIAM J. Optim., vol. 7, no. 2, pp. 367–385, 1997.
[13] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams, Variance reduced stochastic gradient descent with neighbors, in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 2305–2313.
[14] H. Luo, X. Bai, G. Lim, and J. Peng, New global algorithms for quadratic programming with a few negative eigenvalues based on alternative direction method and convex relaxation, Math. Prog. Comp., vol. 11, no. 1, pp. 119–171, 2019.
[15] H. Luo, X. Ding, J. Peng, R. Jiang, and D. Li, Complexity results and effective algorithms for worst-case linear optimization under uncertainties, Informs J. Comput., vol. 33, no. 1, pp. 180–197, 2021.
J. Lei, Y. Zhang, and Z. Zhang, A variance reducing stochastic proximal method with acceleration techniques, Tsinghua Science and Technology, vol. 28, no. 6, pp. 999–1008, 2023. https://doi.org/10.26599/TST.2022.9010051
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
ADProx-SAGA algorithm
The ADProx-SAGA algorithm is presented in Algorithm 1. The algorithm maintains five sequences of iteration points. The initial point is chosen arbitrarily, each entry of the gradient table is initialized with a gradient/subgradient of the corresponding loss function at the initial point, and, in addition to the learning rate, the algorithm introduces a momentum factor.
In the k-th iteration, a loss function is chosen uniformly at random. The first variable is updated according to Eq. (4), and the next iterate is obtained using Eq. (5). According to the definition in Eq. (3) and the update in Eq. (6), the remaining quantities can be treated as intermediate variables in the update from one iterate to the next. Combining Eqs. (2) and (3), the main steps of Algorithm 1 can also be written as a single update whose search direction is an unbiased estimate of the full gradient, similar to the gradient update formula in SAGA. The gradient table is designed to reduce the computational effort compared to SVRG, and the remaining term is the gradient mapping of the sampled loss at the current point.
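To make the estimator concrete, the following is a minimal scalar sketch of a SAGA-style unbiased gradient estimate with a gradient table; the function names and the list-based table are illustrative, not the paper's implementation.

```python
def saga_gradient_estimate(grad_f, x, table, j):
    """SAGA-style unbiased estimate of the full gradient at x (scalar sketch).

    grad_f(i, x) returns the gradient of the i-th loss at x; `table` is a
    list holding the most recently computed gradient of each loss.
    """
    table_mean = sum(table) / len(table)   # average of the stored gradients
    g_new = grad_f(j, x)                   # fresh gradient of the sampled loss j
    # Over a uniformly random j, g_new - table[j] + table_mean has
    # expectation equal to the full gradient at x.
    estimate = g_new - table[j] + table_mean
    table[j] = g_new                       # refresh the table entry for loss j
    return estimate
```

Because only one fresh gradient is computed per iteration, the per-step cost matches plain SGD, while the table term drives the variance of the estimate to zero as the iterates converge.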
In each iteration of our algorithm, unlike Prox-SAGA, which uses only one proximal operator to handle the non-smooth term, we borrow the idea of Douglas–Rachford splitting: besides the proximal operator of the non-smooth regularizer, the proximal operator of the sampled loss function is used to compute the gradient mapping. Each iteration therefore performs two proximal calculations, and this design achieves fast convergence when the loss function is ill-conditioned.
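For reference, the textbook Douglas–Rachford iteration for min f(x) + g(x) also evaluates one proximal operator per term in each step. The sketch below, with illustrative scalar proximal maps, shows this standard form of the splitting, not the combined stochastic update of Algorithm 1.

```python
import math

def douglas_rachford(prox_f, prox_g, z0, n_iter=200):
    """Textbook Douglas–Rachford splitting for min_x f(x) + g(x)."""
    z = z0
    for _ in range(n_iter):
        x = prox_f(z)            # proximal step on f
        y = prox_g(2 * x - z)    # proximal step on g at the reflected point
        z = z + y - x            # fixed-point update of the driving sequence
    return prox_f(z)             # the x-sequence converges to a minimizer

# Example: f(x) = 0.5*(x - 3)^2 and g(x) = |x|, both with step size 1
prox_f = lambda v: (v + 3.0) / 2.0                        # prox of f
prox_g = lambda v: math.copysign(max(abs(v) - 1.0, 0.0), v)  # soft threshold
x_star = douglas_rachford(prox_f, prox_g, 0.0)            # converges to 2.0
```

In this example the optimality condition x - 3 + sign(x) = 0 gives the minimizer x = 2, and the iteration contracts to it linearly.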
In particular, our algorithm differs from Prox-SAGA in the definition of the gradient. In Prox-SAGA, the stored gradient is evaluated at the current iterate, while in our algorithm it is a subgradient evaluated at the point produced by the proximal operator. Point-SAGA achieves its acceleration over SAGA by involving the "future" point returned by the proximal step; combining Eq. (5) with this definition shows that Algorithm 1 also involves "future" points, with the two proximal operators combined by using DR splitting. In addition, on top of combining the DR splitting operators, adding momentum terms to the iteration points yields faster convergence than Prox-SAGA and Point-SAGA.
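The role of the "future" point can be illustrated with a scalar sketch of a Point-SAGA-style step. By the optimality condition of the proximal operator, (z - x_next) / gamma is exactly a (sub)gradient of the sampled loss at the new point x_next, so the table is refreshed with a gradient at the future point rather than the current one. The names below are illustrative, not the paper's notation.

```python
def point_saga_style_step(prox_fj, x, table, j, gamma):
    """One Point-SAGA-style proximal step (scalar sketch).

    prox_fj is the proximal operator of the sampled loss f_j with step
    gamma; `table` stores one (sub)gradient per loss.
    """
    table_mean = sum(table) / len(table)
    z = x + gamma * (table[j] - table_mean)   # variance-reduced inner point
    x_next = prox_fj(z)                       # proximal map to the "future" point
    # Prox optimality: f_j'(x_next) + (x_next - z) / gamma = 0
    table[j] = (z - x_next) / gamma           # subgradient of f_j at x_next
    return x_next
```

This is why no explicit gradient oracle is needed for the loss: the proximal evaluation itself yields the subgradient at the future point.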
We wish to speed up the proximal point gradient algorithm, so we apply Nesterov's momentum at the iteration points. For the momentum factor, Algorithm 1 follows the choice used in the FISTA algorithm. However, FISTA is not guaranteed to be a descent algorithm, so it is necessary to check the function value at each iterate; this verification step is essential in the later theoretical proofs.
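The FISTA momentum factor referred to above follows the standard recursion t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2, with extrapolation weight (t_k - 1) / t_{k+1} applied to the difference of consecutive iterates; a minimal sketch:

```python
import math

def fista_momentum(t):
    """One update of the FISTA momentum factor.

    Returns the next factor t_next and the extrapolation weight
    beta = (t - 1) / t_next used to combine consecutive iterates.
    """
    t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
    beta = (t - 1.0) / t_next
    return t_next, beta
```

Starting from t = 1 the weight is zero at the first step and then grows toward 1, which is what produces the accelerated sublinear rate in the convex case.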
In this paper, two cases, strongly convex and non-strongly convex, are considered for the objective function, and different momentum factors are selected for the two cases. These parameters are unknown for most problems, yet ADProx-SAGA works well in practice.
In this section, we have shown that ADProx-SAGA is a combination of Point-SAGA and the DR splitting operators: it generalizes Point-SAGA while establishing a connection to fast Douglas–Rachford splitting.
Fig. 1 Performance of algorithms.