
Decentralized convex optimization on time-varying networks with application to Wasserstein barycenters


Abstract

Inspired by recent advances in distributed algorithms for approximating Wasserstein barycenters, we propose a novel distributed algorithm for this problem. The main novelty is that we consider time-varying computational networks, motivated, for example, by settings in which only a subset of sensors makes an observation at each time step, and yet the goal is to average signals (e.g., satellite pictures of some area) by approximating their barycenter. We embed this problem into a class of non-smooth dual-friendly distributed optimization problems over time-varying networks and develop a first-order method for this class. We prove non-asymptotic convergence rates, accelerated in the sense of Nesterov, and explicitly characterize their dependence on the parameters of the network and its dynamics. In the experiments, we demonstrate the efficiency of the proposed algorithm when applied to the Wasserstein barycenter problem.


Notes

  1. e.g., one can take \(f_i^{\gamma }(x) = f_i(x) + \frac{\gamma }{2}\Vert x\Vert ^2_2\).

References

  • Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. SIAM J Math Anal 43(2):904–924

  • del Barrio E, Giné E, Matrán C (1999) Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann Probab 27(2):1009–1071

  • Bigot J, Cazelles E, Papadakis N (2019) Data-driven regularization of Wasserstein barycenters with an application to multivariate density registration. Inf Inference J IMA 8(4):719–755

  • Bishop AN, Doucet A (2021) Network consensus in the Wasserstein metric space of probability measures. SIAM J Control Optim 59(5):3261–3277

  • Boissard E, Le Gouic T, Loubes J-M (2015) Distribution's template estimate with Wasserstein metrics. Bernoulli 21(2):740–759

  • Cuturi M, Peyré G (2015) A smoothed dual approach for variational Wasserstein problems. SIAM J Imag Sci 9(1):320–343

  • Cuturi M, Doucet A (2014) Fast computation of Wasserstein barycenters. In: International Conference on Machine Learning, PMLR, pp 685–693

  • Devolder O, Glineur F, Nesterov Y (2012) Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J Optim 22(2):702–727

  • Dvinskikh D (2021) Decentralized algorithms for Wasserstein barycenters. PhD thesis, Humboldt-Universität zu Berlin (Germany)

  • Dvinskikh D, Gorbunov E, Gasnikov A, Dvurechensky P, Uribe CA (2019) On primal and dual approaches for distributed stochastic convex optimization over networks. In: 2019 IEEE 58th Conference on Decision and Control (CDC), IEEE, pp 7435–7440

  • Dvinskikh D, Tiapkin D (2021) Improved complexity bounds in Wasserstein barycenter problem. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, PMLR, pp 1738–1746

  • Dvurechenskii P, Dvinskikh D, Gasnikov A, Uribe C, Nedich A (2018) Decentralize and randomize: faster algorithm for Wasserstein barycenters. In: Advances in Neural Information Processing Systems 31

  • Flamary R, Courty N, Gramfort A, Alaya MZ, Boisbunon A, Chambon S, Chapel L, Corenflos A, Fatras K, Fournier N, Gautheron L, Gayraud NTH, Janati H, Rakotomamonjy A, Redko I, Rolet A, Schutz A, Seguy V, Sutherland DJ, Tavenard R, Tong A, Vayer T (2021) POT: Python Optimal Transport. J Mach Learn Res 22(78):1–8

  • Gasnikov AV, Gasnikova E, Nesterov YE, Chernov A (2016) Efficient numerical methods for entropy-linear programming problems. Comput Math Math Phys 56(4):514–524

  • Gorbunov E, Rogozin A, Beznosikov A, Dvinskikh D, Gasnikov A (2022) Recent theoretical advances in decentralized distributed convex optimization. In: Nikeghbali A, Pardalos PM, Raigorodskii AM, Rassias MT (eds) High-Dimensional Optimization and Probability. Springer, Cham, pp 253–325. https://doi.org/10.1007/978-3-031-00832-0_8

  • Kantorovich L (1942) On the translocation of masses. C R (Doklady) Acad Sci URSS (NS) 37:199–201

  • Kovalev D, Gasanov E, Gasnikov A, Richtárik P (2021) Lower bounds and optimal algorithms for smooth and strongly convex decentralized optimization over time-varying networks. In: Advances in Neural Information Processing Systems 34

  • Kovalev D, Shulgin E, Richtárik P, Rogozin A, Gasnikov A (2021) ADOM: accelerated decentralized optimization method for time-varying networks. In: International Conference on Machine Learning, PMLR, pp 5784–5793

  • Krawtschenko R, Uribe CA, Gasnikov A, Dvurechensky P (2020) Distributed optimization with quantization for computing Wasserstein barycenters. arXiv preprint arXiv:2010.14325

  • Kroshnin A, Tupitsa N, Dvinskikh D, Dvurechensky P, Gasnikov A, Uribe C (2019) On the complexity of approximating Wasserstein barycenters. In: International Conference on Machine Learning, PMLR, pp 3530–3540

  • LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

  • Lemaréchal C, Sagastizábal C (1997) Practical aspects of the Moreau–Yosida regularization: theoretical preliminaries. SIAM J Optim 7(2):367–385

  • Li H, Lin Z (2021) Accelerated gradient tracking over time-varying graphs for decentralized optimization. arXiv preprint arXiv:2104.02596

  • Monge G (1781) Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris

  • Peyré G, Cuturi M (2019) Computational optimal transport: with applications to data science. Found Trends Mach Learn 11(5–6):355–607

  • Rabin J, Peyré G, Delon J, Bernot M (2011) Wasserstein barycenter and its application to texture mixing. In: International Conference on Scale Space and Variational Methods in Computer Vision, Springer, pp 435–446

  • Rockafellar RT (1997) Convex analysis, vol 11. Princeton University Press, Princeton

  • Rogozin A, Beznosikov A, Dvinskikh D, Kovalev D, Dvurechensky P, Gasnikov A (2021) Decentralized distributed optimization for saddle point problems. arXiv preprint arXiv:2102.07758

  • Rogozin A, Bochko M, Dvurechensky P, Gasnikov A, Lukoshkin V (2021) An accelerated method for decentralized distributed stochastic optimization over time-varying graphs. In: 2021 60th IEEE Conference on Decision and Control (CDC), pp 3367–3373. https://doi.org/10.1109/CDC45484.2021.9683110

  • Staib M, Claici S, Solomon JM, Jegelka S (2017) Parallel streaming Wasserstein barycenters. In: Advances in Neural Information Processing Systems 30, Curran Associates, pp 2647–2658. http://papers.nips.cc/paper/6858-parallel-streaming-wasserstein-barycenters.pdf

  • Uribe CA, Dvinskikh D, Dvurechensky P, Gasnikov A, Nedić A (2018) Distributed computation of Wasserstein barycenters over networks. In: 2018 IEEE Conference on Decision and Control (CDC), IEEE, pp 6544–6549

  • Uribe CA, Lee S, Gasnikov A, Nedić A (2020) A dual approach for optimal algorithms in distributed optimization over networks. In: 2020 Information Theory and Applications Workshop (ITA), IEEE, pp 1–37

  • Villani C (2009) Optimal transport: old and new, vol 338. Springer, Cham

  • Wu X, Lu J (2019) Fenchel dual gradient methods for distributed convex optimization over time-varying networks. IEEE Trans Autom Control 64(11):4629–4636. https://doi.org/10.1109/TAC.2019.2901829


Acknowledgements

The authors are grateful to Alexander Rogozin for his feedback on the manuscript. This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.

Author information

Authors and Affiliations

Authors

Contributions

OY carried out the research and wrote all sections except the introduction and the fourth one, MP performed all the numerical experiments, AG proposed the main ideas and wrote Section 4, PD supervised the research and wrote the introduction, DK advised on possible methods. All authors reviewed the manuscript.

Corresponding author

Correspondence to Olga Yufereva.

Ethics declarations

Competing interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A ADOM and its assumptions

The state-of-the-art method for decentralized optimization over time-varying networks, called ADOM, was developed in Kovalev et al. (2021), and this subsection presents its main objects. ADOM imposes natural restrictions on the class of admissible problems; the Wasserstein barycenter problem, for example, lies beyond its requirements. We therefore modify ADOM to solve more general constrained optimization problems. For the sake of consistency, we slightly change the original notation and restate below the results from Kovalev et al. (2021).

In Kovalev et al. (2021), the optimization problem with the consensus condition is

$$\begin{aligned}{} & {} \min \limits _{\textbf{x}\in {\mathcal {R}}} H(\textbf{x}) = \min \limits _{\textbf{x}\in {\mathcal {R}}} \sum \limits _{i=1}^{m} h_{i}([\textbf{x}]_i),\nonumber \\{} & {} \quad \text {where } {\mathcal {R}}= \left\{ \textbf{x}=([\textbf{x}]_1,\ldots ,[\textbf{x}]_m)\in ({\mathbb {R}}^d)^m \mid [\textbf{x}]_1 = \ldots =[\textbf{x}]_m\right\} , \end{aligned}$$
(10)

where functions \(h_i:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) are assumed to be smooth and strongly convex. Problem (10) is equivalent to the following:

$$\begin{aligned}{} & {} \min \limits _{\textbf{z}\in {\mathcal {R}}^{\perp }}H^*(\textbf{z}), \nonumber \\{} & {} \quad \text {where } {\mathcal {R}}^{\perp }= \left\{ \textbf{z}=([\textbf{z}]_1,\ldots ,[\textbf{z}]_m)\in ({\mathbb {R}}^d)^m \;\Big |\; \sum \limits _{i=1}^m [\textbf{z}]_i = 0\right\} , \end{aligned}$$
(11)

where \(H^*\) is the Fenchel conjugate of the function H and \({\mathcal {R}}^{\perp }\) is the orthogonal complement of \({\mathcal {R}}\); the equivalence holds since \(S = {\mathbb {R}}^d\) here.

Theorem 3

(Kovalev et al. 2021, Theorem 1) Let functions \(h_i:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be L smooth and \(\mu\) strongly convex, \(\textbf{x}^*\) be the solution of the optimization problem (10), \(\textbf{W}_n\) be a communication matrix at the n-th iteration satisfying Assumption 1. Set parameters \(\alpha , \eta , \theta , \sigma ,\tau\) of Algorithm 2 to \(\alpha = \frac{1}{2\,L}\), \(\eta = \frac{2\lambda _{\min }^{+}\sqrt{\mu L}}{7\lambda _{\max }}\), \(\theta = \frac{\mu }{\lambda _{\max }}\), \(\sigma = \frac{1}{\lambda _{\max }}\), and \(\tau = \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\). Then there exists \(C>0\), such that for Fenchel conjugate function \(H^*(\textbf{z})\) from (11)

$$\begin{aligned} \left\| \nabla H^*(\textbf{z}_g^n) - \textbf{x}^*\right\| ^2_2 \le C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{\mu }{L}}\right) ^n. \end{aligned}$$
(12)

Remark 2

Examining the details of the proof of Theorem 1 in Kovalev et al. (2021), we see that there is a particular choice of the constant C, namely

$$\begin{aligned} C = \max \left\{ \frac{2\tau }{\mu ^2}, \frac{\tau (1-\tau )L}{\eta (1-\eta \alpha )\mu ^2} \right\} = \frac{1}{\mu ^2}\max \left\{ \frac{2\lambda _{\min }^{+}\sqrt{\mu }}{7\lambda _{\max }\sqrt{L}}, \frac{1}{2} \right\} = \frac{1}{2\mu ^2}. \end{aligned}$$
(13)

The last equality holds since \(\frac{2\lambda _{\min }^{+}\sqrt{\mu }}{7\lambda _{\max }\sqrt{L}}\le \frac{2}{7}<\frac{1}{2}\). It means that reaching accuracy \(\varepsilon\) actually requires \(n = {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\sqrt{\frac{L}{\mu }}\ln \frac{1}{\mu ^2\varepsilon }\right)\) iterations.
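For concreteness, the parameter choice of Theorem 3 is easy to wire up as a helper. Below is a minimal Python sketch; the function name and signature are ours, not from Kovalev et al. (2021), and `L`, `mu`, `lam_max`, `lam_min_plus` stand for the smoothness constant, the strong convexity constant, and the spectral bounds \(\lambda _{\max }\), \(\lambda _{\min }^{+}\) of the communication matrices from Assumption 1.

```python
import math

def adom_parameters(L, mu, lam_max, lam_min_plus):
    """Step-size parameters of Algorithm 2 (ADOM) as set in Theorem 3."""
    alpha = 1.0 / (2.0 * L)
    eta = 2.0 * lam_min_plus * math.sqrt(mu * L) / (7.0 * lam_max)
    theta = mu / lam_max
    sigma = 1.0 / lam_max
    tau = (lam_min_plus / (7.0 * lam_max)) * math.sqrt(mu / L)
    return {"alpha": alpha, "eta": eta, "theta": theta,
            "sigma": sigma, "tau": tau}
```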

Algorithm 2 ADOM: Accelerated Decentralized Optimization Method

B Proof of Theorem 1

All the arguments below are carried out under the assumptions of Theorem 1, i.e. we assume that \(S\subset {\mathbb {R}}^d\) is a convex set, \(\textbf{x}\in {\mathcal {S}}\) is equivalent to \([\textbf{x}]_i\in S\) for all \(i=1,\ldots ,m\), the functions \(f^{\gamma }_i:S\rightarrow {\mathbb {R}}\) are \(\gamma\) strongly convex, and the output of Algorithm 1 is \(\textbf{x}^n_{r,\gamma } = \nabla (H^{r,\gamma })^*(\textbf{z}_g^n)\). Denote also

$$\begin{aligned} \textbf{x}_{\gamma }^* = (x^*_{\gamma },\ldots ,x^*_{\gamma }) = \mathop {\mathrm {arg\,min}}\limits \limits _{\textbf{x}\in {\mathcal {S}}} F^{\gamma }(\textbf{x}) = \mathop {\mathrm {arg\,min}}\limits \limits _{x\in S} \sum \limits _{i=1}^mf_i^{\gamma }(x). \end{aligned}$$

1.1 B.1 Derivation of \((H^{r,\gamma })^*\)

In brief, in this subsection we show that functions \(h_i^{r,\gamma }\) from (14) are \(\frac{1}{r}\) smooth, \(\frac{\gamma }{1+r\gamma }\) strongly convex, and such that \(\nabla (H^{r,\gamma })^*\) from Line 3 of Algorithm 1 is the gradient of the conjugate function \((H^{r,\gamma })^*\) of \(H^{r,\gamma } = \sum \nolimits _{i=1}^m h_i^{r,\gamma }\) from (14). Then the consensus condition (4) becomes a corollary of Theorem 3 with \(L = \frac{1}{r}\) and \(\mu = \frac{\gamma }{1+r\gamma }\).

From now on let functions \(h_i^{r,\gamma }:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) and \(H^{r,\gamma }:({\mathbb {R}}^d)^m \rightarrow {\mathbb {R}}\) be

$$\begin{aligned} \begin{array}{cc} H^{r,\gamma }(\textbf{x}) = \sum \limits _{i=1}^{m} h_i^{r,\gamma }([\textbf{x}]_i), \ \text{ where } \\ h_i^{r,\gamma }(x) = \inf \limits _{y\in S}\left\{ f_i^{\gamma }(y) + \frac{1}{2r}\Vert y-x\Vert ^2_2\right\} . \end{array} \end{aligned}$$
(14)

Denote their Fenchel conjugates by \((h_i^{r,\gamma })^*\) and \((H^{r,\gamma })^*\).

Lemma 1

If functions \(h_i^{r,\gamma }\) and \(H^{r,\gamma }\) are defined by (14), then their Fenchel conjugate functions \((h_i^{r,\gamma })^*\) and \((H^{r,\gamma })^*:({\mathbb {R}}^d)^m \rightarrow {\mathbb {R}}\) are

$$\begin{aligned} \begin{array}{cc} (H^{r,\gamma })^*(\textbf{z}) = \sum \limits _{i=1}^{m} (h_i^{r,\gamma })^*([\textbf{z}]_i), \ \text{ where } \\ (h_i^{r,\gamma })^*(z) = (f_i^{\gamma })^*(z) + \frac{r}{2}\Vert z\Vert ^2_2. \end{array} \end{aligned}$$

Moreover, the biconjugate \((H^{r,\gamma })^{**}\) coincides with \(H^{r,\gamma }\).

Proof

The definition (14) is similar to the Moreau–Yosida smoothing, but the subtle point is that the functions \(f_i^{\gamma }\) are defined on a convex set S instead of the whole \({\mathbb {R}}^d\). Let us introduce functions \({\tilde{f}}_i^{\gamma }\) with domain \({\mathbb {R}}^d\) as follows:

$$\begin{aligned} {\tilde{f}}_i^{\gamma }(x)=\left\{ \begin{array}{cc} &{}f_i^{\gamma }(x) \quad \hbox { if}\ x\in S\\ &{} +\infty , \quad \text{ otherwise }. \end{array}\right. \end{aligned}$$
(15)

Such \({\tilde{f}}_i^{\gamma }\) are \(\gamma\) strongly convex as well. Moreover, substituting \({\tilde{f}}_i^{\gamma }\) for \(f_i^{\gamma }\) affects neither the primal \(h_i^{r,\gamma }\):

$$\begin{aligned} h_i^{r,\gamma }(x) = \inf \limits _{y\in S}\left\{ f_i^{\gamma }(y) + \frac{1}{2r}\Vert y-x\Vert ^2_2\right\} = \inf \limits _{y\in {\mathbb {R}}^d}\left\{ {\tilde{f}}_i^{\gamma }(y) + \frac{1}{2r}\Vert y-x\Vert ^2_2\right\} ,\end{aligned}$$

nor \((f_i^{\gamma })^*(z) + \frac{r}{2}\Vert z\Vert ^2_2\):

$$\begin{aligned} (f^{\gamma }_i)^*(z) + \frac{r}{2}\Vert z\Vert ^2_2&=\max \limits _{x \in S}\left\{ \left\langle z, x\right\rangle - f^{\gamma }_i(x)\right\} + \frac{r}{2}\Vert z\Vert ^2_2 \\&=\max \limits _{x \in {\mathbb {R}}^d}\left\{ \left\langle z, x\right\rangle - {\tilde{f}}^{\gamma }_i(x)\right\} + \frac{r}{2}\Vert z\Vert ^2_2 \\&=({\tilde{f}}^{\gamma }_i)^*(z)+ \frac{r}{2}\Vert z\Vert ^2_2. \end{aligned}$$

For each i one can see that \((h_i^{r,\gamma })^*(z) = (f^{\gamma }_i)^*(z) + \frac{r}{2}\Vert z\Vert ^2_2\) is the Fenchel conjugate of \(h_i^{r,\gamma }\) and vice versa. Indeed, for proper, convex and lower semicontinuous \(g_1, g_2:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) we have \((g_1+ g_2)^* = g_1^* \square g_2^*\) and \((g_1 \square g_2)^* = g_1^* + g_2^*\), where \((g_1\square g_2)(x)\) denotes the infimal convolution \(\inf \{g_1(y) + g_2(x-y) \mid y\in {\mathbb {R}}^d \}\); it remains to apply the second identity to \(h_i^{r,\gamma } = {\tilde{f}}_i^{\gamma } \square \frac{1}{2r}\Vert \cdot \Vert ^2_2\) and note that \(\left( \frac{1}{2r}\Vert \cdot \Vert ^2_2\right) ^* = \frac{r}{2}\Vert \cdot \Vert ^2_2\).

Hence the Fenchel conjugate of the function \(H^{r,\gamma }\) is

$$\begin{aligned}{} & {} \sup \limits _{\textbf{x}\in ({\mathbb {R}}^d)^m} \left\{ \langle \textbf{z},\textbf{x}\rangle - H^{r,\gamma }(\textbf{x})\right\} \nonumber \\{} & {} \quad = \sup \limits _{\textbf{x}\in ({\mathbb {R}}^d)^m} \left\{ \sum \limits _{i=1}^{m} \left( \langle [\textbf{z}]_i,[\textbf{x}]_i \rangle - h_{i}^{r,\gamma }([\textbf{x}]_i)\right) \right\} \nonumber \\{} & {} \quad =\sum \limits _{i=1}^{m} \sup \limits _{[\textbf{x}]_i\in {\mathbb {R}}^d} \left\{ \langle [\textbf{z}]_i,[\textbf{x}]_i \rangle - h_{i}^{r,\gamma }([\textbf{x}]_i)\right\} \nonumber \\{} & {} \quad = \sum \limits _{i=1}^{m} (h_{i}^{r,\gamma })^*([\textbf{z}]_i) = (H^{r,\gamma })^*(\textbf{z}). \end{aligned}$$
(16)

In the same way one can see that \(H^{r,\gamma }\) and \((H^{r,\gamma })^{**}\) coincide. \(\square\)

Remark 3

For each i the function \(\left( h^{r,\gamma }_{i}\right) ^*\) from (14) is \(\left( \frac{1}{\gamma }+r\right)\) smooth and r strongly convex by definition, so we have \(h^{r,\gamma }_i = (h^{r,\gamma }_i)^{**}\) being \(\frac{1}{r}\) smooth and \(\frac{\gamma }{1+r\gamma }\) strongly convex. In addition

$$\begin{aligned}\nabla (h^{r,\gamma }_i)^*(z) = \nabla (f_i^{\gamma })^*(z) + rz \end{aligned}$$

as stated in Line 3 of Algorithm 1. Then we can apply Algorithm 2 for \(L = r^{-1}\) smooth and \(\mu = \frac{\gamma }{1+r\gamma }\) strongly convex functions \(h^{r,\gamma }_i\) and get the values of \(\nabla (h^{r,\gamma }_i)^*(z)\) as output.
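In code, Line 3 of Algorithm 1 therefore reduces to calling an oracle for \(\nabla (f_i^{\gamma })^*\). A minimal sketch under that assumption (`grad_f_conj` is a hypothetical user-supplied oracle; for the barycenter problem it is given in closed form by (34) below):

```python
import numpy as np

def grad_h_conj(z, r, grad_f_conj):
    """Gradient of (h_i^{r,gamma})^* at z, i.e. nabla(f_i^gamma)^*(z) + r*z,
    which is exactly the dual gradient that Algorithm 2 queries at each node."""
    z = np.asarray(z, dtype=float)
    return grad_f_conj(z) + r * z
```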

Thus we construct a relaxation \(\min _{\textbf{x}\in {\mathcal {R}}}H^{r,\gamma }(\textbf{x})\) of the constrained convex optimization problem \(\min _{\textbf{x}\in {\mathcal {S}}}F^{\gamma }(\textbf{x})\).

Corollary 2

Let a function \(H^{r,\gamma }\) be defined in (14) and let \(\textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{\textbf{x}\in {\mathcal {R}}}H^{r,\gamma }(\textbf{x})\). Then applying Algorithm 2 for

$$\begin{aligned}\nabla (h^{r,\gamma }_i)^*(z) = \nabla (f_i^{\gamma })^*(z) + rz\end{aligned}$$

we get by Theorem 3

$$\begin{aligned} \left\| \textbf{x}^{*}_{r,\gamma } - \textbf{x}^{n}_{r,\gamma }\right\| ^2_2\le C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n, \end{aligned}$$
(17)

where \(\textbf{x}_{r,\gamma }^n = \nabla (H^{r,\gamma })^*(\textbf{z}_g^n)\) and

$$\begin{aligned}C = \frac{(1+r\gamma )^2}{2\gamma ^2}.\end{aligned}$$

Moreover, since \(\textbf{x}^*_{r,\gamma } \in {\mathcal {R}}\), i.e. \([\textbf{x}^*_{r,\gamma }]_i = [\textbf{x}^*_{r,\gamma }]_j\) for all i and j, the consensus condition is approximated as follows

$$\begin{aligned} \left\| \left[ \textbf{x}^{n}_{r,\gamma }\right] _i - \left[ \textbf{x}^{n}_{r,\gamma }\right] _j\right\| ^2_2\le 2C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n. \end{aligned}$$

1.2 B.2 Value bounds on \(H^{r,\gamma }\)

Although we defined \(h_i^{r,\gamma }\) on the whole \({\mathbb {R}}^d\), some properties hold true on the initial set S only.

Lemma 2

Let functions \(h_i^{r,\gamma }\) be defined in (14). If \(x\in S\), then for any \(r>0\), for each \(i = 1,\ldots ,m\) we have

$$\begin{aligned} \begin{array}{rl} f^{\gamma }_i(x) - \frac{r}{2(1+r\gamma )}\left\| \nabla f^{\gamma }_i(x)\right\| ^2_2 \le h_i^{r,\gamma }(x) \le f^{\gamma }_i(x). \end{array} \end{aligned}$$
(18)

Proof

The second inequality directly follows from the definition (14). To prove the first one we recall that \(f^{\gamma }_i\) is \(\gamma\) strongly convex and the following holds:

$$\begin{aligned} h_i^{r,\gamma }(x)= & {} \inf \limits _{y \in S} \left\{ f^{\gamma }_i(y) + (2r)^{-1}\Vert x-y\Vert ^2_2\right\} \\= & {} \inf \limits _{y: \ (x-y)\in S} \left\{ f^{\gamma }_i(x-y) + (2r)^{-1}\Vert y\Vert ^2_2\right\} \\\ge & {} \inf \limits _{y: \ (x-y)\in S} \left\{ f^{\gamma }_i(x) + \langle \nabla f^{\gamma }_i(x), -y \rangle + \gamma /2 \Vert y\Vert ^2_2 + (2r)^{-1}\Vert y\Vert ^2_2\right\} \\\ge & {} \inf \limits _{y\in {\mathbb {R}}^d} \left\{ f^{\gamma }_i(x) + \langle \nabla f^{\gamma }_i(x), -y \rangle + \gamma /2 \Vert y\Vert ^2_2 + (2r)^{-1}\Vert y\Vert ^2_2\right\} , \end{aligned}$$

which attains its minimum at \(y=\frac{r}{1+r\gamma }\nabla f^{\gamma }_i(x)\) and hence equals

$$\begin{aligned}{} & {} f^{\gamma }_i(x) + \frac{r}{1+r\gamma } \langle -\nabla f^{\gamma }_i(x), \nabla f^{\gamma }_i(x) \rangle + \frac{r}{2(1+r\gamma )}\Vert \nabla f^{\gamma }_i(x)\Vert ^2_2 \\{} & {} \quad \quad \quad \quad =f^{\gamma }_i(x) - \frac{r}{2(1+r\gamma )}\Vert \nabla f^{\gamma }_i(x)\Vert ^2_2. \end{aligned}$$

\(\square\)
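As a quick numeric sanity check of (18) (not part of the proof), take \(S = {\mathbb {R}}^d\) and the quadratic \(f^{\gamma }(x) = \frac{\gamma }{2}\Vert x\Vert ^2_2\), for which the lower bound in (18) is attained exactly:

```python
import numpy as np
from scipy.optimize import minimize

gamma, r = 0.5, 0.2
x = np.random.default_rng(0).standard_normal(5)

f = lambda y: 0.5 * gamma * y @ y              # f^gamma, gamma-strongly convex
grad_norm_sq = (gamma * x) @ (gamma * x)       # ||nabla f^gamma(x)||^2

# Envelope h^{r,gamma}(x) from (14), computed numerically over y in R^d.
h = minimize(lambda y: f(y) + (y - x) @ (y - x) / (2 * r), x).fun

lower = f(x) - r / (2 * (1 + r * gamma)) * grad_norm_sq
assert lower - 1e-6 <= h <= f(x) + 1e-6
# Closed form for this f: h^{r,gamma}(x) = gamma*||x||^2 / (2*(1 + r*gamma)),
# which coincides with the lower bound of (18).
```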

1.3 B.3 Convergence in argument

Lemma 3 shows convergence in argument in the following sense: if the regularization parameter r tends to zero, the argminimum \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) of \(H^{r,\gamma }\) tends to the argminimum \(\textbf{x}^*_{\gamma }\in {\mathcal {S}}\) of \(F^{\gamma }\). By Corollary 2, \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) is approximated by \(\textbf{x}^n_{r,\gamma }\in ({\mathbb {R}}^d )^m\) after a sufficient number of iterations n.

Lemma 3

Let \(\textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\limits _{\textbf{x}\in {\mathcal {R}}} H^{r,\gamma }(\textbf{x})\) for \(H^{r,\gamma }\) defined in (14). Let

$$\begin{aligned} \Vert \nabla F^{\gamma }(\textbf{x})\Vert _2^2 \le m K_{\zeta }^2 \quad \forall \textbf{x}\in \{\textbf{y}\in {\mathcal {S}}\mid \Vert \textbf{y}-\textbf{x}^*_{\gamma }\Vert _2\le \zeta \}. \end{aligned}$$
(19)

If r is such that \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\), then

$$\begin{aligned} \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \sqrt{\frac{rm}{2\gamma }}K_{\zeta }. \end{aligned}$$
(20)

Proof

Using (18) and strong convexity of \(F^{\gamma }\) and \(H^{r,\gamma }\) we have

$$\begin{aligned} F^{\gamma } (\textbf{x}^*_{\gamma }){} & {} \ge H^{r,\gamma }(\textbf{x}^*_{\gamma }) = \sum \limits _{i=1}^m h_i^{r,\gamma }([\textbf{x}^*_{\gamma }]_i) \\{} & {} \ge \sum \limits _{i=1}^m \left( h_i^{r,\gamma }([\textbf{x}^*_{r,\gamma }]_i) + \frac{\gamma }{2(1+r\gamma )}\Vert [\textbf{x}^*_{r,\gamma }]_i - [\textbf{x}^*_{\gamma }]_i\Vert _2^2\right) \\{} & {} \quad = H^{r,\gamma }(\textbf{x}^*_{r,\gamma }) + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{r,\gamma }) - \frac{r}{2(1+r\gamma )}\Vert \nabla F^{\gamma }(\textbf{x}^*_{r,\gamma })\Vert ^2_2 + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{r,\gamma }) - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2 + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{\gamma }) + \gamma /2 \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2 + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{\gamma }) + \frac{\gamma }{1+r\gamma } \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2. \end{aligned}$$

Then \(\frac{\gamma }{1+r\gamma } \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2\le 0\) and hence \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert ^2_2 \le \frac{rm}{2\gamma }K^2_{\zeta }\). \(\square\)

Combining Lemma 3 with Corollary 2 we get the following.

Remark 4

Let \(\zeta >0\) and let \(K_{\zeta }\) be such that (19) holds. If

$$\begin{aligned} \sqrt{\frac{rm}{2\gamma }}K_{\zeta }+\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \zeta , \end{aligned}$$

where \(C_1 = \frac{(1+r\gamma )^2}{2\gamma ^2}\), then both \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\) and \(\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\) hold.

1.4 B.4 Value approximation

Let \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) be the unique argminimum of \(H^{r,\gamma }\) on the consensus space \({\mathcal {R}}\), i.e.

$$\begin{aligned} \begin{array}{cc} \textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\limits \limits _{\textbf{x}\in {\mathcal {R}}} H^{r,\gamma }(\textbf{x}). \end{array} \end{aligned}$$
(21)

In order to prove the value approximation (5) let us separate it into parts and estimate each of them:

$$\begin{aligned}{} & {} F^{\gamma }(\textbf{x}^{n}_{r,\gamma }) - F^{\gamma }(\textbf{x}^*_{\gamma }) \end{aligned}$$
(22a)
$$\begin{aligned}{} & {} \quad \le F^{\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) \end{aligned}$$
(22b)
$$\begin{aligned}{} & {} \quad + H^{r,\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{*}_{r,\gamma }) \end{aligned}$$
(22c)
$$\begin{aligned}{} & {} \quad + H^{r,\gamma }(\textbf{x}^{*}_{r,\gamma }) - F^{\gamma }(\textbf{x}^{*}_{\gamma }). \end{aligned}$$
(22d)

The last addend is non-positive and can be dropped:

$$\begin{aligned} H^{r,\gamma }(\textbf{x}^{*}_{r,\gamma }) - F^{\gamma }(\textbf{x}^{*}_{\gamma }) \le H^{r,\gamma }(\textbf{x}^{*}_{\gamma }) - F^{\gamma }(\textbf{x}^{*}_{\gamma }) \le 0. \end{aligned}$$

The rest are estimated in Lemmas 4 and 5 under additional assumptions.

Lemma 4

Let \(\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\). If (19) holds, then

$$\begin{aligned} F^{\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) \le \frac{r}{2(1+r\gamma )}mK^2_{\zeta }. \end{aligned}$$
(23)

Proof

We cannot take a uniform K in place of \(K_{\zeta }\) because \(F^{\gamma }\) is not smooth. Nonetheless, assuming that \(\textbf{x}^{n}_{r,\gamma }\) belongs to the \(\zeta\)-neighborhood of \(\textbf{x}^*_{\gamma }\), we immediately obtain from (18) and (19) that

$$\begin{aligned} F^{\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) \le \frac{r}{2(1+r\gamma )}\Vert \nabla F^{\gamma }(\textbf{x}^{n}_{r,\gamma })\Vert ^2_2 \le \frac{r}{2(1+r\gamma )}mK^2_{\zeta }. \end{aligned}$$

\(\square\)

Lemma 5

Let (19) hold. Then

$$\begin{aligned} H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^*_{r,\gamma })&\le C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}, \\ \text{ where } \quad C_2&= \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{m(1+r\gamma )^2}{4r\gamma ^2}. \end{aligned}$$

Proof

By \(\frac{m}{r}\) smoothness of \(H^{r,\gamma }\)

$$\begin{aligned} \begin{array}{rl} H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^*_{r,\gamma }) &{}\le \langle \nabla H^{r,\gamma }\left( \textbf{x}^*_{r,\gamma }\right) ,\textbf{x}^{n}_{r,\gamma } - \textbf{x}^{*}_{r,\gamma }\rangle + \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma } - \textbf{x}^*_{r,\gamma }\Vert ^2_2 \\ &{}\le \langle \nabla H^{r,\gamma }\left( \nabla (H^{r,\gamma })^*(\textbf{z}_g^{\infty })\right) ,\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) - \textbf{x}^{*}_{r,\gamma }\rangle \\ &{}\quad + \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma } - \textbf{x}^*_{r,\gamma }\Vert ^2_2 \\ &{}\le \langle \textbf{z}_g^{\infty }, \nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) - \textbf{x}^{*}_{r,\gamma }\rangle + \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma } - \textbf{x}^*_{r,\gamma }\Vert ^2_2, \end{array} \end{aligned}$$

where \(\textbf{z}^{\infty }_{g}\) is the limit of \(\textbf{z}_g^n\) and so it is the argminimum of \((H^{r,\gamma })^*\) on \({\mathcal {R}}^{\perp }\). By (17) we have

$$\begin{aligned}{} & {} \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{r,\gamma }\Vert ^2_2 \le \frac{m}{2r}C_1\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n \\{} & {} \quad = \frac{m(1+r\gamma )^2}{4r\gamma ^2}\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n\end{aligned}$$

Let us introduce the orthogonal projection matrix \(\textbf{P}\) onto the subspace \({\mathcal {R}}^{\perp }\), i.e., \(\textbf{P}v = \mathop {\mathrm {arg\,min}}\limits _{z \in {\mathcal {R}}^{\perp }} \Vert v - z\Vert _2\) for an arbitrary \(v \in ({\mathbb {R}}^d)^m\). Then the matrix \(\textbf{P}\) is

$$\begin{aligned} \textbf{P}= \left( \textbf{I}_m - \frac{1}{m}\textbf{1}_m\textbf{1}_m^{\top }\right) \otimes \textbf{I}_d, \end{aligned}$$
(24)

where \(\textbf{I}_m\) denotes the \(m\times m\) identity matrix, \(\textbf{1}_m = (1,\ldots ,1)\in {\mathbb {R}}^m\), and \(\otimes\) is the Kronecker product. Note that \(\textbf{P}^{\top }\textbf{P}= \textbf{P}\).
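A direct numpy construction of (24), together with a check of the projector identity (a small illustration, not used by the algorithm itself):

```python
import numpy as np

def consensus_projector(m, d):
    """Orthogonal projector (24) onto R^perp for m agents with states in R^d."""
    P = np.kron(np.eye(m) - np.ones((m, m)) / m, np.eye(d))
    assert np.allclose(P.T @ P, P)   # P is symmetric and idempotent
    return P

# P subtracts the across-agent mean block from each agent's block.
P = consensus_projector(m=3, d=2)
v = np.arange(6, dtype=float)
assert np.allclose(P @ v, v - np.tile(v.reshape(3, 2).mean(axis=0), 3))
```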

Since \(\textbf{z}_g^{\infty }\in {\mathcal {R}}^{\perp }\) and \(\textbf{x}^{*}_{r,\gamma }\in {\mathcal {R}}\), the first part simplifies to \(\langle \textbf{z}_g^{\infty }, \textbf{P}\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) \rangle\). We may use Lemma 2 in Kovalev et al. (2021) to get the following estimate

$$\begin{aligned} \begin{array}{rl} \Vert \textbf{P}\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n})\Vert _2^2 = \Vert \nabla (H^{r,\gamma })^*(\textbf{z}_g^{n})\Vert _{\textbf{P}}^2 \le \frac{2}{\theta \lambda _{\min }^{+}}\left( (H^{r,\gamma })^*(\textbf{z}^n_g) - (H^{r,\gamma })^*(\textbf{z}^{n+1}_f)\right) . \end{array} \end{aligned}$$

As \(\textbf{z}^{n+1}_f\) is a non-optimal point of Algorithm 1, this is not greater than

$$\begin{aligned}{} & {} \frac{2}{\theta \lambda _{\min }^{+}}\left( (H^{r,\gamma })^*(\textbf{z}^n_g) - (H^{r,\gamma })^*(\textbf{z}^*)\right) \\{} & {} \quad \le \frac{m(1+r\gamma )}{\gamma \theta \lambda _{\min }^{+}}\left\| \textbf{z}^n_g - \textbf{z}^*\right\| _2^2 = \frac{m(1+r\gamma )^2}{\gamma ^2}\frac{\lambda _{\max }}{\lambda _{\min }^{+}}\left\| \textbf{z}^n_g - \textbf{z}^*\right\| _2^2 \\{} & {} \quad \le \frac{m(1+r\gamma )^2}{2\gamma ^2}\frac{\lambda _{\max }}{\lambda _{\min }^{+}} \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n \end{aligned}$$

and the latter inequalities follow from the \(\frac{m(1+r\gamma )}{\gamma }\) smoothness of \((H^{r,\gamma })^*\) and from the fact that the proof of (Kovalev et al. 2021, Theorem 1) actually covers the following chain of inequalities:

$$\begin{aligned} \left\| \nabla H^*(\textbf{z}_g^n) - \textbf{x}^*\right\| ^2_2 \le \frac{1}{\mu ^2}\left\| \textbf{z}^n_g - \textbf{z}^* \right\| ^2_2 \le C \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\right) ^n = \frac{1}{2\mu ^2} \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\right) ^n. \end{aligned}$$

By our assumption \(\Vert \textbf{z}^{\infty }_{g}\Vert _2 = \Vert \nabla H^{r,\gamma }(\textbf{x}_{r,\gamma }^*)\Vert _2< \sqrt{m}K_{\zeta }\). Thus, we obtain

$$\begin{aligned}{} & {} H^{r,\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^*_{r,\gamma }) \\{} & {} \quad \le \sqrt{m}K_{\zeta }\frac{\sqrt{m}(1+r\gamma )}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2} \\{} & {} \qquad + \frac{m(1+r\gamma )^2}{4r\gamma ^2}\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n \\{} & {} \quad \le \left( \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{m(1+r\gamma )^2}{4r\gamma ^2}\right) \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2} \\{} & {} \quad = C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}. \end{aligned}$$

\(\square\)

1.5 B.5 Final compilation

This section completes the proof of Theorem 1 and shows Remark 1.

Recall that \(C_1 = \frac{(1+r\gamma )^2}{2\gamma ^2}\) and

$$\begin{aligned} C_2 = \frac{m}{2r}C_1 + \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} = \frac{m(1+r\gamma )^2}{4r\gamma ^2} + \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}}. \end{aligned}$$

By Remark 4 and Lemmas 4, 5 we see that \(F^{\gamma }(\textbf{x}^{n}_{r,\gamma }) - F^{\gamma }(\textbf{x}^*_{\gamma })< \varepsilon\) if

$$\begin{aligned} \forall \textbf{x}\in \{\textbf{y}\in {\mathcal {S}}\mid \Vert \textbf{y}-\textbf{x}_{\gamma }^*\Vert _2<\zeta \} \qquad \Vert \nabla F^{\gamma }(\textbf{x})\Vert _2^2&<mK^2_{\zeta }, \end{aligned}$$
(25)
$$\begin{aligned} \sqrt{\frac{rm}{2\gamma }}K_{\zeta }+\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}&\le \zeta , \end{aligned}$$
(26)
$$\begin{aligned} \frac{r}{2(1+r\gamma )}mK^2_{\zeta }&\le \varepsilon /2, \end{aligned}$$
(27)
$$\begin{aligned} C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}&\le \varepsilon /2. \end{aligned}$$
(28)

Let \(\zeta =\sqrt{\varepsilon / \gamma }\) and let \(r\le \frac{\varepsilon }{2mK^2_{\zeta }}\). Then (27) holds. If (28) is fulfilled, then (26) follows from (27) and (28), as \(\sqrt{\frac{rm}{2\gamma }}K_{\zeta }\le \sqrt{\frac{\varepsilon }{4\gamma }} = \zeta /2\) and \(\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \zeta /2\) since \(1\le \sqrt{C_1}\le C_1\le C_2\) and \(\varepsilon \le \sqrt{\varepsilon /\gamma } = \zeta\). Thus, it suffices to assume

$$\begin{aligned}&\forall i \quad \forall x\in \lbrace y\in S \mid \Vert y-x_{\gamma }^*\Vert _2^2\le \varepsilon / \gamma \rbrace \qquad \Vert \nabla f_i^{\gamma }(x)\Vert _2 \le K, \\&\quad r\le \frac{\varepsilon }{2mK^2}, \\&\quad C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2} \le \varepsilon /2. \end{aligned}$$

So an \(\varepsilon\)-approximation requires a number of iterations

$$\begin{aligned} {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\sqrt{\frac{1+r\gamma }{r\gamma }}\ln \frac{C_2}{\varepsilon }\right) = {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\frac{1}{\sqrt{\gamma \varepsilon }}\ln \frac{1}{\varepsilon }\right) . \end{aligned}$$
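For reference, the smallest n satisfying (28) can be computed directly from the contraction factor; a minimal sketch (the function name and signature are ours):

```python
import math

def iterations_for_accuracy(lam_max, lam_min_plus, r, gamma, C2, eps):
    """Smallest n with C2 * (1 - rho)**(n / 2) <= eps / 2, rho as in (28)."""
    rho = (lam_min_plus / (7.0 * lam_max)) * math.sqrt(r * gamma / (1.0 + r * gamma))
    return math.ceil(2.0 * math.log(2.0 * C2 / eps) / (-math.log(1.0 - rho)))
```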

C Proof of Theorem 2

To prove Theorem 2 we combine Theorem 1, proved above, with properties of the entropic regularization of the Wasserstein barycenter problem.

1.1 C.1 Entropy regularized WB problem

Recall that for a fixed cost matrix M we define the set of transport plans as

$$\begin{aligned} U(p,q) := \left\{ X \in {\mathbb {R}}_+^{d \times d} \mid X \textbf{1}= p, X^T\textbf{1}= q \right\} \end{aligned}$$

and the Wasserstein distance between two probability distributions p and q as

$$\begin{aligned}{\mathcal {W}}(p,q):= \min _{X \in U(p,q)} \langle M, X \rangle . \end{aligned}$$

The entropy regularized (or smoothed) Wasserstein distance is defined as

$$\begin{aligned} {\mathcal {W}}_{\gamma } (p,q) := \min _{X \in U(p,q)} \left\{ \langle M, X \rangle - \gamma E(X)\right\} , \end{aligned}$$
(29)

where \(\gamma >0\) and

$$\begin{aligned} E(X) := - \sum _{i=1}^{d} \sum _{j=1}^{d} e(X_{ij}), \nonumber \\ \text {where } e(x)=\left\{ \begin{array}{ll} x\ln x \quad &{} \text{ if } x>0 \\ 0 \quad &{} \text{ if } x=0. \end{array}\right. \end{aligned}$$
(30)

So the regularized problem seeks to minimize the transportation cost while maximizing the entropy. Moreover, \({\mathcal {W}}_{\gamma }(p,q)\rightarrow {\mathcal {W}}(p,q)\) as \(\gamma \rightarrow 0\).
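Numerically, \({\mathcal {W}}_{\gamma }\) can be evaluated with the POT library (Flamary et al. 2021) cited above. A minimal sketch with placeholder data; note that `ot.sinkhorn` returns the entropic optimal plan, from which we assemble the objective value of (29) ourselves:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al. 2021)

rng = np.random.default_rng(0)
d = 4
M = rng.random((d, d))
np.fill_diagonal(M, 0.0)                        # cost matrix with zero diagonal
p = np.full(d, 1.0 / d)
q = rng.random(d)
q /= q.sum()

gamma = 0.05
X = ot.sinkhorn(p, q, M, reg=gamma)             # plan with X @ 1 = p, X.T @ 1 = q
entropy = -np.sum(X[X > 0] * np.log(X[X > 0]))  # E(X) from (30)
W_gamma = np.sum(M * X) - gamma * entropy       # objective of (29) at the optimum
```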

Then the convex optimization problem (7) can be relaxed to the following \(\gamma\) strongly convex optimization problem

$$\begin{aligned} \min _{p \in S_1(d)}\sum \limits _{i=1}^{m} {\mathcal {W}}_{\gamma ,q_i}(p), \end{aligned}$$
(31)

where \({\mathcal {W}}_{\gamma ,q_i}(p) = {\mathcal {W}}_{\gamma }(q_i, p)\). The argminimum of (31) is called the uniform Wasserstein barycenter (Agueh and Carlier 2011; Cuturi and Doucet 2014) of the family of \(q_1,\ldots , q_m\). Moreover, problem (31) admits a unique solution and approximates the unregularized WB problem as follows.

Remark 5

Let \(\gamma \le \frac{\varepsilon }{4\ln d}\). If vectors \(\hat{p}_i\in S_1(d)\) are such that

$$\begin{aligned}\sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}(\hat{p}_i) - \min \limits _{p\in S_1(d)}\sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}(p) \le \frac{\varepsilon }{2},\end{aligned}$$

then

$$\begin{aligned}\sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(\hat{p}_i) - \min \limits _{p\in S_1(d)}\sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(p)\le \varepsilon .\end{aligned}$$

Indeed, as entropy is bounded we have \({\mathcal {W}}_{q_i}(p)\le {\mathcal {W}}_{\gamma , q_i}(p)\le {\mathcal {W}}_{q_i}(p) + 2\gamma \ln d\) for all i and p. Then, for \(p^* = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{p\in S_1(d)}\sum \nolimits _{i=1}^{m}{\mathcal {W}}_{q_i}(p)\) and \(p^*_{\gamma } = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{p\in S_1(d)}\sum \nolimits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p)\) it holds that

$$\begin{aligned}&\sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(\hat{p}_i) - \sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(p^*) \\&\quad \le \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(\hat{p}_i) - \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p^*) + 2\gamma \ln d \\&\quad \le \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(\hat{p}_i) - \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p^*_{\gamma }) + \frac{\varepsilon }{2} \le \varepsilon . \end{aligned}$$
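In practice, Remark 5 fixes the smoothing level directly from the target accuracy; a one-line helper (a sketch; the name is ours):

```python
import math

def entropic_gamma(eps, d):
    """gamma <= eps / (4 * ln d), so that 2 * gamma * ln d <= eps / 2 (Remark 5)."""
    return eps / (4.0 * math.log(d))
```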

1.2 C.2 Legendre transforms

One particular advantage of entropy regularization of the Wasserstein distance is that it yields closed-form representations for the dual function \({\mathcal {W}}^*_{\gamma , q}(\cdot )\) and for its gradient. Recall that the Fenchel-Legendre transform of (29) is defined as

$$\begin{aligned} {\mathcal {W}}^*_{\gamma ,q}(z)&:= \max _{p \in S_1(d)}\left\{ \left\langle z,p\right\rangle - { {\mathcal {W}}_{\gamma ,q} (p)}\right\} . \end{aligned}$$
(32)

Theorem 4

(Cuturi and Peyré 2015, Theorem 2.4) For \(\gamma >0\), the Fenchel-Legendre dual function \({\mathcal {W}}^*_{\gamma ,q}(z)\) is differentiable and given by

$$\begin{aligned} \begin{array}{rl} {\mathcal {W}}^*_{\gamma ,q}(z) &{} = \gamma \left( E(q) + \left\langle q, \ln {\mathcal {K}}\alpha \right\rangle \right) \\ {} &{}= - \gamma \left\langle q,\ln q \right\rangle + \gamma \sum \limits _{j=1}^{d} [q]_j \ln \left( \sum \limits _{i=1}^{d}\exp \left( \frac{1}{\gamma }\left( [z]_i-M_{ji}\right) \right) \right) \end{array} \end{aligned}$$
(33)

and its gradient \(\nabla {\mathcal {W}}^*_{\gamma ,q}(z)\) is \(1/\gamma\)-Lipschitz in the 2-norm with

$$\begin{aligned} \begin{array}{rl} \nabla {\mathcal {W}}^*_{\gamma ,q}(z) &{}= \alpha \circ \left( {\mathcal {K}}\cdot {q}/({{\mathcal {K}}\alpha })\right) \in S_1(d), \\ \left[ \nabla {\mathcal {W}}^*_{\gamma ,q}(z) \right] _l &{} =\sum \limits _{j=1}^{d} [q]_j \frac{\exp \left( \frac{1}{\gamma }([z]_l - M_{lj})\right) }{\sum \limits _{i=1}^{d}\exp \left( \frac{1}{\gamma }([z]_i - M_{ij})\right) }, \end{array} \end{aligned}$$
(34)

where \(z \in {\mathbb {R}}^d\) and for brevity we denote \(\alpha = \exp ( {z}/{\gamma })\) and \({\mathcal {K}}= \exp \left( {-M}/{\gamma }\right)\).
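Since \(\nabla {\mathcal {W}}^*_{\gamma ,q}\) is exactly the dual gradient oracle that Algorithm 1 calls on each node, it is worth writing (34) out. A minimal sketch (function name ours) that evaluates the column-wise softmax in the log domain for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def grad_W_conj(z, q, M, gamma):
    """Gradient (34) of the Fenchel-Legendre dual of W_{gamma,q}.

    logits[l, j] = (z_l - M_{lj}) / gamma; a softmax over l (per column j)
    averaged over columns with weights q_j gives a point of the simplex S_1(d).
    """
    logits = (z[:, None] - M) / gamma
    log_softmax = logits - logsumexp(logits, axis=0, keepdims=True)
    return np.exp(log_softmax) @ q

# The output is a probability vector: each column of the softmax sums to one,
# and q itself lies in the simplex.
```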

Notice that, to get back and obtain the approximate barycenter, we can employ the following result (with \(\lambda _i = 1\)).

Theorem 5

(Cuturi and Peyré (2015), Theorem 3.1) The barycenter \(p^*\) solving (31) satisfies

$$\begin{aligned} \forall i=1,\ldots ,m \qquad p^* =\nabla {\mathcal {W}}^*_{\gamma ,q_i}(z^*_i), \end{aligned}$$

where the set of \(z^*_i\) constitutes any solution of the smoothed dual WB problem:

$$\begin{aligned} \min \limits _{z_1,\ldots ,z_m\in {\mathbb {R}}^d} \sum \limits _{i=1}^{m}\lambda _i{\mathcal {W}}^*_{\gamma ,q_i}(z_i) \quad \text{ s.t. } \quad \sum \limits _{i=1}^{m}\lambda _i z_i = 0. \end{aligned}$$

Thus we can apply Theorem 1 for the problem (31) with explicitly defined \(\nabla {\mathcal {W}}^*_{\gamma , q_i}\) and obtain \(\textbf{x}^n_{r,\gamma }\) that satisfies

$$\begin{aligned}&\sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}([\textbf{x}^n_{r,\gamma }]_i) - \min \limits _{p\in S_1(d)} \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}(p) \\&\quad \le \frac{r}{4(1+r\gamma )}mK^2 + \frac{1}{2}C_2\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \varepsilon /2. \end{aligned}$$

By Remark 5 this proves

$$\begin{aligned}&\left| \sum \limits _{i=1}^m {\mathcal {W}}_{q_i}([\textbf{x}^{n}_{r,\gamma }]_i) - \sum \limits _{i=1}^m {\mathcal {W}}_{q_i}([\textbf{p}^*]_i)\right| \\&\quad \le 2\gamma \ln d + \frac{r}{4(1+r\gamma )}mK^2 + C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \varepsilon , \end{aligned}$$

for \(C=\frac{1}{2}C_2 = \frac{(1+r\gamma )mK_{\zeta }}{2\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{m(1+r\gamma )^2}{8r\gamma ^2}\).

1.3 C.3 Parameter estimation

It remains to assign \(\zeta >0\) and \(K = K_{\zeta }\) satisfying (25). Due to Assumption 2 such \(\zeta\) and K exist.

Proposition 1

Let a set \(\{q_i\}_{i=1}^m\) satisfy Assumption 2, let \(p^*_{\gamma }\) be the uniform Wasserstein barycenter of \(\{q_i\}_{i=1}^m\), and let \(\zeta \in \left( 0, \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\right)\). For each \(i = 1,\ldots ,m\) the norm of the gradient \(\Vert \nabla {\mathcal {W}}_{\gamma ,q_i}(\cdot )\Vert _2^2\) is uniformly bounded over \(\{p\in S_1(d)\mid \Vert p-p^*_{\gamma } \Vert _2^2 \le \zeta \};\) the bound \(K_{\rho }\) is given in (35) for \(\rho \le \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}-\zeta .\)

We obtain Proposition 1 by combining Lemma 6 from Bigot et al. (2019) with Lemma 7, which is proved below.

Lemma 6

(Bigot et al. (2019), Lemma 3.5) For any \(\rho \in (0,1)\), \(q\in S_1(d)\), and \(p\in \{x\in S_1(d)\mid \min _l x_l\ge \rho \}\) there is a bound: \(\Vert \nabla {\mathcal {W}}_{\gamma , q}(p) \Vert ^2_2\le K_{\rho }\), where

$$\begin{aligned} K_{\rho } = \sum \limits _{j=1}^{d}\left( 2\gamma \ln d + \inf _i\sup _l |M_{jl} - M_{il}| - \gamma \ln \rho \right) ^2. \end{aligned}$$
(35)

Lemma 7

Let a set \(\{q_i\}_{i=1}^m\) satisfy Assumption 2 and let \(p^*_{\gamma }\) be the uniform Wasserstein barycenter of \(\{q_i\}_{i=1}^m\). Then all components k of \(p^*_{\gamma }\) have a uniform positive lower bound: \([p^*_{\gamma }]_k\ge \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\).

Proof

Let \(X^*_i\) denote the optimal transport plan between \(p^*_{\gamma }\) and \(q_i\). Assume the contrary: there is k such that \([p^*_{\gamma }]_k < \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\). Then there is another component n such that \([p^*_{\gamma }]_n>\min _i [q_i]_n \ge \min _{i,l} [q_i]_l\). Consider the vector p that coincides with \(p^*_{\gamma }\) except for the components \([p]_k = [p^*_{\gamma }]_k+\delta\) and \([p]_n = [p^*_{\gamma }]_n-\delta\), where \(\delta >0\) is less than the smallest entry \(\min _{i,a,b}[X_i^*]_{a,b}\) of the optimal transport plans \(X^*_i\). Because of the entropy term, all entries of these optimal plans are strictly positive, so such a \(\delta\) exists.

Construct now non-optimal transport plans between p and each of the \(q_i\) in order to get a contradiction with the optimality of \(p^*_{\gamma }\). Initially we have \({\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma }) = \langle M, X^*_i \rangle - \gamma E(X^*_i)\). Consider the matrix \(X_i\) that differs from \(X^*_i\) only at four elements:

$$\begin{aligned}{}[X_i]_{kk} = [X_i^*]_{kk} +\frac{1}{2}\delta , \quad [X_i]_{kn} = [X_i^*]_{kn} +\frac{1}{2}\delta , \\ [X_i]_{nn} = [X_i^*]_{nn} -\frac{1}{2}\delta , \quad [X_i]_{nk} = [X_i^*]_{nk} -\frac{1}{2}\delta . \end{aligned}$$

Then \(X_i\) is a transport plan between p and \(q_i\): its elements remain positive by the choice of \(\delta\), and \(X_i \textbf{1}= p\), \(X_i^{\top }\textbf{1}= q_i\). Using the monotonicity of \(e(x)=x\ln x\) on the interval \((0,\frac{1}{e})\) and the assumptions that the diagonal elements of the cost matrix M are zero and that M is symmetric (so \(M_{kn}=M_{nk}\)), we get for each i:

$$\begin{aligned} \begin{array}{rl} {\mathcal {W}}_{\gamma ,q_i}(p) \le &{} \langle M, X_i \rangle - \gamma E(X_i) \\ =&{} \langle M, X^*_i \rangle - \gamma E(X^*_i) +\frac{1}{2}\delta M_{kn}-\frac{1}{2}\delta M_{nk} \\ +&{} \gamma ([X_i]_{kk}\ln [X_i]_{kk} - [X^*_i]_{kk}\ln [X^*_i]_{kk}) \\ +&{} \gamma ([X_i]_{kn}\ln [X_i]_{kn} - [X^*_i]_{kn}\ln [X^*_i]_{kn}) \\ +&{} \gamma ([X_i]_{nk}\ln [X_i]_{nk} - [X^*_i]_{nk}\ln [X^*_i]_{nk}) \\ +&{} \gamma ([X_i]_{nn}\ln [X_i]_{nn} - [X^*_i]_{nn}\ln [X^*_i]_{nn}) \\ <&{} \langle M, X^*_i \rangle - \gamma E(X^*_i) +\frac{1}{2}\delta M_{kn}-\frac{1}{2}\delta M_{nk} \\ =&{} \langle M, X^*_i \rangle - \gamma E(X^*_i) = {\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma }). \end{array} \end{aligned}$$

The obtained inequalities \({\mathcal {W}}_{\gamma ,q_i}(p)<{\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma })\) contradict the fact that \(p^*_{\gamma }\) is the barycenter; this proves the lemma. \(\square\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yufereva, O., Persiianov, M., Dvurechensky, P. et al. Decentralized convex optimization on time-varying networks with application to Wasserstein barycenters. Comput Manag Sci 21, 12 (2024). https://doi.org/10.1007/s10287-023-00493-9
