
Decentralized convex optimization on time-varying networks with application to Wasserstein barycenters


Abstract

Inspired by recent advances in distributed algorithms for approximating Wasserstein barycenters, we propose a novel distributed algorithm for this problem. The main novelty is that we consider time-varying computational networks, motivated, for example, by settings in which only a subset of sensors makes an observation at each time step, and yet the goal is to average signals (e.g., satellite pictures of some area) by approximating their barycenter. We embed this problem into a class of non-smooth dual-friendly distributed optimization problems over time-varying networks and develop a first-order method for this class. We prove non-asymptotic convergence rates, accelerated in the sense of Nesterov, and explicitly characterize their dependence on the parameters of the network and its dynamics. In the experiments, we demonstrate the efficiency of the proposed algorithm when applied to the Wasserstein barycenter problem.


Notes

  1. e.g., one can take \(f_i^{\gamma }(x) = f_i(x) + \frac{\gamma }{2}\Vert x\Vert ^2_2\).

References

  • Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. SIAM J Math Anal 43(2):904–924

  • del Barrio E, Giné E, Matrán C (1999) Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann Probab 27(2):1009–1071

  • Bigot J, Cazelles E, Papadakis N (2019) Data-driven regularization of Wasserstein barycenters with an application to multivariate density registration. Inf Inference J IMA 8(4):719–755

  • Bishop AN, Doucet A (2021) Network consensus in the Wasserstein metric space of probability measures. SIAM J Control Optim 59(5):3261–3277

  • Boissard E, Le Gouic T, Loubes J-M (2015) Distribution's template estimate with Wasserstein metrics. Bernoulli 21(2):740–759

  • Cuturi M, Peyré G (2015) A smoothed dual approach for variational Wasserstein problems. SIAM J Imag Sci 9(1):320–343

  • Cuturi M, Doucet A (2014) Fast computation of Wasserstein barycenters. In: International Conference on Machine Learning, PMLR, pp 685–693

  • Devolder O, Glineur F, Nesterov Y (2012) Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J Optim 22(2):702–727

  • Dvinskikh D (2021) Decentralized algorithms for Wasserstein barycenters. PhD thesis, Humboldt-Universität zu Berlin (Germany)

  • Dvinskikh D, Gorbunov E, Gasnikov A, Dvurechensky P, Uribe CA (2019) On primal and dual approaches for distributed stochastic convex optimization over networks. In: 2019 IEEE 58th Conference on Decision and Control (CDC), IEEE, pp 7435–7440

  • Dvinskikh D, Tiapkin D (2021) Improved complexity bounds in Wasserstein barycenter problem. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, PMLR, pp 1738–1746

  • Dvurechenskii P, Dvinskikh D, Gasnikov A, Uribe C, Nedich A (2018) Decentralize and randomize: faster algorithm for Wasserstein barycenters. In: Advances in Neural Information Processing Systems 31

  • Flamary R, Courty N, Gramfort A, Alaya MZ, Boisbunon A, Chambon S, Chapel L, Corenflos A, Fatras K, Fournier N, Gautheron L, Gayraud NTH, Janati H, Rakotomamonjy A, Redko I, Rolet A, Schutz A, Seguy V, Sutherland DJ, Tavenard R, Tong A, Vayer T (2021) POT: Python Optimal Transport. J Mach Learn Res 22(78):1–8

  • Gasnikov AV, Gasnikova E, Nesterov YE, Chernov A (2016) Efficient numerical methods for entropy-linear programming problems. Comput Math Math Phys 56(4):514–524

  • Gorbunov E, Rogozin A, Beznosikov A, Dvinskikh D, Gasnikov A (2022) Recent theoretical advances in decentralized distributed convex optimization. In: Nikeghbali A, Pardalos PM, Raigorodskii AM, Rassias MT (eds) High-Dimensional Optimization and Probability. Springer, Cham, pp 253–325. https://doi.org/10.1007/978-3-031-00832-0_8

  • Kantorovich L (1942) On the translocation of masses. C R (Doklady) Acad Sci URSS (NS) 37:199–201

  • Kovalev D, Gasanov E, Gasnikov A, Richtárik P (2021) Lower bounds and optimal algorithms for smooth and strongly convex decentralized optimization over time-varying networks. In: Advances in Neural Information Processing Systems 34

  • Kovalev D, Shulgin E, Richtárik P, Rogozin A, Gasnikov A (2021) ADOM: accelerated decentralized optimization method for time-varying networks. In: International Conference on Machine Learning, PMLR, pp 5784–5793

  • Krawtschenko R, Uribe CA, Gasnikov A, Dvurechensky P (2020) Distributed optimization with quantization for computing Wasserstein barycenters. arXiv preprint arXiv:2010.14325

  • Kroshnin A, Tupitsa N, Dvinskikh D, Dvurechensky P, Gasnikov A, Uribe C (2019) On the complexity of approximating Wasserstein barycenters. In: International Conference on Machine Learning, PMLR, pp 3530–3540

  • LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

  • Lemaréchal C, Sagastizábal C (1997) Practical aspects of the Moreau–Yosida regularization: theoretical preliminaries. SIAM J Optim 7(2):367–385

  • Li H, Lin Z (2021) Accelerated gradient tracking over time-varying graphs for decentralized optimization. arXiv preprint arXiv:2104.02596

  • Monge G (1781) Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris

  • Peyré G, Cuturi M (2019) Computational optimal transport: with applications to data science. Found Trends Mach Learn 11(5–6):355–607

  • Rabin J, Peyré G, Delon J, Bernot M (2011) Wasserstein barycenter and its application to texture mixing. In: International Conference on Scale Space and Variational Methods in Computer Vision, Springer, pp 435–446

  • Rockafellar RT (1997) Convex analysis, vol 11. Princeton University Press, Princeton

  • Rogozin A, Beznosikov A, Dvinskikh D, Kovalev D, Dvurechensky P, Gasnikov A (2021) Decentralized distributed optimization for saddle point problems. arXiv preprint arXiv:2102.07758

  • Rogozin A, Bochko M, Dvurechensky P, Gasnikov A, Lukoshkin V (2021) An accelerated method for decentralized distributed stochastic optimization over time-varying graphs. In: 2021 60th IEEE Conference on Decision and Control (CDC), pp 3367–3373. https://doi.org/10.1109/CDC45484.2021.9683110

  • Staib M, Claici S, Solomon JM, Jegelka S (2017) Parallel streaming Wasserstein barycenters. In: Advances in Neural Information Processing Systems 30, Curran Associates, pp 2647–2658. http://papers.nips.cc/paper/6858-parallel-streaming-wasserstein-barycenters.pdf

  • Uribe CA, Dvinskikh D, Dvurechensky P, Gasnikov A, Nedić A (2018) Distributed computation of Wasserstein barycenters over networks. In: 2018 IEEE Conference on Decision and Control (CDC), IEEE, pp 6544–6549

  • Uribe CA, Lee S, Gasnikov A, Nedić A (2020) A dual approach for optimal algorithms in distributed optimization over networks. In: 2020 Information Theory and Applications Workshop (ITA), IEEE, pp 1–37

  • Villani C (2009) Optimal transport: old and new, vol 338. Springer, Cham

  • Wu X, Lu J (2019) Fenchel dual gradient methods for distributed convex optimization over time-varying networks. IEEE Trans Autom Control 64(11):4629–4636. https://doi.org/10.1109/TAC.2019.2901829


Acknowledgements

The authors are grateful to Alexander Rogozin for his feedback on the manuscript. This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.

Author information

Authors and Affiliations

Authors

Contributions

OY carried out the research and wrote all sections except the introduction and the fourth one, MP performed all the numerical experiments, AG proposed the main ideas and wrote Section 4, PD supervised the research and wrote the introduction, DK advised on possible methods. All authors reviewed the manuscript.

Corresponding author

Correspondence to Olga Yufereva.

Ethics declarations

Competing interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A ADOM and its assumptions

The state-of-the-art method for decentralized optimization over time-varying networks, called ADOM, was developed in Kovalev et al. (2021), and this subsection presents its main objects. ADOM imposes natural restrictions on the class of admissible problems; the Wasserstein barycenter problem, for example, lies beyond its requirements. We therefore modify ADOM to solve more general constrained optimization problems. For the sake of consistency, we slightly change the original notation and restate below the results from Kovalev et al. (2021).

In Kovalev et al. (2021), the optimization problem with the consensus condition is

$$\begin{aligned}{} & {} \min \limits _{\textbf{x}\in {\mathcal {R}}} H(\textbf{x}) = \min \limits _{\textbf{x}\in {\mathcal {R}}} \sum \limits _{i=1}^{m} h_{i}([\textbf{x}]_i),\nonumber \\{} & {} \quad \text {where } {\mathcal {R}}= \left\{ \textbf{x}=([\textbf{x}]_1,\ldots ,[\textbf{x}]_m)\in ({\mathbb {R}}^d)^m \mid [\textbf{x}]_1 = \ldots =[\textbf{x}]_m\right\} , \end{aligned}$$
(10)

where functions \(h_i:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) are assumed to be smooth and strongly convex. Problem (10) is equivalent to the following:

$$\begin{aligned}{} & {} \min \limits _{\textbf{z}\in {\mathcal {R}}^{\perp }}H^*(\textbf{z}), \nonumber \\{} & {} \quad \text {where } {\mathcal {R}}^{\perp }= \left\{ \textbf{z}=([\textbf{z}]_1,\ldots ,[\textbf{z}]_m)\in ({\mathbb {R}}^d)^m \;\Big |\; \sum \limits _{i=1}^m [\textbf{z}]_i = 0\right\} , \end{aligned}$$
(11)

where \(H^*\) is the Fenchel conjugate of the function H and \({\mathcal {R}}^{\perp }\) is the orthogonal complement of \({\mathcal {R}}\); the equivalence holds since \(S = {\mathbb {R}}^d\) here.

Theorem 3

(Kovalev et al. 2021, Theorem 1) Let functions \(h_i:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be L smooth and \(\mu\) strongly convex, \(\textbf{x}^*\) be the solution of the optimization problem (10), \(\textbf{W}_n\) be a communication matrix at the n-th iteration satisfying Assumption 1. Set parameters \(\alpha , \eta , \theta , \sigma ,\tau\) of Algorithm 2 to \(\alpha = \frac{1}{2\,L}\), \(\eta = \frac{2\lambda _{\min }^{+}\sqrt{\mu L}}{7\lambda _{\max }}\), \(\theta = \frac{\mu }{\lambda _{\max }}\), \(\sigma = \frac{1}{\lambda _{\max }}\), and \(\tau = \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\). Then there exists \(C>0\), such that for Fenchel conjugate function \(H^*(\textbf{z})\) from (11)

$$\begin{aligned} \left\| \nabla H^*(\textbf{z}_g^n) - \textbf{x}^*\right\| ^2_2 \le C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{\mu }{L}}\right) ^n. \end{aligned}$$
(12)

Remark 2

Examining the details of the proof of Theorem 1 in Kovalev et al. (2021), we see that there is a particular choice of the constant C, namely

$$\begin{aligned} C = \max \left\{ \frac{2\tau }{\mu ^2}, \frac{\tau (1-\tau )L}{\eta (1-\eta \alpha )\mu ^2} \right\} = \frac{1}{\mu ^2}\max \left\{ \frac{2\lambda _{\min }^{+}\sqrt{\mu }}{7\lambda _{\max }\sqrt{L}}, \frac{1}{2} \right\} = \frac{1}{2\mu ^2}. \end{aligned}$$
(13)

The last equality holds since \(\frac{2\lambda _{\min }^{+}\sqrt{\mu }}{7\lambda _{\max }\sqrt{L}}\le \frac{2}{7}<\frac{1}{2}\). It means that reaching accuracy \(\varepsilon\) actually requires \(n = {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\sqrt{\frac{L}{\mu }}\ln \frac{1}{\mu ^2\varepsilon }\right)\) iterations.
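For concreteness, the parameter choice of Theorem 3 is easy to wire up as a helper. Below is a minimal Python sketch; the function name and signature are ours, not from Kovalev et al. (2021), and `L`, `mu`, `lam_max`, `lam_min_plus` stand for the smoothness constant, the strong convexity constant, and the spectral bounds \(\lambda _{\max }\), \(\lambda _{\min }^{+}\) of the communication matrices from Assumption 1.

```python
import math

def adom_parameters(L, mu, lam_max, lam_min_plus):
    """Step-size parameters of Algorithm 2 (ADOM) as set in Theorem 3."""
    alpha = 1.0 / (2.0 * L)
    eta = 2.0 * lam_min_plus * math.sqrt(mu * L) / (7.0 * lam_max)
    theta = mu / lam_max
    sigma = 1.0 / lam_max
    tau = (lam_min_plus / (7.0 * lam_max)) * math.sqrt(mu / L)
    return {"alpha": alpha, "eta": eta, "theta": theta,
            "sigma": sigma, "tau": tau}
```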

Algorithm 2 ADOM: Accelerated Decentralized Optimization Method

B Proof of Theorem 1

All the arguments below are carried out under the assumptions of Theorem 1, i.e. we assume that \(S\subset {\mathbb {R}}^d\) is a convex set, \(\textbf{x}\in {\mathcal {S}}\) is equivalent to \([\textbf{x}]_i\in S\) for all \(i=1,\ldots ,m\), the functions \(f^{\gamma }_i:S\rightarrow {\mathbb {R}}\) are \(\gamma\) strongly convex, and the output of Algorithm 1 is \(\textbf{x}^n_{r,\gamma } = \nabla (H^{r,\gamma })^*(\textbf{z}_g^n)\). Denote also

$$\begin{aligned} \textbf{x}_{\gamma }^* = (x^*_{\gamma },\ldots ,x^*_{\gamma }) = \mathop {\mathrm {arg\,min}}\limits \limits _{\textbf{x}\in {\mathcal {S}}} F^{\gamma }(\textbf{x}) = \mathop {\mathrm {arg\,min}}\limits \limits _{x\in S} \sum \limits _{i=1}^mf_i^{\gamma }(x). \end{aligned}$$

1.1 B.1 Derivation of \((H^{r,\gamma })^*\)

In brief, in this subsection we show that functions \(h_i^{r,\gamma }\) from (14) are \(\frac{1}{r}\) smooth, \(\frac{\gamma }{1+r\gamma }\) strongly convex, and such that \(\nabla (H^{r,\gamma })^*\) from Line 3 of Algorithm 1 is the gradient of the conjugate function \((H^{r,\gamma })^*\) of \(H^{r,\gamma } = \sum \nolimits _{i=1}^m h_i^{r,\gamma }\) from (14). Then the consensus condition (4) becomes a corollary of Theorem 3 with \(L = \frac{1}{r}\) and \(\mu = \frac{\gamma }{1+r\gamma }\).

From now on let functions \(h_i^{r,\gamma }:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) and \(H^{r,\gamma }:({\mathbb {R}}^d)^m \rightarrow {\mathbb {R}}\) be

$$\begin{aligned} \begin{array}{cc} H^{r,\gamma }(\textbf{x}) = \sum \limits _{i=1}^{m} h_i^{r,\gamma }([\textbf{x}]_i), \ \text{ where } \\ h_i^{r,\gamma }(x) = \inf \limits _{y\in S}\left\{ f_i^{\gamma }(y) + \frac{1}{2r}\Vert y-x\Vert ^2_2\right\} . \end{array} \end{aligned}$$
(14)

Denote their Fenchel conjugates by \((h_i^{r,\gamma })^*\) and \((H^{r,\gamma })^*\).

Lemma 1

If functions \(h_i^{r,\gamma }\) and \(H^{r,\gamma }\) are defined by (14), then their Fenchel conjugate functions \((h_i^{r,\gamma })^*\) and \((H^{r,\gamma })^*:({\mathbb {R}}^d)^m \rightarrow {\mathbb {R}}\) are

$$\begin{aligned} \begin{array}{cc} (H^{r,\gamma })^*(\textbf{z}) = \sum \limits _{i=1}^{m} (h_i^{r,\gamma })^*([\textbf{z}]_i), \ \text{ where } \\ (h_i^{r,\gamma })^*(z) = (f_i^{\gamma })^*(z) + \frac{r}{2}\Vert z\Vert ^2_2. \end{array} \end{aligned}$$

Moreover, the biconjugate \((H^{r,\gamma })^{**}\) coincides with \(H^{r,\gamma }\).

Proof

The definition (14) is similar to the Moreau–Yosida smoothing, but the subtle point is that the functions \(f_i^{\gamma }\) are defined on a convex set S instead of the whole \({\mathbb {R}}^d\). Let us introduce functions \({\tilde{f}}_i^{\gamma }\) with domain \({\mathbb {R}}^d\) as follows:

$$\begin{aligned} {\tilde{f}}_i^{\gamma }(x)=\left\{ \begin{array}{cc} &{}f_i^{\gamma }(x) \quad \hbox { if}\ x\in S\\ &{} +\infty , \quad \text{ otherwise }. \end{array}\right. \end{aligned}$$
(15)

Such \({\tilde{f}}_i^{\gamma }\) are \(\gamma\) strongly convex as well. Moreover, substituting \({\tilde{f}}_i^{\gamma }\) for \(f_i^{\gamma }\) affects neither the primal \(h_i^{r,\gamma }\):

$$\begin{aligned} h_i^{r,\gamma }(x) = \inf \limits _{y\in S}\left\{ f_i^{\gamma }(y) + \frac{1}{2r}\Vert y-x\Vert ^2_2\right\} = \inf \limits _{y\in {\mathbb {R}}^d}\left\{ {\tilde{f}}_i^{\gamma }(y) + \frac{1}{2r}\Vert y-x\Vert ^2_2\right\} ,\end{aligned}$$

nor \((f_i^{\gamma })^*(z) + \frac{r}{2}\Vert z\Vert ^2_2\):

$$\begin{aligned} (f^{\gamma }_i)^*(z) + \frac{r}{2}\Vert z\Vert ^2_2&=\max \limits _{x \in S}\left\{ \left\langle z, x\right\rangle - f^{\gamma }_i(x)\right\} + \frac{r}{2}\Vert z\Vert ^2_2 \\&=\max \limits _{x \in {\mathbb {R}}^d}\left\{ \left\langle z, x\right\rangle - {\tilde{f}}^{\gamma }_i(x)\right\} + \frac{r}{2}\Vert z\Vert ^2_2 \\&=({\tilde{f}}^{\gamma }_i)^*(z)+ \frac{r}{2}\Vert z\Vert ^2_2. \end{aligned}$$

For each i one can see that \((h_i^{r,\gamma })^*(z) = (f^{\gamma }_i)^*(z) + \frac{r}{2}\Vert z\Vert ^2_2\) is the Fenchel conjugate of \(h_i^{r,\gamma }\) and vice versa. Indeed, for proper, convex and lower semicontinuous \(g_1, g_2:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) we have \((g_1+ g_2)^* = g_1^* \square g_2^*\) and \((g_1 \square g_2)^* = g_1^* + g_2^*\), where \((g_1\square g_2)(x)\) denotes the infimal convolution \(\inf \{g_1(y) + g_2(x-y) \mid y\in {\mathbb {R}}^d \}\); it remains to apply the second identity to \(h_i^{r,\gamma } = {\tilde{f}}_i^{\gamma } \square \frac{1}{2r}\Vert \cdot \Vert ^2_2\) and note that \(\left( \frac{1}{2r}\Vert \cdot \Vert ^2_2\right) ^* = \frac{r}{2}\Vert \cdot \Vert ^2_2\).

Hence the Fenchel conjugate of the function \(H^{r,\gamma }\) is

$$\begin{aligned}{} & {} \sup \limits _{\textbf{x}\in ({\mathbb {R}}^d)^m} \left\{ \langle \textbf{z},\textbf{x}\rangle - H^{r,\gamma }(\textbf{x})\right\} \nonumber \\{} & {} \quad = \sup \limits _{\textbf{x}\in ({\mathbb {R}}^d)^m} \left\{ \sum \limits _{i=1}^{m} \left( \langle [\textbf{z}]_i,[\textbf{x}]_i \rangle - h_{i}^{r,\gamma }([\textbf{x}]_i)\right) \right\} \nonumber \\{} & {} \quad =\sum \limits _{i=1}^{m} \sup \limits _{[\textbf{x}]_i\in {\mathbb {R}}^d} \left\{ \langle [\textbf{z}]_i,[\textbf{x}]_i \rangle - h_{i}^{r,\gamma }([\textbf{x}]_i)\right\} \nonumber \\{} & {} \quad = \sum \limits _{i=1}^{m} (h_{i}^{r,\gamma })^*([\textbf{z}]_i) = (H^{r,\gamma })^*(\textbf{z}). \end{aligned}$$
(16)

In the same way one can see that \(H^{r,\gamma }\) and \((H^{r,\gamma })^{**}\) coincide. \(\square\)

Remark 3

For each i the function \(\left( h^{r,\gamma }_{i}\right) ^*\) from (14) is \(\left( \frac{1}{\gamma }+r\right)\) smooth and r strongly convex by definition, so we have \(h^{r,\gamma }_i = (h^{r,\gamma }_i)^{**}\) being \(\frac{1}{r}\) smooth and \(\frac{\gamma }{1+r\gamma }\) strongly convex. In addition

$$\begin{aligned}\nabla (h^{r,\gamma }_i)^*(z) = \nabla (f_i^{\gamma })^*(z) + rz \end{aligned}$$

as stated in Line 3 of Algorithm 1. Then we can apply Algorithm 2 for \(L = r^{-1}\) smooth and \(\mu = \frac{\gamma }{1+r\gamma }\) strongly convex functions \(h^{r,\gamma }_i\) and get the values of \(\nabla (h^{r,\gamma }_i)^*(z)\) as output.
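In code, Line 3 of Algorithm 1 therefore reduces to calling an oracle for \(\nabla (f_i^{\gamma })^*\). A minimal sketch under that assumption (`grad_f_conj` is a hypothetical user-supplied oracle; for the barycenter problem it is given in closed form by (34) below):

```python
import numpy as np

def grad_h_conj(z, r, grad_f_conj):
    """Gradient of (h_i^{r,gamma})^* at z, i.e. nabla(f_i^gamma)^*(z) + r*z,
    which is exactly the dual gradient that Algorithm 2 queries at each node."""
    z = np.asarray(z, dtype=float)
    return grad_f_conj(z) + r * z
```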

Thus we construct a relaxation \(\min _{\textbf{x}\in {\mathcal {R}}}H^{r,\gamma }(\textbf{x})\) of the constrained convex optimization problem \(\min _{\textbf{x}\in {\mathcal {S}}}F^{\gamma }(\textbf{x})\).

Corollary 2

Let a function \(H^{r,\gamma }\) be defined in (14) and let \(\textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{\textbf{x}\in {\mathcal {R}}}H^{r,\gamma }(\textbf{x})\). Then applying Algorithm 2 for

$$\begin{aligned}\nabla (h^{r,\gamma }_i)^*(z) = \nabla (f_i^{\gamma })^*(z) + rz\end{aligned}$$

we get by Theorem 3

$$\begin{aligned} \left\| \textbf{x}^{*}_{r,\gamma } - \textbf{x}^{n}_{r,\gamma }\right\| ^2_2\le C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n, \end{aligned}$$
(17)

where \(\textbf{x}_{r,\gamma }^n = \nabla (H^{r,\gamma })^*(\textbf{z}_g^n)\) and

$$\begin{aligned}C = \frac{(1+r\gamma )^2}{2\gamma ^2}.\end{aligned}$$

Moreover, since \(\textbf{x}^*_{r,\gamma } \in {\mathcal {R}}\), i.e. \([\textbf{x}^*_{r,\gamma }]_i = [\textbf{x}^*_{r,\gamma }]_j\) for all i and j, the consensus condition is approximated as follows

$$\begin{aligned} \left\| \left[ \textbf{x}^{n}_{r,\gamma }\right] _i - \left[ \textbf{x}^{n}_{r,\gamma }\right] _j\right\| ^2_2\le 2C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n. \end{aligned}$$

1.2 B.2 Value bounds on \(H^{r,\gamma }\)

Although we defined \(h_i^{r,\gamma }\) on the whole \({\mathbb {R}}^d\), some properties hold true on the initial set S only.

Lemma 2

Let functions \(h_i^{r,\gamma }\) be defined in (14). If \(x\in S\), then for any \(r>0\), for each \(i = 1,\ldots ,m\) we have

$$\begin{aligned} \begin{array}{rl} f^{\gamma }_i(x) - \frac{r}{2(1+r\gamma )}\left\| \nabla f^{\gamma }_i(x)\right\| ^2_2 \le h_i^{r,\gamma }(x) \le f^{\gamma }_i(x). \end{array} \end{aligned}$$
(18)

Proof

The second inequality directly follows from the definition (14). To prove the first one we recall that \(f^{\gamma }_i\) is \(\gamma\) strongly convex and the following holds:

$$\begin{aligned} h_i^{r,\gamma }(x)= & {} \inf \limits _{y \in S} \left\{ f^{\gamma }_i(y) + (2r)^{-1}\Vert x-y\Vert ^2_2\right\} \\= & {} \inf \limits _{y: \ (x-y)\in S} \left\{ f^{\gamma }_i(x-y) + (2r)^{-1}\Vert y\Vert ^2_2\right\} \\\ge & {} \inf \limits _{y: \ (x-y)\in S} \left\{ f^{\gamma }_i(x) + \langle \nabla f^{\gamma }_i(x), -y \rangle + \gamma /2 \Vert y\Vert ^2_2 + (2r)^{-1}\Vert y\Vert ^2_2\right\} \\\ge & {} \inf \limits _{y\in {\mathbb {R}}^d} \left\{ f^{\gamma }_i(x) + \langle \nabla f^{\gamma }_i(x), -y \rangle + \gamma /2 \Vert y\Vert ^2_2 + (2r)^{-1}\Vert y\Vert ^2_2\right\} , \end{aligned}$$

which attains its minimum at \(y=\frac{r}{1+r\gamma }\nabla f^{\gamma }_i(x)\) and hence equals

$$\begin{aligned}{} & {} f^{\gamma }_i(x) + \frac{r}{1+r\gamma } \langle -\nabla f^{\gamma }_i(x), \nabla f^{\gamma }_i(x) \rangle + \frac{r}{2(1+r\gamma )}\Vert \nabla f^{\gamma }_i(x)\Vert ^2_2 \\{} & {} \quad \quad \quad \quad =f^{\gamma }_i(x) - \frac{r}{2(1+r\gamma )}\Vert \nabla f^{\gamma }_i(x)\Vert ^2_2. \end{aligned}$$

\(\square\)
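As a quick numeric sanity check of (18) (not part of the proof), take \(S = {\mathbb {R}}^d\) and the quadratic \(f^{\gamma }(x) = \frac{\gamma }{2}\Vert x\Vert ^2_2\), for which the lower bound in (18) is attained exactly:

```python
import numpy as np
from scipy.optimize import minimize

gamma, r = 0.5, 0.2
x = np.random.default_rng(0).standard_normal(5)

f = lambda y: 0.5 * gamma * y @ y              # f^gamma, gamma-strongly convex
grad_norm_sq = (gamma * x) @ (gamma * x)       # ||nabla f^gamma(x)||^2

# Envelope h^{r,gamma}(x) from (14), computed numerically over y in R^d.
h = minimize(lambda y: f(y) + (y - x) @ (y - x) / (2 * r), x).fun

lower = f(x) - r / (2 * (1 + r * gamma)) * grad_norm_sq
assert lower - 1e-6 <= h <= f(x) + 1e-6
# Closed form for this f: h^{r,gamma}(x) = gamma*||x||^2 / (2*(1 + r*gamma)),
# which coincides with the lower bound of (18).
```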

1.3 B.3 Convergence in argument

Lemma 3 shows convergence in argument in the following sense: if the regularization parameter r tends to zero, the argminimum \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) of \(H^{r,\gamma }\) tends to the argminimum \(\textbf{x}^*_{\gamma }\in {\mathcal {S}}\) of \(F^{\gamma }\). By Corollary 2, \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) is approximated by \(\textbf{x}^n_{r,\gamma }\in ({\mathbb {R}}^d )^m\) after a sufficient number of iterations n.

Lemma 3

Let \(\textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\limits _{\textbf{x}\in {\mathcal {R}}} H^{r,\gamma }(\textbf{x})\) for \(H^{r,\gamma }\) defined in (14). Let

$$\begin{aligned} \Vert \nabla F^{\gamma }(\textbf{x})\Vert _2^2 \le m K_{\zeta }^2 \quad \forall \textbf{x}\in \{\textbf{y}\in {\mathcal {S}}\mid \Vert \textbf{y}-\textbf{x}^*_{\gamma }\Vert _2\le \zeta \}. \end{aligned}$$
(19)

If r is such that \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\), then

$$\begin{aligned} \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \sqrt{\frac{rm}{2\gamma }}K_{\zeta }. \end{aligned}$$
(20)

Proof

Using (18) and strong convexity of \(F^{\gamma }\) and \(H^{r,\gamma }\) we have

$$\begin{aligned} F^{\gamma } (\textbf{x}^*_{\gamma }){} & {} \ge H^{r,\gamma }(\textbf{x}^*_{\gamma }) = \sum \limits _{i=1}^m h_i^{r,\gamma }([\textbf{x}^*_{\gamma }]_i) \\{} & {} \ge \sum \limits _{i=1}^m \left( h_i^{r,\gamma }([\textbf{x}^*_{r,\gamma }]_i) + \frac{\gamma }{2(1+r\gamma )}\Vert [\textbf{x}^*_{r,\gamma }]_i - [\textbf{x}^*_{\gamma }]_i\Vert _2^2\right) \\{} & {} \quad = H^{r,\gamma }(\textbf{x}^*_{r,\gamma }) + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{r,\gamma }) - \frac{r}{2(1+r\gamma )}\Vert \nabla F^{\gamma }(\textbf{x}^*_{r,\gamma })\Vert ^2_2 + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{r,\gamma }) - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2 + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{\gamma }) + \gamma /2 \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2 + \frac{\gamma }{2(1+r\gamma )}\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 \\{} & {} \ge F^{\gamma }(\textbf{x}^*_{\gamma }) + \frac{\gamma }{1+r\gamma } \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2. \end{aligned}$$

Then \(\frac{\gamma }{1+r\gamma } \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2\le 0\) and hence \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert ^2_2 \le \frac{rm}{2\gamma }K^2_{\zeta }\). \(\square\)

Combining Lemma 3 with Corollary 2 we get the following.

Remark 4

Let \(\zeta >0\) and let \(K_{\zeta }\) be such that (19) holds. If

$$\begin{aligned} \sqrt{\frac{rm}{2\gamma }}K_{\zeta }+\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \zeta , \end{aligned}$$

where \(C_1 = \frac{(1+r\gamma )^2}{2\gamma ^2}\), then both \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\) and \(\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\) hold.

1.4 B.4 Value approximation

Let \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) be the unique argminimum of \(H^{r,\gamma }\) on the consensus space \({\mathcal {R}}\), i.e.

$$\begin{aligned} \begin{array}{cc} \textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\limits \limits _{\textbf{x}\in {\mathcal {R}}} H^{r,\gamma }(\textbf{x}). \end{array} \end{aligned}$$
(21)

In order to prove the value approximation (5) let us separate it into parts and estimate each of them:

$$\begin{aligned}{} & {} F^{\gamma }(\textbf{x}^{n}_{r,\gamma }) - F^{\gamma }(\textbf{x}^*_{\gamma }) \end{aligned}$$
(22a)
$$\begin{aligned}{} & {} \quad \le F^{\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) \end{aligned}$$
(22b)
$$\begin{aligned}{} & {} \quad + H^{r,\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{*}_{r,\gamma }) \end{aligned}$$
(22c)
$$\begin{aligned}{} & {} \quad + H^{r,\gamma }(\textbf{x}^{*}_{r,\gamma }) - F^{\gamma }(\textbf{x}^{*}_{\gamma }). \end{aligned}$$
(22d)

The last addend is non-positive and can be dropped:

$$\begin{aligned} H^{r,\gamma }(\textbf{x}^{*}_{r,\gamma }) - F^{\gamma }(\textbf{x}^{*}_{\gamma }) \le H^{r,\gamma }(\textbf{x}^{*}_{\gamma }) - F^{\gamma }(\textbf{x}^{*}_{\gamma }) \le 0. \end{aligned}$$

The rest are estimated in Lemmas 4 and 5 under additional assumptions.

Lemma 4

Let \(\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\). If (19) holds, then

$$\begin{aligned} F^{\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) \le \frac{r}{2(1+r\gamma )}mK^2_{\zeta }. \end{aligned}$$
(23)

Proof

We cannot take a uniform K in place of \(K_{\zeta }\) because \(F^{\gamma }\) is not smooth. Nonetheless, assuming that \(\textbf{x}^{n}_{r,\gamma }\) belongs to the \(\zeta\)-neighborhood of \(\textbf{x}^*_{\gamma }\), we immediately obtain from (18) and (19) that

$$\begin{aligned} F^{\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) \le \frac{r}{2(1+r\gamma )}\Vert \nabla F^{\gamma }(\textbf{x}^{n}_{r,\gamma })\Vert ^2_2 \le \frac{r}{2(1+r\gamma )}mK^2_{\zeta }. \end{aligned}$$

\(\square\)

Lemma 5

Let (19) hold. Then

$$\begin{aligned} H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^*_{r,\gamma })&\le C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}, \\ \text{ where } \quad C_2&= \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{m(1+r\gamma )^2}{4r\gamma ^2}. \end{aligned}$$

Proof

By \(\frac{m}{r}\) smoothness of \(H^{r,\gamma }\)

$$\begin{aligned} \begin{array}{rl} H^{r,\gamma }(\textbf{x}^{n}_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^*_{r,\gamma }) &{}\le \langle \nabla H^{r,\gamma }\left( \textbf{x}^*_{r,\gamma }\right) ,\textbf{x}^{n}_{r,\gamma } - \textbf{x}^{*}_{r,\gamma }\rangle + \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma } - \textbf{x}^*_{r,\gamma }\Vert ^2_2 \\ &{}\le \langle \nabla H^{r,\gamma }\left( \nabla (H^{r,\gamma })^*(\textbf{z}_g^{\infty })\right) ,\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) - \textbf{x}^{*}_{r,\gamma }\rangle \\ &{}\quad + \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma } - \textbf{x}^*_{r,\gamma }\Vert ^2_2 \\ &{}\le \langle \textbf{z}_g^{\infty }, \nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) - \textbf{x}^{*}_{r,\gamma }\rangle + \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma } - \textbf{x}^*_{r,\gamma }\Vert ^2_2, \end{array} \end{aligned}$$

where \(\textbf{z}^{\infty }_{g}\) is the limit of \(\textbf{z}_g^n\) and so it is the argminimum of \((H^{r,\gamma })^*\) on \({\mathcal {R}}^{\perp }\). By (17) we have

$$\begin{aligned}{} & {} \frac{m}{2r}\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{r,\gamma }\Vert ^2_2 \le \frac{m}{2r}C_1\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n \\{} & {} \quad = \frac{m(1+r\gamma )^2}{4r\gamma ^2}\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n\end{aligned}$$

Let us introduce the orthogonal projection matrix \(\textbf{P}\) onto the subspace \({\mathcal {R}}^{\perp }\), i.e., \(\textbf{P}v = \mathop {\mathrm {arg\,min}}\limits _{z \in {\mathcal {R}}^{\perp }} \Vert v - z\Vert _2\) for an arbitrary \(v \in ({\mathbb {R}}^d)^m\). Then the matrix \(\textbf{P}\) is

$$\begin{aligned} \textbf{P}= \left( \textbf{I}_m - \frac{1}{m}\textbf{1}_m\textbf{1}_m^{\top }\right) \otimes \textbf{I}_d, \end{aligned}$$
(24)

where \(\textbf{I}_m\) denotes the \(m\times m\) identity matrix, \(\textbf{1}_m = (1,\ldots ,1)\in {\mathbb {R}}^m\), and \(\otimes\) is the Kronecker product. Note that \(\textbf{P}^{\top }\textbf{P}= \textbf{P}\).
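A direct numpy construction of (24), together with a check of the projector identity (a small illustration, not used by the algorithm itself):

```python
import numpy as np

def consensus_projector(m, d):
    """Orthogonal projector (24) onto R^perp for m agents with states in R^d."""
    P = np.kron(np.eye(m) - np.ones((m, m)) / m, np.eye(d))
    assert np.allclose(P.T @ P, P)   # P is symmetric and idempotent
    return P

# P subtracts the across-agent mean block from each agent's block.
P = consensus_projector(m=3, d=2)
v = np.arange(6, dtype=float)
assert np.allclose(P @ v, v - np.tile(v.reshape(3, 2).mean(axis=0), 3))
```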

Since \(\textbf{z}_g^{\infty }\in {\mathcal {R}}^{\perp }\) and \(\textbf{x}^{*}_{r,\gamma }\in {\mathcal {R}}\), the first part simplifies to \(\langle \textbf{z}_g^{\infty }, \textbf{P}\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) \rangle\). We may use Lemma 2 in Kovalev et al. (2021) to get the following estimate

$$\begin{aligned} \begin{array}{rl} \Vert \textbf{P}\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n})\Vert _2^2 = \Vert \nabla (H^{r,\gamma })^*(\textbf{z}_g^{n})\Vert _{\textbf{P}}^2 \le \frac{2}{\theta \lambda _{\min }^{+}}\left( (H^{r,\gamma })^*(\textbf{z}^n_g) - (H^{r,\gamma })^*(\textbf{z}^{n+1}_f)\right) . \end{array} \end{aligned}$$

As \(\textbf{z}^{n+1}_f\) is a non-optimal point of Algorithm 1, this is not greater than

$$\begin{aligned}{} & {} \frac{2}{\theta \lambda _{\min }^{+}}\left( (H^{r,\gamma })^*(\textbf{z}^n_g) - (H^{r,\gamma })^*(\textbf{z}^*)\right) \\{} & {} \quad \le \frac{m(1+r\gamma )}{\gamma \theta \lambda _{\min }^{+}}\left\| \textbf{z}^n_g - \textbf{z}^*\right\| _2^2 = \frac{m(1+r\gamma )^2}{\gamma ^2}\frac{\lambda _{\max }}{\lambda _{\min }^{+}}\left\| \textbf{z}^n_g - \textbf{z}^*\right\| _2^2 \\{} & {} \quad \le \frac{m(1+r\gamma )^2}{2\gamma ^2}\frac{\lambda _{\max }}{\lambda _{\min }^{+}} \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n \end{aligned}$$

and the latter inequalities follow from the \(\frac{m(1+r\gamma )}{\gamma }\) smoothness of \((H^{r,\gamma })^*\) and from the fact that the proof of (Kovalev et al. 2021, Theorem 1) actually covers the following chain of inequalities:

$$\begin{aligned} \left\| \nabla H^*(\textbf{z}_g^n) - \textbf{x}^*\right\| ^2_2 \le \frac{1}{\mu ^2}\left\| \textbf{z}^n_g - \textbf{z}^* \right\| ^2_2 \le C \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\right) ^n = \frac{1}{2\mu ^2} \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\right) ^n. \end{aligned}$$

By our assumption \(\Vert \textbf{z}^{\infty }_{g}\Vert _2 = \Vert \nabla H^{r,\gamma }(\textbf{x}_{r,\gamma }^*)\Vert _2< \sqrt{m}K_{\zeta }\). Thus, we obtain

$$\begin{aligned}{} & {} H^{r,\gamma }(\textbf{x}^n_{r,\gamma }) - H^{r,\gamma }(\textbf{x}^*_{r,\gamma }) \\{} & {} \quad \le \sqrt{m}K_{\zeta }\frac{\sqrt{m}(1+r\gamma )}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2} \\{} & {} \qquad + \frac{m(1+r\gamma )^2}{4r\gamma ^2}\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^n \\{} & {} \quad \le \left( \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{m(1+r\gamma )^2}{4r\gamma ^2}\right) \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2} \\{} & {} \quad = C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}. \end{aligned}$$

\(\square\)

1.5 B.5 Final compilation

This section completes the proof of Theorem 1 and shows Remark 1.

Recall that \(C_1 = \frac{(1+r\gamma )^2}{2\gamma ^2}\) and

$$\begin{aligned} C_2 = \frac{m}{2r}C_1 + \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} = \frac{m(1+r\gamma )^2}{4r\gamma ^2} + \frac{m(1+r\gamma )K_{\zeta }}{\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}}. \end{aligned}$$

By Remark 4 and Lemmas 4, 5 we see that \(F^{\gamma }(\textbf{x}^{n}_{r,\gamma }) - F^{\gamma }(\textbf{x}^*_{\gamma })< \varepsilon\) if

$$\begin{aligned} \forall \textbf{x}\in \{\textbf{y}\in {\mathcal {S}}\mid \Vert \textbf{y}-\textbf{x}_{\gamma }^*\Vert _2<\zeta \} \qquad \Vert \nabla F^{\gamma }(\textbf{x})\Vert _2^2&<mK^2_{\zeta }, \end{aligned}$$
(25)
$$\begin{aligned} \sqrt{\frac{rm}{2\gamma }}K_{\zeta }+\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}&\le \zeta , \end{aligned}$$
(26)
$$\begin{aligned} \frac{r}{2(1+r\gamma )}mK^2_{\zeta }&\le \varepsilon /2, \end{aligned}$$
(27)
$$\begin{aligned} C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}&\le \varepsilon /2. \end{aligned}$$
(28)

Let \(\zeta =\sqrt{\varepsilon / \gamma }\) and let \(r\le \frac{\varepsilon }{2mK^2_{\zeta }}\). Then (27) holds. If (28) is fulfilled, then (26) follows from (27) and (28), as \(\sqrt{\frac{rm}{2\gamma }}K_{\zeta }\le \sqrt{\frac{\varepsilon }{4\gamma }} = \zeta /2\) and \(\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \zeta /2\) since \(1\le \sqrt{C_1}\le C_1\le C_2\) and \(\varepsilon \le \sqrt{\varepsilon /\gamma } = \zeta\). Thus, it suffices to assume

$$\begin{aligned}&\forall i \quad \forall x\in \lbrace y\in S \mid \Vert y-x_{\gamma }^*\Vert _2^2\le \varepsilon / \gamma \rbrace \qquad \Vert \nabla f_i^{\gamma }(x)\Vert _2 \le K, \\&\quad r\le \frac{\varepsilon }{2mK^2}, \\&\quad C_2 \left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2} \le \varepsilon /2. \end{aligned}$$

So an \(\varepsilon\)-approximation requires a number of iterations

$$\begin{aligned} {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\sqrt{\frac{1+r\gamma }{r\gamma }}\ln \frac{C_2}{\varepsilon }\right) = {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\frac{1}{\sqrt{\gamma \varepsilon }}\ln \frac{1}{\varepsilon }\right) . \end{aligned}$$
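For reference, the smallest n satisfying (28) can be computed directly from the contraction factor; a minimal sketch (the function name and signature are ours):

```python
import math

def iterations_for_accuracy(lam_max, lam_min_plus, r, gamma, C2, eps):
    """Smallest n with C2 * (1 - rho)**(n / 2) <= eps / 2, rho as in (28)."""
    rho = (lam_min_plus / (7.0 * lam_max)) * math.sqrt(r * gamma / (1.0 + r * gamma))
    return math.ceil(2.0 * math.log(2.0 * C2 / eps) / (-math.log(1.0 - rho)))
```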

C Proof of Theorem 2

To prove Theorem 2 we combine Theorem 1, proved above, with properties of the entropic regularization of the Wasserstein barycenter problem.

1.1 C.1 Entropy regularized WB problem

Recall that for a fixed cost matrix M we define the set of transport plans as

$$\begin{aligned} U(p,q) := \left\{ X \in {\mathbb {R}}_+^{d \times d} \mid X \textbf{1}= p, X^T\textbf{1}= q \right\} \end{aligned}$$

and the Wasserstein distance between two probability distributions p and q as

$$\begin{aligned}{\mathcal {W}}(p,q):= \min _{X \in U(p,q)} \langle M, X \rangle . \end{aligned}$$

The entropy regularized (or smoothed) Wasserstein distance is defined as

$$\begin{aligned} {\mathcal {W}}_{\gamma } (p,q) := \min _{X \in U(p,q)} \left\{ \langle M, X \rangle - \gamma E(X)\right\} , \end{aligned}$$
(29)

where \(\gamma >0\) and

$$\begin{aligned} E(X) := - \sum _{i=1}^{d} \sum _{j=1}^{d} e(X_{ij}), \nonumber \\ \text {where } e(x)=\left\{ \begin{array}{ll} x\ln x \quad &{} \text{ if } x>0 \\ 0 \quad &{} \text{ if } x=0. \end{array}\right. \end{aligned}$$
(30)

So the regularized problem seeks to minimize the transportation cost while maximizing the entropy. Moreover, \({\mathcal {W}}_{\gamma }(p,q)\rightarrow {\mathcal {W}}(p,q)\) as \(\gamma \rightarrow 0\).
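Numerically, \({\mathcal {W}}_{\gamma }\) can be evaluated with the POT library (Flamary et al. 2021) cited above. A minimal sketch with placeholder data; note that `ot.sinkhorn` returns the entropic optimal plan, from which we assemble the objective value of (29) ourselves:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al. 2021)

rng = np.random.default_rng(0)
d = 4
M = rng.random((d, d))
np.fill_diagonal(M, 0.0)                        # cost matrix with zero diagonal
p = np.full(d, 1.0 / d)
q = rng.random(d)
q /= q.sum()

gamma = 0.05
X = ot.sinkhorn(p, q, M, reg=gamma)             # plan with X @ 1 = p, X.T @ 1 = q
entropy = -np.sum(X[X > 0] * np.log(X[X > 0]))  # E(X) from (30)
W_gamma = np.sum(M * X) - gamma * entropy       # objective of (29) at the optimum
```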

Then the convex optimization problem (7) can be relaxed to the following \(\gamma\) strongly convex optimization problem

$$\begin{aligned} \min _{p \in S_1(d)}\sum \limits _{i=1}^{m} {\mathcal {W}}_{\gamma ,q_i}(p), \end{aligned}$$
(31)

where \({\mathcal {W}}_{\gamma ,q_i}(p) = {\mathcal {W}}_{\gamma }(q_i, p)\). The argminimum of (31) is called the uniform Wasserstein barycenter (Agueh and Carlier 2011; Cuturi and Doucet 2014) of the family of \(q_1,\ldots , q_m\). Moreover, problem (31) admits a unique solution and approximates the unregularized WB problem as follows.

Remark 5

Let \(\gamma \le \frac{\varepsilon }{4\ln d}\). If vectors \(\hat{p}_i\in S_1(d)\) are such that

$$\begin{aligned}\sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}(\hat{p}_i) - \min \limits _{p\in S_1(d)}\sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}(p) \le \frac{\varepsilon }{2},\end{aligned}$$

then

$$\begin{aligned}\sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(\hat{p}_i) - \min \limits _{p\in S_1(d)}\sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(p)\le \varepsilon .\end{aligned}$$

Indeed, as entropy is bounded we have \({\mathcal {W}}_{q_i}(p)\le {\mathcal {W}}_{\gamma , q_i}(p)\le {\mathcal {W}}_{q_i}(p) + 2\gamma \ln d\) for all i and p. Then, for \(p^* = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{p\in S_1(d)}\sum \nolimits _{i=1}^{m}{\mathcal {W}}_{q_i}(p)\) and \(p^*_{\gamma } = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{p\in S_1(d)}\sum \nolimits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p)\) it holds that

$$\begin{aligned}&\sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(\hat{p}_i) - \sum \limits _{i=1}^{m}{\mathcal {W}}_{q_i}(p^*) \\&\quad \le \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(\hat{p}_i) - \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p^*) + 2\gamma \ln d \\&\quad \le \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(\hat{p}_i) - \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p^*_{\gamma }) + \frac{\varepsilon }{2} \le \varepsilon . \end{aligned}$$
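In practice, Remark 5 fixes the smoothing level directly from the target accuracy; a one-line helper (a sketch; the name is ours):

```python
import math

def entropic_gamma(eps, d):
    """gamma <= eps / (4 * ln d), so that 2 * gamma * ln d <= eps / 2 (Remark 5)."""
    return eps / (4.0 * math.log(d))
```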

1.2 C.2 Legendre transforms

One particular advantage of entropy regularization of the Wasserstein distance is that it yields closed-form representations for the dual function \({\mathcal {W}}^*_{\gamma , q}(\cdot )\) and for its gradient. Recall that the Fenchel-Legendre transform of (29) is defined as

$$\begin{aligned} {\mathcal {W}}^*_{\gamma ,q}(z)&:= \max _{p \in S_1(d)}\left\{ \left\langle z,p\right\rangle - { {\mathcal {W}}_{\gamma ,q} (p)}\right\} . \end{aligned}$$
(32)

Theorem 4

(Cuturi and Peyré 2015, Theorem 2.4) For \(\gamma >0\), the Fenchel-Legendre dual function \({\mathcal {W}}^*_{\gamma ,q}(z)\) is differentiable and given by

$$\begin{aligned} \begin{array}{rl} {\mathcal {W}}^*_{\gamma ,q}(z) &{} = \gamma \left( E(q) + \left\langle q, \ln {\mathcal {K}}\alpha \right\rangle \right) \\ {} &{}= - \gamma \left\langle q,\ln q \right\rangle + \gamma \sum \limits _{j=1}^{d} [q]_j \ln \left( \sum \limits _{i=1}^{d}\exp \left( \frac{1}{\gamma }\left( [z]_i-M_{ji}\right) \right) \right) \end{array} \end{aligned}$$
(33)

and its gradient \(\nabla {\mathcal {W}}^*_{\gamma ,q}(z)\) is \(1/\gamma\)-Lipschitz in the 2-norm with

$$\begin{aligned} \begin{array}{rl} \nabla {\mathcal {W}}^*_{\gamma ,q}(z) &{}= \alpha \circ \left( {\mathcal {K}}\cdot {q}/({{\mathcal {K}}\alpha })\right) \in S_1(d), \\ \left[ \nabla {\mathcal {W}}^*_{\gamma ,q}(z) \right] _l &{} =\sum \limits _{j=1}^{d} [q]_j \frac{\exp \left( \frac{1}{\gamma }([z]_l - M_{lj})\right) }{\sum \limits _{i=1}^{d}\exp \left( \frac{1}{\gamma }([z]_i - M_{ij})\right) }, \end{array} \end{aligned}$$
(34)

where \(z \in {\mathbb {R}}^d\) and for brevity we denote \(\alpha = \exp ( {z}/{\gamma })\) and \({\mathcal {K}}= \exp \left( {-M}/{\gamma }\right)\).
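Since \(\nabla {\mathcal {W}}^*_{\gamma ,q}\) is exactly the dual gradient oracle that Algorithm 1 calls on each node, it is worth writing (34) out. A minimal sketch (function name ours) that evaluates the column-wise softmax in the log domain for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def grad_W_conj(z, q, M, gamma):
    """Gradient (34) of the Fenchel-Legendre dual of W_{gamma,q}.

    logits[l, j] = (z_l - M_{lj}) / gamma; a softmax over l (per column j)
    averaged over columns with weights q_j gives a point of the simplex S_1(d).
    """
    logits = (z[:, None] - M) / gamma
    log_softmax = logits - logsumexp(logits, axis=0, keepdims=True)
    return np.exp(log_softmax) @ q

# The output is a probability vector: each column of the softmax sums to one,
# and q itself lies in the simplex.
```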

Notice that, to get back and obtain the approximate barycenter, we can employ the following result (with \(\lambda _i = 1\)).

Theorem 5

(Cuturi and Peyré (2015), Theorem 3.1) The barycenter \(p^*\) solving (31) satisfies

$$\begin{aligned} \forall i=1,\ldots ,m \qquad p^* =\nabla {\mathcal {W}}^*_{\gamma ,q_i}(z^*_i), \end{aligned}$$

where the set of \(z^*_i\) constitutes any solution of the smoothed dual WB problem:

$$\begin{aligned} \min \limits _{z_1,\ldots ,z_m\in {\mathbb {R}}^d} \sum \limits _{i=1}^{m}\lambda _i{\mathcal {W}}^*_{\gamma ,q_i}(z_i) \quad \text{ s.t. } \quad \sum \limits _{i=1}^{m}\lambda _i z_i = 0. \end{aligned}$$

Thus we can apply Theorem 1 for the problem (31) with explicitly defined \(\nabla {\mathcal {W}}^*_{\gamma , q_i}\) and obtain \(\textbf{x}^n_{r,\gamma }\) that satisfies

$$\begin{aligned}&\sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}([\textbf{x}^n_{r,\gamma }]_i) - \min \limits _{p\in S_1(d)} \sum \limits _{i=1}^{m}{\mathcal {W}}_{\gamma ,q_i}(p) \\&\quad \le \frac{r}{4(1+r\gamma )}mK^2 + \frac{1}{2}C_2\left( 1 - \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \varepsilon /2. \end{aligned}$$

By Remark 5 this proves

$$\begin{aligned}&\left| \sum \limits _{i=1}^m {\mathcal {W}}_{q_i}([\textbf{x}^{n}_{r,\gamma }]_i) - \sum \limits _{i=1}^m {\mathcal {W}}_{q_i}([\textbf{p}^*]_i)\right| \\&\quad \le 2\gamma \ln d + \frac{r}{4(1+r\gamma )}mK^2 + C \left( 1- \frac{\lambda _{\min }^{+}}{7\lambda _{\max }} \sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \varepsilon , \end{aligned}$$

for \(C=\frac{1}{2}C_2 = \frac{(1+r\gamma )mK_{\zeta }}{2\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{m(1+r\gamma )^2}{8r\gamma ^2}\).

1.3 C.3 Parameter estimation

It remains to assign \(\zeta >0\) and \(K = K_{\zeta }\) satisfying (25). Due to Assumption 2 such \(\zeta\) and K exist.

Proposition 1

Let a set \(\{q_i\}_{i=1}^m\) satisfy Assumption 2, let \(p^*_{\gamma }\) be the uniform Wasserstein barycenter of \(\{q_i\}_{i=1}^m\), and let \(\zeta \in \left( 0, \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\right)\). For each \(i = 1,\ldots ,m\) the norm of the gradient \(\Vert \nabla {\mathcal {W}}_{\gamma ,q_i}(\cdot )\Vert _2^2\) is uniformly bounded over \(\{p\in S_1(d)\mid \Vert p-p^*_{\gamma } \Vert _2^2 \le \zeta \};\) the bound \(K_{\rho }\) is given in (35) for \(\rho \le \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}-\zeta .\)

We obtain Proposition 1 by combining Lemma 6 from Bigot et al. (2019) with Lemma 7, which is proved below.

Lemma 6

(Bigot et al. (2019), Lemma 3.5) For any \(\rho \in (0,1)\), \(q\in S_1(d)\), and \(p\in \{x\in S_1(d)\mid \min _l x_l\ge \rho \}\) there is a bound: \(\Vert \nabla {\mathcal {W}}_{\gamma , q}(p) \Vert ^2_2\le K_{\rho }\), where

$$\begin{aligned} K_{\rho } = \sum \limits _{j=1}^{d}\left( 2\gamma \ln d + \inf _i\sup _l |M_{jl} - M_{il}| - \gamma \ln \rho \right) ^2. \end{aligned}$$
(35)

Lemma 7

Let a set \(\{q_i\}_{i=1}^m\) satisfy Assumption 2 and let \(p^*_{\gamma }\) be the uniform Wasserstein barycenter of \(\{q_i\}_{i=1}^m\). Then all components k of \(p^*_{\gamma }\) have a uniform positive lower bound: \([p^*_{\gamma }]_k\ge \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\).

Proof

Let \(X^*_i\) denote the optimal transport plan between \(p^*_{\gamma }\) and \(q_i\). Assume the contrary: there is k such that \([p^*_{\gamma }]_k < \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\). Then there is another component n such that \([p^*_{\gamma }]_n>\min _i [q_i]_n \ge \min _{i,l} [q_i]_l\). Consider the vector p that coincides with \(p^*_{\gamma }\) except for the components \([p]_k = [p^*_{\gamma }]_k+\delta\) and \([p]_n = [p^*_{\gamma }]_n-\delta\), where \(\delta >0\) is less than the smallest entry \(\min _{i,a,b}[X_i^*]_{a,b}\) of the optimal transport plans \(X^*_i\). Because of the entropy term, all entries of these optimal plans are strictly positive, so such a \(\delta\) exists.

Construct now non-optimal transport plans between p and each of the \(q_i\) in order to get a contradiction with the optimality of \(p^*_{\gamma }\). Initially we have \({\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma }) = \langle M, X^*_i \rangle - \gamma E(X^*_i)\). Consider the matrix \(X_i\) that differs from \(X^*_i\) only at four elements:

$$\begin{aligned}{}[X_i]_{kk} = [X_i^*]_{kk} +\frac{1}{2}\delta , \quad [X_i]_{kn} = [X_i^*]_{kn} +\frac{1}{2}\delta , \\ [X_i]_{nn} = [X_i^*]_{nn} -\frac{1}{2}\delta , \quad [X_i]_{nk} = [X_i^*]_{nk} -\frac{1}{2}\delta . \end{aligned}$$

Then \(X_i\) is a transport plan between p and \(q_i\): its elements remain positive by the choice of \(\delta\), and \(X_i \textbf{1}= p\), \(X_i^{\top }\textbf{1}= q_i\). Using the monotonicity of \(e(x)=x\ln x\) on the interval \((0,\frac{1}{e})\) and the assumptions that the diagonal elements of the cost matrix M are zero and that M is symmetric (so \(M_{kn}=M_{nk}\)), we get for each i:

$$\begin{aligned} \begin{array}{rl} {\mathcal {W}}_{\gamma ,q_i}(p) \le &{} \langle M, X_i \rangle - \gamma E(X_i) \\ =&{} \langle M, X^*_i \rangle - \gamma E(X^*_i) +\frac{1}{2}\delta M_{kn}-\frac{1}{2}\delta M_{nk} \\ +&{} \gamma ([X_i]_{kk}\ln [X_i]_{kk} - [X^*_i]_{kk}\ln [X^*_i]_{kk}) \\ +&{} \gamma ([X_i]_{kn}\ln [X_i]_{kn} - [X^*_i]_{kn}\ln [X^*_i]_{kn}) \\ +&{} \gamma ([X_i]_{nk}\ln [X_i]_{nk} - [X^*_i]_{nk}\ln [X^*_i]_{nk}) \\ +&{} \gamma ([X_i]_{nn}\ln [X_i]_{nn} - [X^*_i]_{nn}\ln [X^*_i]_{nn}) \\ <&{} \langle M, X^*_i \rangle - \gamma E(X^*_i) +\frac{1}{2}\delta M_{kn}-\frac{1}{2}\delta M_{nk} \\ =&{} \langle M, X^*_i \rangle - \gamma E(X^*_i) = {\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma }). \end{array} \end{aligned}$$

The obtained inequalities \({\mathcal {W}}_{\gamma ,q_i}(p)<{\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma })\) contradict the fact that \(p^*_{\gamma }\) is the barycenter; this proves the lemma. \(\square\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yufereva, O., Persiianov, M., Dvurechensky, P. et al. Decentralized convex optimization on time-varying networks with application to Wasserstein barycenters. Comput Manag Sci 21, 12 (2024). https://doi.org/10.1007/s10287-023-00493-9
