Theoretical grounds of lambda optimization

№	Пользователь	Рейтинг
1	tourist	3757
2	jiangly	3647
3	Benq	3581
4	orzdevinwang	3570
5	Geothermal	3569
5	cnnfls_csy	3569
7	Radewoosh	3509
8	ecnerwala	3486
9	jqdai0815	3474
10	gyh20	3447

№	Пользователь	Вклад
1	maomao90	171
2	awoo	165
3	adamant	163
4	TheScrasse	159
5	maroonrk	155
6	nor	154
7	-is-this-fft-	152
8	Petr	147
9	orz	145
9	pajenegod	145

Hi everyone!

This time I'd like to write about what's widely known as "Aliens trick" (as it got popularized after 2016 IOI problem called Aliens). There are already some articles about it here and there, and I'd like to summarize them, while also adding insights into the connection between this trick and generic Lagrange multipliers and Lagrangian duality which often occurs in e.g. linear programming problems.

Familiarity with a previous blog about ternary search or, at the very least, definitions and propositions from it is expected.

Great thanks to mango_lassi and 300iq for useful discussions and some key insights on this.

Tldr.

Let $$$f : X \to \mathbb R$$$ and $$$g : X \to \mathbb R^c$$$. You need to solve the constrained optimization problem

$$$\begin{gather}f(x) \to \min,\\ g(x) = 0.\end{gather}$$$

Let $$$t(\lambda) = \inf_x [f(x) - \lambda \cdot g(x)]$$$. Finding $$$t(\lambda)$$$ is unconstrained problem and is usually much simpler.

It may be rewritten as $$$t(\lambda) = \inf_y [h(y) - \lambda \cdot y]$$$ where $$$h(y)$$$ is the minimum possible $$$f(x)$$$ subject to $$$g(x)=y$$$.

This is applicable if $$$h(y)$$$ is convex, as $$$t(\lambda)$$$ defines a supporting hyperplane of $$$h(y)$$$'s graph with normal vector $$$(-\lambda, 1)$$$.

For convex $$$h(y)$$$, there is a certain monotonicity between $$$\lambda_i$$$ and $$$y_i$$$, so you can find $$$\lambda$$$ corresponding to $$$y=0$$$ with a binary search.

If $$$g(x)$$$ and $$$f(x)$$$ are integer functions, a binary search is sometimes doable over integers with $$$\lambda_i$$$ corresponding to $$$h(y_i) - h(y_i-1)$$$.

Boring and somewhat rigorous explanation is below, problem examples are belower.

Lagrange duality

Let $$$f : X \to \mathbb R$$$ be the objective function and $$$g : X \to \mathbb R^c$$$ be the constraint function. The constrained optimization problem

$$$\begin{gather}f(x) \to \min,\\ g(x) = 0\end{gather}$$$

in some cases can be reduced to finding stationary points of the Lagrange function

$$$L(x, \lambda) = f(x) - \lambda \cdot g(x).$$$

Here $$$\lambda \cdot g(x)$$$ is a dot product of $$$g(x)$$$ and a variable vector $$$\lambda \in \mathbb R^c$$$.

Def. 1. A function $$$L(x, \lambda)$$$ defined above is called the Lagrange function, or the Lagrangian.

Def. 2. A vector $$$\lambda \in \mathbb R^c$$$ defined above is called a Lagrange multiplier. Its components are collectively called Lagrange multipliers.

Mathematical optimization focuses on finding stationary points of $$$L(x,\lambda)$$$. However, we're more interested in its infimal projection

$$$t(\lambda) = \inf\limits_{x \in X} L(x,\lambda).$$$

Def. 3. A function $$$t(\lambda)$$$ defined above is called the Lagrange dual function.

Note that $$$t(\lambda)$$$, as a point-wise infimum of concave (linear in $$$\lambda$$$) functions, is always concave, no matter what $$$X$$$ and $$$f$$$ are.

If $$$x^*$$$ is the solution to the original problem, then $$$t(\lambda) \leq L(x^*,\lambda)=f(x^*)$$$ for any $$$\lambda \in \mathbb R^c$$$. This allows us to introduce

Def. 4. The unconstrained optimization problem $$$t(\lambda) \to \max$$$ is called the Lagrangian dual problem.

Def. 5. If $$$\lambda^*$$$ is the solution to the dual problem, the value $$$f(x^*) - t(\lambda^*)$$$ is called the duality gap.

Def. 6. A condition when the duality gap is zero is called a strong duality.

Typical example here is Slater's condition, which says that strong duality holds if $$$f(x)$$$ is convex and there exists $$$x$$$ such that $$$g(x)=0$$$.

Change of domain

In competitive programming, the set $$$X$$$ in definitions above is often weird and very difficult to analyze directly, so Slater's condition is not applicable. As a typical example, $$$X$$$ could be the set of all possible partitions of $$$\{1,2,\dots, n\}$$$ into non-intersecting segments.

To mitigate this, we define $$$h(y)$$$ as the minimum value of $$$f(x)$$$ subject to $$$g(x)=y$$$. In this notion, the dual function is written as

$$$t(\lambda) = \inf\limits_{y \in Y} [h(y) - \lambda \cdot y],$$$

where $$$Y=\{ g(x) : x \in X\}$$$. The set $$$Y$$$ is usually much more regular than $$$X$$$, as just by definition it is already a subset of $$$\mathbb R^c$$$. The strong duality condition is also very clear in this terms: it holds if and only if $$$0 \in Y$$$ and there is a $$$\lambda$$$ for which $$$y=0$$$ delivers infimum.

Geometrically $$$t(\lambda)$$$ defines a level at which the epigraph of $$$h(y)$$$, i. e. the set $$$\{(y,z) : z \geq h(y)\}$$$ has a supporting hyperplane with the normal vector $$$(-\lambda, 1)$$$. Indeed, the half-space bounded by such hyperplane on the level $$$c$$$ is defined as

$$$\{(y, z) : z-\lambda \cdot y \geq c\}.$$$

With $$$c=t(\lambda) > -\infty$$$, all the points at which the hyperplane touches the epigraph would correspond to infimum. Please, refer to the picture below. Lower $$$c$$$ would move the hyperplane lower, while higher $$$c$$$ would move it upper. With $$$c=t(\lambda)$$$, the hyperplane is lowest possible while still intersecting the epigraph of the function in the point $$$(y^*, h(y^*))$$$ where $$$y^*$$$ delivers the minimum of $$$h(y) - \lambda \cdot y$$$.

Competitive programming problems typically assume variable $$$y$$$ in the input, so for strong duality to hold for all inputs, all $$$y \in Y$$$ should have a supporting hyperplane that touches the epigraph in the point $$$(y, h(y))$$$. This condition is equivalent to $$$h(y)$$$ being convex-extensible, that is, there should exist a convex function on $$$\mathbb R^c$$$ such that its restriction to $$$Y$$$ yields $$$h(y)$$$.

We will not distinguish convex and convex-extensible functions further.

Decomposing into nested $$$1$$$-dimensional problems

Assume that the problem is $$$1$$$-dimensional. While calculating $$$t(\lambda)$$$, we typically find $$$x_\lambda$$$ that delivers the infimum of $$$f(x) - \lambda \cdot g(x)$$$. But at the same time we find $$$y_\lambda=g(x_\lambda)$$$ that delivers the infimum of $$$h(y) - \lambda \cdot y$$$. As $$$h(y)$$$ is convex, larger $$$y$$$ would require larger values of $$$\lambda$$$ in the supporting plane, thus it is possible to do a binary search on $$$\lambda$$$ to find the one corresponding to $$$y=0$$$.

If the problem is multi-dimensional, it can be decomposed in a nested set of $$$1$$$-dimensional problems. Let's separate everything related to the first dimension of $$$\lambda$$$ and $$$y$$$ vectors, so that $$$\lambda = (\lambda_1, \lambda_2) \in \mathbb R \times \mathbb R^{c-1}$$$ and $$$y=(y_1, y_2) \in Y_1 \times Y_2$$$. Then

$$$\max \limits_{\lambda_1,\lambda_2} t(\lambda_1,\lambda_2) = \max\limits_{\lambda_1} \max\limits_{\lambda_2} \inf\limits_{x}[f(x)-\lambda_1 \cdot g_1(x) - \lambda_2 \cdot g_2(x)]=\max\limits_{\lambda_1} t(\lambda_1),$$$

where $$$t(\lambda_1)$$$ is defined as

$$$t(\lambda_1) = \max\limits_{\lambda_2} \inf\limits_x [f(x) - \lambda_1 \cdot g_1(x) - \lambda_2 \cdot g_2(x)]= \max\limits_{\lambda_2} \inf\limits_x [f_{\lambda_1}(x) - \lambda_2 \cdot g_2(x)]=\max\limits_{\lambda_2}t_{\lambda_1}(\lambda_2).$$$

Correspondingly we may define $$$h_{\lambda_1}(y_2)$$$ as the minimum possible value of $$$f_{\lambda_1}(x)$$$ subject to $$$g_2(x) = y_2$$$. On the other hand,

$$$h_{\lambda_1}(y_2) = \inf\limits_{y_1} [h(y_1, y_2) - \lambda_1 \cdot y_1].$$$

It can be proven that when $$$h(y_1,y_2)$$$ is convex, $$$h_{\lambda_1}(y_2)$$$ is convex as well, as an infimal projection of convex function. It means that strong duality still holds for $$$t_{\lambda_1}(\lambda_2)$$$. Thus, we can do a nested binary search to find the $$$\lambda$$$ that corresponds to $$$y=0$$$:

void adjust(double *lmb, int i, int c) {
    if(i == c) {
        return;
    }
    double l = -inf, r = +inf; // some numbers that are large enough
    while(r - l > eps) {
        double m = (l + r) / 2;
        lmb[i] = m;
        adjust(lmb, i + 1, c);
        auto [xl, yl] = solve(lmb, i); // returns (x_lambda, y_lambda) with the minimum y_lambda[i]
        if(yl[k] < 0) {
            l = m;
        } else {
            r = m;
        }
    }
}

Note that a concrete $$$\lambda_i$$$ might correspond to the contiguous segment of $$$y_i$$$ and vice versa, thus we should find the $$$(x_\lambda,y_\lambda)$$$ pair with the minimum possible $$$i$$$-th component of $$$y_\lambda$$$ to guarantee the monotonicity. Otherwise, binary search might work in an unexpected way.

Integer search

Here and onwards we refer to $$$h(y_i)$$$ as a function of single real variable $$$y_i$$$, assuming that all other components of $$$y$$$ are fixed.

What are the possible $$$y_i$$$ for specific $$$\lambda_i$$$? By definition, it implies $$$h(y_i)-\lambda y_i = h(y'_i) - \lambda y'_i$$$, thus the relationship between them is

$$$\frac{h(y'_i)-h(y_i)}{y'_i-y_i}=\lambda_i.$$$

What are the possible $$$\lambda_i$$$ for specific $$$y_i$$$? When $$$h(y_i)$$$ is continuously differentiable, it essentially means that $$$\lambda_i$$$ corresponds to $$$y_i$$$ such that $$$\lambda_i = h'(y_i)$$$. On the other hand, when $$$Y$$$ is the set of integers, $$$y_i$$$ optimizes $$$t(\lambda)$$$ for all $$$\lambda_i$$$ such that

$$$h(y_i) - h(y_i-1) \leq \lambda_i \leq h(y_i + 1) - h(y_i).$$$

So, if $$$h(y_i)$$$ is an integer function, we may do the integer search of $$$\lambda_i$$$ on possible values of $$$h(k)-h(k-1)$$$ only.

In a very generic case, when the function is not continuously differentiable and $$$y_i$$$ are not necessarily integer, the set of possible $$$\lambda_i$$$ for a given $$$y_i$$$ is defined as $$$\partial h(y_i)$$$, the so-called sub-differential of $$$h$$$ in $$$y_i$$$, formally defined as $$$[a,b]$$$ where

$$$\begin{gather}a = \sup_{y'_i < y_i} \frac{f(y_i) - f(y'_i)}{y_i - y'_i},~~b = \inf_{y_i < y'_i} \frac{f(y'_i) - f(y_i)}{y'_i - y_i}\end{gather}.$$$

The concept of sub-derivatives and sub-differentials can be generalized to multi-dimensional case as well with sub-gradients.

Testing convexity

When $$$Y$$$ is continuous set, convexity may be tested by local criteria. For one-dimensional set, $$$h(y)$$$ is convex if and only if $$$h'(y)$$$ is non-decreasing. If it has a second derivative, we might also say that the function is convex if and only if $$$h' '(y) \geq 0$$$ for all $$$y$$$.

In multidimensional case, the local criterion is that the Hessian matrix $$$\frac{\partial h}{\partial y_i \partial y_j}$$$ is a positive semi-definite.

In discrete case, derivatives can be substituted with finite differences. A function $$$h(y)$$$ defined on $$$\mathbb Z$$$ is convex if and only if

$$$\Delta [h] (y) = h(y+1)-h(y)$$$

is non-decreasing or, which is equivalent,

$$$\Delta^2 [h] (y) = h(y+2)-2h(y+1)+h(y) \geq 0.$$$

Another possible way to prove that $$$1$$$-dimensional $$$h(y)$$$ is convex in relevant problems is to reduce it to finding min-cost flow of size $$$y$$$ in some network. The common algorithm to find such flow pushes a flow of size $$$1$$$ through the shortest path in the residual network, which becomes smaller when $$$y$$$ rises. This method was revealed to me by 300iq.

In multi-dimensional case testing convexity might be challenging. Instead of this, one may decompose the problem into a nested set of $$$1$$$-dimensional problems such that for each problem $$$\lambda$$$-optimization is applicable. For this, you would need to prove that TODO TODO TODO

Problem examples

For simplicity, we will use $$$g(x)=y_0$$$ in primal problem instead of $$$g(x)-y_0=0$$$, while also omitting the constant part in dual function. It leads to the following changes with respect to the written above:

Binary search target is set to $$$y=y_0$$$ instead of $$$y=0$$$;
The primal solution is determined by $$$f(x^*) = t(\lambda^*) + \lambda \cdot y_0$$$ rather than $$$f(x^*) = t(\lambda^*)$$$.

Sometimes the problem is stated with $$$\max$$$ rather than $$$\min$$$. In this case, $$$t(\lambda)$$$ and $$$h(y)$$$ are defined as supremum and maximum rather than infimum and minimum. Correspondingly, $$$h(y)$$$ needs to be concave rather than convex to allow the usage of the trick.

Gosha is hunting. You're given $$$a$$$, $$$b$$$, $$$p_1, \dots, p_n$$$ and $$$q_1,\dots, q_n$$$. You need to pick two subsets $$$A$$$ and $$$B$$$ of $$$\{1,2,\dots,n\}$$$ of size $$$a$$$ and $$$b$$$ correspondingly in such a way that the following sum is maximized:

$$$f(A, B) = \sum\limits_{i \in A} p_i + \sum\limits_{j \in B} q_j - \sum\limits_{k \in A \cap B} p_k q_k.$$$

Formulating it as a constrained optimization problem, we obtain

$$$\begin{gather}f(A, B) \to \max, \\ |A| = a, \\ |B| = b.\end{gather}$$$

Lagrangian dual here is

$$$t(\lambda_1, \lambda_2) = \max\limits_{A,B} [f(A, B) - \lambda_1|A| - \lambda_2|B|].$$$

Let $$$h(\alpha,\beta) = \max f(A, B)$$$ subject to $$$|A| = \alpha$$$ and $$$|B| = \beta$$$, then

$$$t(\lambda_1, \lambda_2) = \max\limits_{\alpha,\beta}[h(\alpha,\beta)-\lambda_1 \alpha - \lambda_2 \beta].$$$

References

Duality (optimization) — English Wikipedia
The Trick From Aliens — Serbanology
My Take on Aliens' Trick — Mamnoon Siam's Blog
Incredibly beautiful DP optimization from N^3 to N log^2 N
Comment on Codeforces by _h_
Brief conversation with 300iq
Part of the article was once revealed to me in a dream

Rev.	Кто	Когда	Δ	Комментарий
en68	adamant	2023-10-11 00:47:07	8
en67	adamant	2022-02-13 16:55:43	33
en66	adamant	2022-01-04 14:57:40	25
en65	adamant	2022-01-04 05:03:24	6363	+ honorable mention
en64	adamant	2022-01-04 04:20:18	8	articles
en63	adamant	2022-01-04 04:18:39	354	example 3
en62	adamant	2022-01-04 04:00:52	748	tldr structured
en61	adamant	2022-01-04 03:39:52	33
en60	adamant	2022-01-04 03:38:28	721	example, part 2
en59	adamant	2022-01-04 03:25:48	784
en58	adamant	2022-01-04 03:18:14	1414	example
en57	adamant	2022-01-04 00:21:04	4
en56	adamant	2022-01-04 00:20:45	570	better code for min_conv
en55	adamant	2022-01-03 15:20:41	472	clarified tldr
en54	adamant	2022-01-03 03:37:09	688	code for max-conv of concave functions
en53	adamant	2022-01-03 01:11:10	43	link
en52	adamant	2022-01-03 01:07:50	30
en51	adamant	2022-01-03 01:06:56	12
en50	adamant	2022-01-03 01:02:56	1815	+ example
en49	adamant	2022-01-02 20:14:13	160	sections in testing convexity
en48	adamant	2022-01-02 13:29:02	129
en47	adamant	2022-01-02 13:26:24	104	(published)
en46	adamant	2022-01-02 13:21:47	3709
en45	adamant	2022-01-02 12:55:16	690
en44	adamant	2022-01-02 01:54:22	24
en43	adamant	2022-01-02 01:53:53	710
en42	adamant	2022-01-02 01:01:41	210
en41	adamant	2022-01-02 00:54:43	643
en40	adamant	2022-01-02 00:49:11	1627
en39	adamant	2022-01-02 00:09:54	8779
en38	adamant	2022-01-01 22:34:20	1161
en37	adamant	2022-01-01 16:34:48	296
en36	adamant	2022-01-01 16:33:24	4
en35	adamant	2022-01-01 16:29:44	22
en34	adamant	2022-01-01 16:27:27	131
en33	adamant	2022-01-01 16:26:19	173
en32	adamant	2022-01-01 16:23:52	31
en31	adamant	2022-01-01 16:23:22	865
en30	adamant	2021-12-29 05:34:22	1201
en29	adamant	2021-12-29 03:47:08	332
en28	adamant	2021-12-29 02:45:56	2366
en27	adamant	2021-12-28 22:09:05	1383
en26	adamant	2021-12-28 19:28:41	1585
en25	adamant	2021-12-28 18:57:44	52
en24	adamant	2021-12-28 18:44:26	5
en23	adamant	2021-12-28 18:43:58	915
en22	adamant	2021-12-28 18:28:53	112
en21	adamant	2021-12-28 18:25:09	218
en20	adamant	2021-12-28 18:18:48	1224	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en19	adamant	2021-12-28 17:51:33	66
en18	adamant	2021-12-28 17:50:21	316	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en17	adamant	2021-12-28 17:35:06	409	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en16	adamant	2021-12-28 17:18:07	81	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en15	adamant	2021-12-28 17:14:07	1049	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en14	adamant	2021-12-28 16:12:52	96	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en13	adamant	2021-12-28 16:01:48	338	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en12	adamant	2021-12-28 15:54:43	903	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en11	adamant	2021-12-28 04:38:22	2	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en10	adamant	2021-12-28 04:35:18	118	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en9	adamant	2021-12-28 04:23:41	1406	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en8	adamant	2021-12-28 03:18:55	333	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en7	adamant	2021-12-28 02:59:30	322	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en6	adamant	2021-12-28 01:40:50	2845	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en5	adamant	2021-12-27 22:08:35	26	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en4	adamant	2021-12-26 19:06:11	0	Tiny change: 'blem\n\n$$f(x)→ming(x)=0f(x)→ming(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en3	adamant	2021-12-26 01:42:07	1888	Tiny change: 'blem\n\n$$f(x)→extrg(x)=0f(x)→extrg(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en2	adamant	2021-12-25 16:27:58	176	Tiny change: 'blem\n\n$$f(x)→extrg(x)=0f(x)→extrg(x)=0\begin{gat' -> 'blem\n\n$$\begin{gat'
en1	adamant	2021-12-25 16:21:30	2909	Initial revision (saved to drafts)