Published Jun 2026Reading Time: 09 MinCategory: cs.LG

Escaping the MAB prison: fast KL-projection for linearly constrained continuous actions

I am tired of MAB. I want something fancy.

This post comes from a random thought I had on a so-and-so scenario with a linear constraint. Doesn't sound like a big deal. Linear constraints are mathematically nice. But when the constraint is tied directly to a massive budget in production, you sometimes just fail at fear management and want to take the coward's way out.

That's exactly what we did. We created discrete presets of actions that safely satisfied the constraint. The agent (or algorithm, or model, whatever you want to call this probabilistic decision engine) just picked the best one based on context.

If you're an RL practitioner, this probably sounds familiar. It's an old friend from 1952 called the multi-armed bandit (MAB). Staring down this old friend, a natural curiosity would be: is there a fancy way out? Could we earn the model a bit more freedom? Can we do things safely in the continuum?

1. Formulation of the constrained continuous action space

To grant the policy network full expressiveness, we have to let it output a raw, continuous probability distribution $\hat{\theta}$ over a high-dimensional action space. But this output has to strictly adhere to the linear budget constraints.

Suppose we have $K$ strict resource limits. In matrix form, the realized distribution $\theta$ must satisfy $A \theta = b$ . Here $A \in \mathbb{R}^{K \times N}$ is our feature or cost matrix, and $b \in \mathbb{R}^K$ is the vector of target expected values.

The challenge is constructing a projection $\Pi(\hat{\theta}) \rightarrow \theta^*$ that maps the network's unconstrained output onto the $A \theta = b$ hyperplane.

2. Think twice before you scream quadratic programming (QP)

When projecting a point onto a linear subspace, the standard move is to minimize Euclidean distance. This gives a quadratic programming (QP) problem.

$\min_{\theta} \frac{1}{2} ||\theta - \hat{\theta}||_2^2$

$\text{subject to } A \theta = b \text{ and } \sum \theta_i = 1$

This is the Pavlovian Response of anyone working with optimization. It’s very straightforward.

However, applying Euclidean projection to a probability distribution relies on a fundamentally flawed topological assumption: it treats the probability simplex $\Delta^{N-1}$ as a flat Euclidean space.

The Euclidean metric is insensitive to the boundaries of the simplex. When a strict constraint hyperplane $A\theta = b$ forces the unconstrained prior $\hat{\theta}$ outside the feasible positive orthant, the Euclidean projection will naturally yield negative values. To restore validity ( $\theta \succeq 0$ ), the QP solver's active-set method forcibly projects the solution onto the boundary of the simplex. It drives smaller and long-tail probabilities to absolute zero ( $\theta_i \to 0$ ). In optimization literature, these are known as "corner solutions."

While acceptable in standard operations research, in Reinforcement Learning, gradient sparsity is bad bad news. The policy gradient estimator relies fundamentally on the log-probability of an action, $\nabla_\phi \log \pi_\phi(a|s)$ . If the QP projection layer deterministically forces an action's probability to $0$ , it triggers the following downward spiral:

Zero Probability: Some actions may never get sampled.
Absence of Reward Signal: Because the action is never selected, the agent collects no environmental transitions $(s, a, r, s')$ for that specific state-action pair.
Vanishing Gradients: Without a reward signal, and because $\lim_{\theta \to 0} \log(\theta) = -\infty$ disrupts the gradient calculation, no meaningful gradient flows backward through the network for that dimension.

The policy network becomes permanently blind to that region of the action space.

3. Information geometry and the analytical proof of the Esscher transform

To resolve the vanishing gradient problem inherent to Euclidean projections, we must analyze the problem on the statistical manifold of the exponential family. The natural geometric distance metric between probability distributions is not a straight line, but the Kullback-Leibler (KL) divergence.

Information geometry (Amari, 2016) demonstrates that the squared Euclidean distance is merely a localized, second-order Taylor expansion of the KL divergence, assuming an isotropic Fisher Information Matrix ( $H \approx I$ ). This approximation catastrophically breaks down near the boundaries of the probability simplex, which is precisely why QP solvers force probabilities to zero.

To preserve the topological structure of the prior distribution $\hat{\theta}$ , we formulate the primal optimization problem as minimizing the KL divergence subject to our generalized linear constraints:

$\min_{\theta} D_{KL}(\theta \parallel \hat{\theta}) = \sum_{i=1}^N \theta_i \ln\left(\frac{\theta_i}{\hat{\theta}_i}\right)$

$\text{subject to } \quad A \theta = b \quad \text{and} \quad \mathbf{1}^T\theta = 1$

Proof of the primal solution

We construct the Lagrangian. We introduce a multiplier vector $\vec{\lambda} \in \mathbb{R}^K$ for the strict budget constraints and a scalar $\alpha$ for the simplex constraint.

$\mathcal{L}(\theta, \vec{\lambda}, \alpha) = \sum_{i=1}^N \theta_i \ln\left(\frac{\theta_i}{\hat{\theta}_i}\right) + \vec{\lambda}^T (b - A \theta) + \alpha \left(1 - \sum_{i=1}^N \theta_i\right)$

Taking the partial derivative with respect to each probability component $\theta_i$ and setting it to zero yields:

$\frac{\partial \mathcal{L}}{\partial \theta_i} = \ln\left(\frac{\theta_i}{\hat{\theta}_i}\right) + 1 - (A^T \vec{\lambda})_i - \alpha = 0$

Solving for $\theta_i$ , we obtain:

$\theta_i = \hat{\theta}_i \exp\left( (A^T \vec{\lambda})_i + \alpha - 1 \right)$

We can extract the terms independent of $i$ into a normalization constant. Let us define the partition function $Z(\vec{\lambda}) = \exp(1 - \alpha) = \sum_{j=1}^N \hat{\theta}_j \exp\left( (A^T \vec{\lambda})_j \right)$ . The resulting optimal distribution $\theta^*$ is the Generalized Esscher Transform, also known as exponential tilting:

$\theta^*_i(\vec{\lambda}) = \frac{\hat{\theta}_i \exp\left( (A^T \vec{\lambda})_i \right)}{Z(\vec{\lambda})}$

Or, expressed in vectorized form:

$\theta^*(\vec{\lambda}) = \frac{1}{Z(\vec{\lambda})} \hat{\theta} \circ \exp(A^T \vec{\lambda})$

This proof reveals a profound geometric property. Because the transformation relies strictly on an exponential scaling of the prior, any candidate action with a strictly positive prior probability ( $\hat{\theta}_i > 0$ ) is mathematically guaranteed to retain a strictly positive projected probability ( $\theta^*_i > 0$ ). The projection function never intersects the axes of the positive orthant. Zero-probability clipping is entirely eliminated, and policy gradients survive globally.

4. The dual proof: strict convexity and global convergence

While we have derived the analytical functional form of $\theta^*(\vec{\lambda})$ , we must still compute the unique Lagrange multiplier vector $\vec{\lambda}^*$ that ensures the tilted distribution exactly satisfies $A \theta^* = b$ . We achieve this by optimizing the unconstrained Lagrangian dual function.

Let us define the log-partition function (the cumulant-generating function) $\mathcal{A}(\vec{\lambda}) = \ln Z(\vec{\lambda})$ . The dual objective function we wish to minimize is:

$g(\vec{\lambda}) = \mathcal{A}(\vec{\lambda}) - \vec{\lambda}^T b$

Proof of the gradient (first moment)

By the fundamental properties of the exponential family, the derivative of the log-partition function yields the expected value of the sufficient statistics. Let's prove this explicitly by taking the gradient of $\mathcal{A}(\vec{\lambda})$ with respect to the $k$ -th multiplier $\lambda_k$ :

$\frac{\partial \mathcal{A}(\vec{\lambda})}{\partial \lambda_k} = \frac{\partial}{\partial \lambda_k} \ln Z(\vec{\lambda}) = \frac{1}{Z(\vec{\lambda})} \frac{\partial}{\partial \lambda_k} \sum_{i=1}^N \hat{\theta}_i \exp\left( \sum_{m=1}^K A_{mi}\lambda_m \right)$

$\frac{\partial \mathcal{A}(\vec{\lambda})}{\partial \lambda_k} = \sum_{i=1}^N A_{ki} \left[ \frac{\hat{\theta}_i \exp( (A^T\vec{\lambda})_i )}{Z(\vec{\lambda})} \right] = \sum_{i=1}^N A_{ki} \theta^*_i = \mathbb{E}_{\theta^*}[A_k]$

Therefore, the full gradient of our dual objective $g(\vec{\lambda})$ represents the exact residual error between the current expected values and the target budgets:

$\nabla_{\vec{\lambda}} g(\vec{\lambda}) = \mathbb{E}_{\theta^*}[A] - b$

To minimize $g(\vec{\lambda})$ , we must find the root where the gradient is zero, i.e., $\mathbb{E}_{\theta^*}[A] = b$ .

Proof of the Hessian and strict convexity (second moment)

To construct a Newton-Raphson solver, we require the Hessian matrix. Taking the second partial derivative of $\mathcal{A}(\vec{\lambda})$ with respect to $\lambda_k$ and $\lambda_m$ :

$\frac{\partial^2 \mathcal{A}(\vec{\lambda})}{\partial \lambda_k \partial \lambda_m} = \frac{\partial}{\partial \lambda_m} \sum_{i=1}^N A_{ki} \theta^*_i$

Applying the quotient rule to $\theta^*_i = \frac{\hat{\theta}_i \exp((A^T \vec{\lambda})_i)}{Z(\vec{\lambda})}$ reveals its derivative with respect to $\lambda_m$ :

$\frac{\partial \theta^*_i}{\partial \lambda_m} = \frac{\left[ \hat{\theta}_i \exp((A^T \vec{\lambda})_i) A_{mi} \right] Z(\vec{\lambda}) - \left[ \hat{\theta}_i \exp((A^T \vec{\lambda})_i) \right] \frac{\partial Z(\vec{\lambda})}{\partial \lambda_m}}{Z(\vec{\lambda})^2}$

$= \theta^*_i A_{mi} - \theta^*_i \frac{1}{Z(\vec{\lambda})} \frac{\partial Z(\vec{\lambda})}{\partial \lambda_m}$

Since we established in the gradient proof that $\frac{1}{Z(\vec{\lambda})} \frac{\partial Z(\vec{\lambda})}{\partial \lambda_m} = \mathbb{E}_{\theta^*}[A_m]$ , this simplifies to:

$\frac{\partial \theta^*_i}{\partial \lambda_m} = \theta^*_i (A_{mi} - \mathbb{E}_{\theta^*}[A_m])$

Substituting this back into our Hessian equation:

$\frac{\partial^2 \mathcal{A}(\vec{\lambda})}{\partial \lambda_k \partial \lambda_m} = \sum_{i=1}^N A_{ki} \theta^*_i (A_{mi} - \mathbb{E}_{\theta^*}[A_m]) = \mathbb{E}_{\theta^*}[A_k A_m] - \mathbb{E}_{\theta^*}[A_k]\mathbb{E}_{\theta^*}[A_m]$

Lifting this from element-wise to matrix form:

$\nabla^2_{\vec{\lambda}} g(\vec{\lambda}) = A \, \text{diag}(\theta^*) A^T - (A \theta^*)(A \theta^*)^T = \text{Cov}_{\theta^*}(A) = \Sigma$

This is the exact definition of covariance. The Hessian of the dual objective is the $K \times K$ Covariance Matrix $\Sigma$ of the constraint features under the tilted distribution.

The convergence theorem

Because a covariance matrix is inherently positive semi-definite ( $\Sigma \succeq 0$ ), and assuming the constraint features $A$ are linearly independent, the Hessian is strictly positive definite ( $\Sigma \succ 0$ ).

This mathematically guarantees that the dual objective $g(\vec{\lambda})$ is strictly convex globally, with a unique global minimum. The multivariate Newton-Raphson step:

$\vec{\lambda}_{t+1} = \vec{\lambda}_t - (\nabla^2_{\vec{\lambda}} g)^{-1} \nabla_{\vec{\lambda}} g = \vec{\lambda}_t - \Sigma^{-1} (\mathbb{E}_{\theta^*}[A] - b)$

achieves a quadratic rate of convergence in the neighborhood of the root. Far from the optimum the undamped step can overshoot, so in practice we wrap it in a backtracking line search — scale the step by $t_k \in (0, 1]$ and halve until $g(\vec{\lambda})$ actually decreases. The unit step is accepted after the first few iterations, and from there the local quadratic rate takes over.

Here is the algorithmic loop in Python:

def esscher_projection(prior_p, A, target_b, max_iters=10, tol=1e-7):
    """
    prior_p:  (N,)  Raw prior probabilities from the policy network
    A:        (K, N) Constraint feature matrix
    target_b: (K,)  Target budgets/expected values
    """
    lam = zeros(K) # Initialize K-dimensional Lagrangian multipliers

    for _ in range(max_iters):
        # 1. Exponential tilting
        tilted_logits = log(prior_p) + (A.T @ lam)
        p_star = softmax(tilted_logits)

        # 2. Compute Gradient (Expected Values & Error)
        ev = A @ p_star
        error = ev - target_b

        if norm(error) < tol:
            break

        # 3. Compute Hessian (Covariance Matrix)
        cov = (A * p_star) @ A.T - outer(ev, ev)

        # 4. Multi-dimensional Newton update
        # (Wrap with a backtracking line search on the step size — halve
        #  until g(lam) decreases — to guard overshoot far from the optimum.)
        lam = lam - inv(cov) @ error

    return p_star

The convergence guarantee from the last section already does most of the work. Quadratic convergence means a handful of iterations. The iteration count comes bounded for free. You can now put Adam, RMSprop, SGD, and everything for shit objective functions back on the shelf. The Armijo constant in the line search is the only knob, and it does not move the needle.

5. Making it faster if you really need to

A couple ways to push this further if you are one of those runtime freaks:

JIT it. Wrap esscher_projection with @jit — Numba, JAX, or torch.compile, whichever your stack already uses. Tight numerical kernel, no Python overhead. Easiest path.
Drop to C++/CUDA. If JIT isn't enough, push the whole projection layer down to a raw extension. Nuclear option, but then you probably have to write the entire thing in cpp too, because a python wrapper can make things slow.
FP32, not FP64. 32-bit floats hold 7 decimal digits of precision, more than any physical constraint cares about. Half the memory bandwidth.
Single-pass covariance. Keep precomputed second-moment buffers around. Skip the redundant matmuls.

6. Practical considerations and architectural adaptations

A few practical approaches/considerations when you actually use this thing.

Dimensionality reduction. If the action space has tens of thousands of items, inverting that big a covariance matrix is infeasible. So we compress the raw actions into a lower-dimensional basis registry ( $N=30$ or so). The Brain outputs weights for these basis functions. The Oracle tilts the basis functions, not the raw items.

Discretization bias. Evaluating continuous functions on discrete lookup tables introduces numerical drift. We fix this by computing the effective empirical means of the basis functions against the actual discrete payload. Those calibrated values feed into the Newton solver. Adherence stays exact.

0 times anything is zero. The Esscher transform is multiplicative scaling ( $p^* \propto \hat{p} \circ \exp(\dots)$ ). It cannot project UP from zero. $0 \times \text{anything} = 0$ . If the Brain outputs an exact zero prior for an action, it stays zero after projection. Even when resurrecting that probability is necessary to satisfy the constraint. So no ReLU. Nothing that produces hard zeros. Route all raw network outputs through a softmax. Strict positivity before the Oracle sees them.

Gradient alignment. The projection may greatly change the direction of the gradient. We can potentially fix it by adding a KL penalty to the loss (projection_kl) that compares pre and post-projection distributions. The alignment loss forces the Brain to feel the pain of the projection. It learns to output distributions that naturally sit closer to the constraint hyperplane.

7. What can this be used for?

Come on, it’s a mechanism for continuous decision-making under strict linear constraint. It applies broadly.

References

Amari, S. I. (2016). Information Geometry and Its Applications (Vol. 194). Springer.

Next Piece