← Return to IndexYvonne's Portfolio — 01 / Blog
Published May 2026Reading Time: 10 MinCategory: cs.LG

Escaping the MAB prison: fast KL-projection for linearly constrained continuous actions

I am tired of MAB. I want something fancy.

This post comes from a random thought I had on a so-and-so scenario with a linear constraint. Doesn't sound like a big deal. Linear constraints are mathematically nice. But when the constraint is tied directly to a massive budget in production, you sometimes just fail at fear management and want to take the coward's way out.

That's exactly what we did. We created discrete presets of actions that safely satisfied the constraint. The agent (or algorithm, or model, whatever you want to call this probabilistic decision engine) just picked the best one based on context.

If you're an RL practitioner, this probably sounds familiar. It's an old friend from 1952 called the multi-armed bandit (MAB). Staring down this old friend, a natural curiosity would be: is there a fancy way out? Could we earn the model a bit more freedom? Can we do things safely in the continuum?


1. Formulation of the constrained continuous action space

To grant the policy network full expressiveness, we have to let it output a raw, continuous probability distribution θ^\hat{\theta} over a high-dimensional action space. But this output has to strictly adhere to the linear budget constraints.

Suppose we have KK strict resource limits. In matrix form, the realized distribution θ\theta must satisfy Aθ=bA \theta = b. Here ARK×NA \in \mathbb{R}^{K \times N} is our feature or cost matrix, and bRKb \in \mathbb{R}^K is the vector of target expected values.

The challenge is constructing a projection Π(θ^)θ\Pi(\hat{\theta}) \rightarrow \theta^* that maps the network's unconstrained output onto the Aθ=bA \theta = b hyperplane.

2. Think twice before you scream quadratic programming (QP)

When projecting a point onto a linear subspace, the standard move is to minimize Euclidean distance. This gives a quadratic programming (QP) problem.

minθ12θθ^22\min_{\theta} \frac{1}{2} ||\theta - \hat{\theta}||_2^2

subject to Aθ=b and θi=1\text{subject to } A \theta = b \text{ and } \sum \theta_i = 1

This is the Pavlovian Response of anyone working with optimization. It’s very straightforward.

However, applying Euclidean projection to a probability distribution relies on a fundamentally flawed topological assumption: it treats the probability simplex ΔN1\Delta^{N-1} as a flat Euclidean space.

The Euclidean metric is insensitive to the boundaries of the simplex. When a strict constraint hyperplane Aθ=bA\theta = b forces the unconstrained prior θ^\hat{\theta} outside the feasible positive orthant, the Euclidean projection will naturally yield negative values. To restore validity (θ0\theta \succeq 0), the QP solver's active-set method forcibly projects the solution onto the boundary of the simplex. It drives smaller and long-tail probabilities to absolute zero (θi0\theta_i \to 0). In optimization literature, these are known as "corner solutions."

While acceptable in standard operations research, in Reinforcement Learning, gradient sparsity is bad bad news. The policy gradient estimator relies fundamentally on the log-probability of an action, ϕlogπϕ(as)\nabla_\phi \log \pi_\phi(a|s). If the QP projection layer deterministically forces an action's probability to 00, it triggers the following downward spiral:

  1. Zero Probability: Some actions may never get sampled.
  2. Absence of Reward Signal: Because the action is never selected, the agent collects no environmental transitions (s,a,r,s)(s, a, r, s') for that specific state-action pair.
  3. Vanishing Gradients: Without a reward signal, and because limθ0log(θ)=\lim_{\theta \to 0} \log(\theta) = -\infty disrupts the gradient calculation, no meaningful gradient flows backward through the network for that dimension.

The policy network becomes permanently blind to that region of the action space.

3. Information geometry and the analytical proof of the Esscher transform

To resolve the vanishing gradient problem inherent to Euclidean projections, we must analyze the problem on the statistical manifold of the exponential family. The natural geometric distance metric between probability distributions is not a straight line, but the Kullback-Leibler (KL) divergence.

Information geometry (Amari, 2016) demonstrates that the squared Euclidean distance is merely a localized, second-order Taylor expansion of the KL divergence, assuming an isotropic Fisher Information Matrix (HIH \approx I). This approximation catastrophically breaks down near the boundaries of the probability simplex, which is precisely why QP solvers force probabilities to zero.

To preserve the topological structure of the prior distribution θ^\hat{\theta}, we formulate the primal optimization problem as minimizing the KL divergence subject to our generalized linear constraints:

minθDKL(θθ^)=i=1Nθiln(θiθ^i)\min_{\theta} D_{KL}(\theta \parallel \hat{\theta}) = \sum_{i=1}^N \theta_i \ln\left(\frac{\theta_i}{\hat{\theta}_i}\right)

subject to Aθ=band1Tθ=1\text{subject to } \quad A \theta = b \quad \text{and} \quad \mathbf{1}^T\theta = 1

Proof of the primal solution

We construct the Lagrangian. We introduce a multiplier vector λRK\vec{\lambda} \in \mathbb{R}^K for the strict budget constraints and a scalar α\alpha for the simplex constraint.

L(θ,λ,α)=i=1Nθiln(θiθ^i)+λT(bAθ)+α(1i=1Nθi)\mathcal{L}(\theta, \vec{\lambda}, \alpha) = \sum_{i=1}^N \theta_i \ln\left(\frac{\theta_i}{\hat{\theta}_i}\right) + \vec{\lambda}^T (b - A \theta) + \alpha \left(1 - \sum_{i=1}^N \theta_i\right)

Taking the partial derivative with respect to each probability component θi\theta_i and setting it to zero yields:

Lθi=ln(θiθ^i)+1(ATλ)iα=0\frac{\partial \mathcal{L}}{\partial \theta_i} = \ln\left(\frac{\theta_i}{\hat{\theta}_i}\right) + 1 - (A^T \vec{\lambda})_i - \alpha = 0

Solving for θi\theta_i, we obtain:

θi=θ^iexp((ATλ)i+α1)\theta_i = \hat{\theta}_i \exp\left( (A^T \vec{\lambda})_i + \alpha - 1 \right)

We can extract the terms independent of ii into a normalization constant. Let us define the partition function Z(λ)=exp(1α)=j=1Nθ^jexp((ATλ)j)Z(\vec{\lambda}) = \exp(1 - \alpha) = \sum_{j=1}^N \hat{\theta}_j \exp\left( (A^T \vec{\lambda})_j \right). The resulting optimal distribution θ\theta^* is the Generalized Esscher Transform, also known as exponential tilting:

θi(λ)=θ^iexp((ATλ)i)Z(λ)\theta^*_i(\vec{\lambda}) = \frac{\hat{\theta}_i \exp\left( (A^T \vec{\lambda})_i \right)}{Z(\vec{\lambda})}

Or, expressed in vectorized form:

θ(λ)=1Z(λ)θ^exp(ATλ)\theta^*(\vec{\lambda}) = \frac{1}{Z(\vec{\lambda})} \hat{\theta} \circ \exp(A^T \vec{\lambda})

This proof reveals a profound geometric property. Because the transformation relies strictly on an exponential scaling of the prior, any candidate action with a strictly positive prior probability (θ^i>0\hat{\theta}_i > 0) is mathematically guaranteed to retain a strictly positive projected probability (θi>0\theta^*_i > 0). The projection function never intersects the axes of the positive orthant. Zero-probability clipping is entirely eliminated, and policy gradients survive globally.

4. The dual proof: strict convexity and global convergence

While we have derived the analytical functional form of θ(λ)\theta^*(\vec{\lambda}), we must still compute the unique Lagrange multiplier vector λ\vec{\lambda}^* that ensures the tilted distribution exactly satisfies Aθ=bA \theta^* = b. We achieve this by optimizing the unconstrained Lagrangian dual function.

Let us define the log-partition function (the cumulant-generating function) A(λ)=lnZ(λ)\mathcal{A}(\vec{\lambda}) = \ln Z(\vec{\lambda}). The dual objective function we wish to minimize is:

g(λ)=A(λ)λTbg(\vec{\lambda}) = \mathcal{A}(\vec{\lambda}) - \vec{\lambda}^T b

Proof of the gradient (first moment)

By the fundamental properties of the exponential family, the derivative of the log-partition function yields the expected value of the sufficient statistics. Let's prove this explicitly by taking the gradient of A(λ)\mathcal{A}(\vec{\lambda}) with respect to the kk-th multiplier λk\lambda_k:

A(λ)λk=λklnZ(λ)=1Z(λ)λki=1Nθ^iexp(m=1KAmiλm)\frac{\partial \mathcal{A}(\vec{\lambda})}{\partial \lambda_k} = \frac{\partial}{\partial \lambda_k} \ln Z(\vec{\lambda}) = \frac{1}{Z(\vec{\lambda})} \frac{\partial}{\partial \lambda_k} \sum_{i=1}^N \hat{\theta}_i \exp\left( \sum_{m=1}^K A_{mi}\lambda_m \right)

A(λ)λk=i=1NAki[θ^iexp((ATλ)i)Z(λ)]=i=1NAkiθi=Eθ[Ak]\frac{\partial \mathcal{A}(\vec{\lambda})}{\partial \lambda_k} = \sum_{i=1}^N A_{ki} \left[ \frac{\hat{\theta}_i \exp( (A^T\vec{\lambda})_i )}{Z(\vec{\lambda})} \right] = \sum_{i=1}^N A_{ki} \theta^*_i = \mathbb{E}_{\theta^*}[A_k]

Therefore, the full gradient of our dual objective g(λ)g(\vec{\lambda}) represents the exact residual error between the current expected values and the target budgets:

λg(λ)=Eθ[A]b\nabla_{\vec{\lambda}} g(\vec{\lambda}) = \mathbb{E}_{\theta^*}[A] - b

To minimize g(λ)g(\vec{\lambda}), we must find the root where the gradient is zero, i.e., Eθ[A]=b\mathbb{E}_{\theta^*}[A] = b.

Proof of the Hessian and strict convexity (second moment)

To construct a Newton-Raphson solver, we require the Hessian matrix. Taking the second partial derivative of A(λ)\mathcal{A}(\vec{\lambda}) with respect to λk\lambda_k and λm\lambda_m:

2A(λ)λkλm=λmi=1NAkiθi\frac{\partial^2 \mathcal{A}(\vec{\lambda})}{\partial \lambda_k \partial \lambda_m} = \frac{\partial}{\partial \lambda_m} \sum_{i=1}^N A_{ki} \theta^*_i

Applying the quotient rule to θi=θ^iexp((ATλ)i)Z(λ)\theta^*_i = \frac{\hat{\theta}_i \exp((A^T \vec{\lambda})_i)}{Z(\vec{\lambda})} reveals its derivative with respect to λm\lambda_m:

θiλm=[θ^iexp((ATλ)i)Ami]Z(λ)[θ^iexp((ATλ)i)]Z(λ)λmZ(λ)2\frac{\partial \theta^*_i}{\partial \lambda_m} = \frac{\left[ \hat{\theta}_i \exp((A^T \vec{\lambda})_i) A_{mi} \right] Z(\vec{\lambda}) - \left[ \hat{\theta}_i \exp((A^T \vec{\lambda})_i) \right] \frac{\partial Z(\vec{\lambda})}{\partial \lambda_m}}{Z(\vec{\lambda})^2}

=θiAmiθi1Z(λ)Z(λ)λm= \theta^*_i A_{mi} - \theta^*_i \frac{1}{Z(\vec{\lambda})} \frac{\partial Z(\vec{\lambda})}{\partial \lambda_m}

Since we established in the gradient proof that 1Z(λ)Z(λ)λm=Eθ[Am]\frac{1}{Z(\vec{\lambda})} \frac{\partial Z(\vec{\lambda})}{\partial \lambda_m} = \mathbb{E}_{\theta^*}[A_m], this simplifies to:

θiλm=θi(AmiEθ[Am])\frac{\partial \theta^*_i}{\partial \lambda_m} = \theta^*_i (A_{mi} - \mathbb{E}_{\theta^*}[A_m])

Substituting this back into our Hessian equation:

2A(λ)λkλm=i=1NAkiθi(AmiEθ[Am])=Eθ[AkAm]Eθ[Ak]Eθ[Am]\frac{\partial^2 \mathcal{A}(\vec{\lambda})}{\partial \lambda_k \partial \lambda_m} = \sum_{i=1}^N A_{ki} \theta^*_i (A_{mi} - \mathbb{E}_{\theta^*}[A_m]) = \mathbb{E}_{\theta^*}[A_k A_m] - \mathbb{E}_{\theta^*}[A_k]\mathbb{E}_{\theta^*}[A_m]

Lifting this from element-wise to matrix form:

λ2g(λ)=Adiag(θ)AT(Aθ)(Aθ)T=Covθ(A)=Σ\nabla^2_{\vec{\lambda}} g(\vec{\lambda}) = A \, \text{diag}(\theta^*) A^T - (A \theta^*)(A \theta^*)^T = \text{Cov}_{\theta^*}(A) = \Sigma

This is the exact definition of covariance. The Hessian of the dual objective is the K×KK \times K Covariance Matrix Σ\Sigma of the constraint features under the tilted distribution.

The convergence theorem

Because a covariance matrix is inherently positive semi-definite (Σ0\Sigma \succeq 0), and assuming the constraint features AA are linearly independent, the Hessian is strictly positive definite (Σ0\Sigma \succ 0).

This mathematically guarantees that the dual objective g(λ)g(\vec{\lambda}) is strictly convex globally. Consequently, a multivariate Newton-Raphson optimization step:

λt+1=λt(λ2g)1λg=λtΣ1(Eθ[A]b)\vec{\lambda}_{t+1} = \vec{\lambda}_t - (\nabla^2_{\vec{\lambda}} g)^{-1} \nabla_{\vec{\lambda}} g = \vec{\lambda}_t - \Sigma^{-1} (\mathbb{E}_{\theta^*}[A] - b)

...is mathematically guaranteed to converge monotonically to the unique global optimum from any arbitrary initial vector λ0\vec{\lambda}_0, achieving a quadratic rate of convergence in the neighborhood of the root.

Here is the algorithmic loop in Python:

def esscher_projection(prior_p, A, target_b, max_iters=10, tol=1e-7):
    """
    prior_p:  (N,)  Raw prior probabilities from the policy network
    A:        (K, N) Constraint feature matrix
    target_b: (K,)  Target budgets/expected values
    """
    lam = zeros(K) # Initialize K-dimensional Lagrangian multipliers

    for _ in range(max_iters):
        # 1. Exponential tilting
        tilted_logits = log(prior_p) + (A.T @ lam)
        p_star = softmax(tilted_logits)

        # 2. Compute Gradient (Expected Values & Error)
        ev = A @ p_star
        error = ev - target_b

        if norm(error) < tol:
            break

        # 3. Compute Hessian (Covariance Matrix)
        cov = (A * p_star) @ A.T - outer(ev, ev)

        # 4. Multi-dimensional Newton update
        lam = lam - inv(cov) @ error

    return p_star

The convergence guarantee from the last section already does most of the work. Quadratic convergence means a handful of iterations. The iteration count comes bounded for free. You get to the global minimum starting from any initial condition. You can now put Adam, RMSprop, SGD, and everything for shit objective functions back on the shelf. There is nothing to tune.

5. Making it faster if you really need to

A couple ways to push this further if you are one of those runtime freaks:

  • JIT it. Wrap esscher_projection with @jit — Numba, JAX, or torch.compile, whichever your stack already uses. Tight numerical kernel, no Python overhead. Easiest path.
  • Drop to C++/CUDA. If JIT isn't enough, push the whole projection layer down to a raw extension. Nuclear option, but then you probably have to write the entire thing in cpp too, because a python wrapper can make things slow.
  • FP32, not FP64. 32-bit floats hold 7 decimal digits of precision, more than any physical constraint cares about. Half the memory bandwidth.
  • Single-pass covariance. Keep precomputed second-moment buffers around. Skip the redundant matmuls.

The projection ends up cheap enough to live inside a high-frequency on-policy training loop. Nobody notices.

6. System architecture and agent integration

In production, this is a standard contextual bandit with a strict mathematical projection layer bolted on.

The pipeline begins with the Context. The system observes the environment state. User features, session history, market state, whatever fits.

The decision maker is a standard policy network. It takes the context and outputs unconstrained logits. It dreams of absolute freedom and maximum reward. Out comes the raw prior distribution θ^\hat{\theta}.

Before the environment receives the action, θ^\hat{\theta} gets intercepted by the Esscher transform layer. Constrained-RL algorithms typically assume access to an Oracle — a procedure that returns the constraint-compliant projection of any proposed policy. Here, the Oracle is not assumed. It is computed. The Newton-Raphson solver wakes up. It computes the exact λ\vec{\lambda}, tilts the distribution, and outputs the strictly compliant θ\theta^*.

The system samples the final continuous action from θ\theta^* and executes it.

Then the environment returns a reward. Gradients flow backward through the differentiable Oracle to update the Brain.

7. Practical considerations and architectural adaptations

A few practical approaches/considerations when you actually use this thing.

Dimensionality reduction. If the action space has tens of thousands of items, inverting that big a covariance matrix is infeasible. So we compress the raw actions into a lower-dimensional basis registry (N=30N=30 or so). The Brain outputs weights for these basis functions. The Oracle tilts the basis functions, not the raw items.

Discretization bias. Evaluating continuous functions on discrete lookup tables introduces numerical drift. We fix this by computing the effective empirical means of the basis functions against the actual discrete payload. Those calibrated values feed into the Newton solver. Adherence stays exact.

0 times anything is zero. The Esscher transform is multiplicative scaling (pp^exp()p^* \propto \hat{p} \circ \exp(\dots)). It cannot project UP from zero. 0×anything=00 \times \text{anything} = 0. If the Brain outputs an exact zero prior for an action, it stays zero after projection. Even when resurrecting that probability is necessary to satisfy the constraint. So no ReLU. Nothing that produces hard zeros. Route all raw network outputs through a softmax. Strict positivity before the Oracle sees them.

Gradient alignment. The projection may greatly change the direction of the gradient. We can potentially fix it by adding a KL penalty to the loss (projection_kl) that compares pre and post-projection distributions. The alignment loss forces the Brain to feel the pain of the projection. It learns to output distributions that naturally sit closer to the constraint hyperplane.

8. What can this be used for?

Come on, it’s a mechanism for continuous decision-making under strict linear constraint. It applies broadly.

The below is suggested by AI because I don’t really care, but I think these make sense.

In computational advertising, a policy network can dynamically bid to maximize click-through rate. The Esscher projection guarantees the expected cost per action (CPA) across the campaign matches the advertiser's budget exactly.

In quant finance, an agent can allocate portfolio weights to maximize predicted alpha. The projection layer keeps the expected market beta within risk compliance limits.

In LLM alignment (RLHF), the policy can tune a model to maximize a reward function. Multidimensional constraints keep metrics like toxicity or verbosity below safety thresholds in expectation.

References

  1. Amari, S. I. (2016). Information Geometry and Its Applications (Vol. 194). Springer.
Previous

System Origins

Next Piece

Kinetic Composition →