Mean Collapse is a Serious Bottleneck of VLA

Ant Maze rollout — 3rd-person view (left) and top-view (right)

TL;DR

On OGBench Ant-Maze we train a VLM-based VLA (Qwen3-VL 2B + LoRA, OpenVLA-OFT style) under two cameras and two heads — top-view (MDP) coordinate-VQA and 3rd-person (POMDP) direct action. Both settings mean-collapse, but the bare top-view direct-action setting collapses to 0% success: the ant tips over, twitches, and freezes.

The cause specific to VLM-based VLAs is ViT-level feature aliasing — the ant covers only a few pixels in the top view, so the ViT patches are nearly constant across the episode and the LLM sees essentially the same input at every step. A higher-resolution top view fixes it. The coordinate-VQA reformulation also works: top-view success recovers to 0.53.

A small memory-based steering module (GLA Linear-SSM + Perceiver) on the frozen 3rd-person VLA lifts success from 0.16 → 0.84 — close to the 0.91 we get with both cameras + VQA, even though steering uses only the 3rd-person input that would be available in the real world. The same Markovian-on-$o_t$ failure shows up on a GUI agent (GELab-Zero) looping C → D → E → C → D → E → … on a Duolingo quiz.


1. MDP vs. POMDP — and Why Mean Collapse Happens

An MDP gives the agent the state directly: $\pi^{\star}_{\mathrm{MDP}}(a_t \mid s_t)$. A POMDP only gives observations $o_t \sim \Omega(\cdot \mid s_t)$, so the optimal policy must condition on history $h_t = (o_0, a_0, \ldots, o_t)$:

$$\pi^{\star}_{\mathrm{POMDP}}(a_t \mid h_t).$$

A Markovian policy on $o_t$ is suboptimal whenever $o_t \not\Rightarrow s_t$.

1.1 Why Mean Collapse Happens

A VLM-based VLA trained to act on the latest observation only learns $\pi(a_t \mid o_t)$. When several states alias to similar observations ($s^{(1)} \neq s^{(2)}$ but $o^{(1)} \approx o^{(2)} \approx o$), the training loss is minimized by the posterior-weighted average of the per-state optima:

$$ \pi^{\star}(a \mid o) \;=\; \sum_{i} p\!\left(s^{(i)} \mid o\right)\, \pi^{\star}\!\left(a \mid s^{(i)}\right). $$

When per-state optima point in opposite directions, the average is a degenerate output: a near-zero motion, a sideways drift, an oscillation. That is mean collapse — and on Ant-Maze it appears even on the privileged top-view input (Section 6).
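Spelled out numerically — a minimal sketch of the aliasing above (ours, not the paper's training code), with two equally likely states whose optimal actions point in opposite directions:

```python
# Two states alias to one observation; the loss-optimal Markovian policy
# outputs their posterior-weighted average -- a degenerate near-zero action.
p = {"s1": 0.5, "s2": 0.5}          # posterior p(s^(i) | o) under aliasing
pi_star = {"s1": +1.0, "s2": -1.0}  # per-state optimal actions, opposite signs

a_hat = sum(p[s] * pi_star[s] for s in p)  # the posterior-weighted average
print(a_hat)  # 0.0 -> near-zero motion: mean collapse
```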


2. Most VLA Benchmarks Hide the POMDP

LIBERO, SIMPLER, and UI-TARS all assume $\pi(a_t \mid o_t)$ suffices — the current scene already encodes the task-relevant state. Real long-horizon tasks (Minecraft, GUI agents, occluded manipulation) need the history $h_t$. Crafting a pickaxe requires knowing “I already chopped a tree”, which is not in the current screen.


3. A Controllable POMDP in OGBench Ant-Maze

We re-render OGBench Ant-Maze in MuJoCo with two cameras — the only thing that differs across our settings is observability:

Same world, two views. Left: top-view (MDP). Right: 3rd-person (POMDP).

4. Setup

4.1 Data

4.2 Two heads, one backbone

Both settings share Qwen3-VL 2B with LoRA on ViT and the LLM head, trained OpenVLA-OFT-style. They differ only in observability and the output head:

One backbone, two output heads (top-view waypoints vs. 3rd-person action chunk).
Qwen3-VL 2B backbone — vision encoder (ViT) feeding into the LLM decoder. LoRA adapters sit on the ViT and the LLM head; everything else stays frozen, OpenVLA-OFT-style.

5. Both Views Mean-Collapse — Bare Top-View Collapses to 0%

The bare top-view + direct-action setting collapses to 0%: the ant tips, twitches, freezes. Only the coordinate-VQA reformulation rescues the top view (see Section 7.3 for the numbers). The 3rd-person rollouts also collapse, but the agent moves more — sometimes succeeding opportunistically when the goal happens to come into view.

5.1 Top-view (MDP) — different loop-case examples

Loop-case example 1 — the ant tips over almost immediately at the start. The simulation marks the episode as failed and the rollout halts in place. (Full video.)
Loop-case example 2 — the agent twitches in place and cycles between the same two actions in a corridor.

5.2 Top-view — extreme stuck case

Loop-case example 3 — after ~step 40 the predicted waypoints collapse to a near-zero mean and the ant is frozen for the rest of the rollout.

5.3 3rd-person (POMDP) — opportunistic successes

3rd-person, no memory — opportunistic success.
3rd-person, no memory — opportunistic success on a different seed.

6. Why the Top-View Collapse is Worse — ViT-Level Feature Aliasing

Classical mean collapse (Section 1.1) is about state aliasing in the pixel space. On the top-view that should not apply — the state is fully visible. Yet the bare top-view collapses to 0%. The aliasing happens one level up, in the ViT feature space:

Top-view has the state, but loses it inside the ViT. 3rd-person is a POMDP, yet its features are sharper.

6.1 Sanity check — higher-resolution top-view fixes it

If feature aliasing is really the cause, more pixels per scene should sharpen the features and reduce collapse. Confirmed: feeding higher-resolution top-view images into the ViT (same backbone, same head, same prompt) recovers performance. When a VLM-based VLA mean-collapses on a supposedly-MDP input, suspect resolution / feature aliasing first.
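A toy version of this sanity check (ours — a crude average-pool stand-in for ViT patch embedding, not the actual backbone): at low resolution two distinct ant positions land inside the same 16-pixel patch and produce identical patch features, while at higher resolution they separate into different patches.

```python
import numpy as np

def patchify(img, patch=16):
    # Crude stand-in for ViT patch embedding: average-pool each patch.
    H, W = img.shape
    return img.reshape(H // patch, patch, W // patch, patch).mean(axis=(1, 3))

def frame(frac, res):
    # Top-view frame at resolution res: the "ant" is a single bright pixel
    # at a fixed physical position frac (fraction of the maze width).
    img = np.zeros((res, res))
    p = int(frac * res)
    img[p, p] = 1.0
    return img

# Two distinct states: the ant at 5% vs. 9% of the maze width.
lo = np.abs(patchify(frame(0.05, 64)) - patchify(frame(0.09, 64))).sum()
hi = np.abs(patchify(frame(0.05, 256)) - patchify(frame(0.09, 256))).sum()
print(lo, hi)  # lo == 0.0: both states hit the same patch -> features alias
```

At 64×64 both positions fall in patch (0, 0), so the patch features are bit-identical; at 256×256 they occupy different patches and the feature distance becomes nonzero.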


7. The Fix — A Memory-Based Steering Module on the Frozen VLA

Learnable steering (peach) builds a fixed-size memory $Z$ from past ViT features via a GLA temporal mixer + Perceiver compression, then injects it into the frozen ViT at layers 8 and 16 and into the action head through gated residuals. The frozen base VLA (green) is never re-trained.
The steering module is what solves mean collapse on the 3rd-person POMDP — lifting success past the top-view MDP baseline without re-training the base VLA.

7.1 Formulation

Past ViT features $h^{(\ell)}_{t-k}$ at depths $\ell \in \{8, 16\}$ for $k = 1, \ldots, K$ are mixed by a Gated Linear Attention (Linear SSM) with content-gated decay:

$$ \begin{aligned} S_k &= \sigma(g_k) \odot S_{k-1} + K_k^{\top} V_k, \\ O_k &= Q_k\, S_k. \end{aligned} $$
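In code, the recurrence can be sketched as follows — a minimal single-head NumPy version of ours (not the trained module; shapes are illustrative, and the decay gate is applied per row of $S$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gla_mix(Q, K, V, g):
    # Q, K, V, g: (T, d). Implements S_k = sigma(g_k) * S_{k-1} + K_k^T V_k,
    # O_k = Q_k S_k -- a fixed-size linear-SSM state instead of a KV cache.
    T, d = Q.shape
    S = np.zeros((d, d))
    O = np.zeros((T, d))
    for k in range(T):
        S = sigmoid(g[k])[:, None] * S + np.outer(K[k], V[k])  # gated decay + rank-1 update
        O[k] = Q[k] @ S                                        # read out with the query
    return O

rng = np.random.default_rng(0)
O = gla_mix(*(rng.standard_normal((8, 16)) for _ in range(4)))
print(O.shape)  # (8, 16)
```

The recurrent state $S$ has a fixed $d \times d$ size, so memory cost does not grow with the number of past frames.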

The mixed sequence is compressed to $m = 32$ memory tokens by a Perceiver cross-attention with learnable queries $Q_{\text{learn}} \in \mathbb{R}^{m \times d}$:

$$ Z^{(\ell)} \;=\; \mathrm{CrossAttn}\!\left(Q_{\text{learn}},\; O^{(\ell)}_{1:K}\right). $$
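A minimal single-head sketch of the Perceiver compression (ours; the token counts are hypothetical): $m = 32$ learnable queries cross-attend over the mixed history and return a fixed-size memory regardless of history length.

```python
import numpy as np

def cross_attn(Q, X):
    # Single-head cross-attention: each query softmaxes over all of X.
    logits = Q @ X.T / np.sqrt(Q.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
O_hist = rng.standard_normal((8 * 196, 64))  # e.g. K = 8 past frames x 196 patch tokens
Q_learn = rng.standard_normal((32, 64))      # m = 32 learnable queries
Z = cross_attn(Q_learn, O_hist)
print(Z.shape)  # (32, 64): fixed-size memory, independent of history length
```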

Memory is injected back into the frozen ViT through gated residuals, and a small residual is added to the frozen action head:

$$ \tilde{h}^{(\ell)}_t = h^{(\ell)}_t + \sigma(\alpha_\ell)\, \mathrm{CrossAttn}\!\left(h^{(\ell)}_t, Z^{(\ell)}\right), \qquad a_t = \underbrace{\pi_{\text{base}}(o_t)}_{\text{frozen}} + \sigma(\alpha_{\text{res}})\, \Delta\!\left(Z^{(16)}\right). $$

Identity-init gates ($\sigma(\alpha) = 0$) make the module a no-op at initialization — the base VLA is recovered exactly, only corrections are learned.
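The no-op property is easy to verify in a sketch (ours, not the released code): with the gate evaluating to 0, the injected correction vanishes and the frozen features pass through bit-exactly.

```python
import numpy as np

def cross_attn(Q, X):
    # Single-head cross-attention used for the memory read-out.
    w = np.exp(Q @ X.T / np.sqrt(Q.shape[1]))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

def inject(h, Z, gate):
    # h: (N, d) frozen ViT features at one depth; Z: (m, d) memory tokens.
    # gate plays the role of sigma(alpha_l); gate = 0.0 models the identity init.
    return h + gate * cross_attn(h, Z)

rng = np.random.default_rng(0)
h = rng.standard_normal((196, 64))
Z = rng.standard_normal((32, 64))
print(np.array_equal(inject(h, Z, gate=0.0), h))  # True: base VLA recovered exactly
```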

7.2 Why a State-Space Model?

7.3 Results — full picture in one table

| Setting | Camera | Output head | Success rate |
| --- | --- | --- | --- |
| Top-view, no VQA | Top-view (MDP) | Direct action chunk | 0.00 |
| Top-view, with VQA | Top-view (MDP) | Coordinate-VQA waypoints | 0.53 |
| 3rd-person, baseline | 3rd-person (POMDP) | Direct action chunk | 0.16 |
| 3rd-person, with steering | 3rd-person (POMDP) | Direct action + steering module | 0.84 |
| Top-view + 3rd-person, with VQA | Both cameras | Coordinate-VQA waypoints | 0.91 |
Bare top-view direct-action collapses to 0%; the coordinate-VQA reformulation rescues it. The steering module on the frozen 3rd-person VLA pushes success past the top-view MDP baseline. Feeding both cameras together with VQA gives the best number (0.91).

7.4 Reading the table

The best number — 0.91 with both cameras + VQA — comes from pairing the top-view (MDP layout) with the 3rd-person view (scene-rich, high-resolution, ViT-friendly content). Our reading: the top view supplies the MDP state, and the 3rd-person view supplies the kind of high-resolution patch features the ViT needs to avoid feature aliasing — the two views compensate for each other's failure modes.

The surprising part is how close the 3rd-person + steering module number (0.84) gets to the privileged-top-view 0.91. This matters because in the real world a top-view is almost never available; steering on the frozen 3rd-person VLA is the practically meaningful intervention, and the fact that the gap to a setup with the bird's-eye state is so small is the encouraging signal.

On where the remaining failures come from: most of them are not mean-collapse anymore. They are locomotion failures — the ant moves quickly mid-trajectory and tips over, after which the simulation marks the episode as failed. The mean-collapse pathology has largely been removed; what is left is a controller/dynamics issue rather than the VLA itself.


8. Bonus — Same Pathology on a GUI Agent

The same Markovian-on-$o_t$ failure shows up on a real GUI agent. GELab-Zero-4B-preview on a Duolingo “tap the matching pairs” quiz: solved tiles look almost identical to fresh ones, the agent has no memory of what it has tapped, so it re-taps a solved pair and the screen returns to the same state — a three-step loop:

$$C \rightarrow D \rightarrow E \rightarrow C \rightarrow D \rightarrow E \rightarrow \ldots$$
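The loop is reproducible with a toy policy (ours, not GELab-Zero's actual decision rule): a memoryless policy keyed only on the current screenshot deterministically revisits the same three observations forever.

```python
# pi(a | o_t): the action depends only on the current screen, never on history.
policy = {"C": "tap_solved_pair", "D": "confirm", "E": "dismiss"}
# Re-tapping an already-solved pair bounces the UI back to screen C.
transition = {("C", "tap_solved_pair"): "D",
              ("D", "confirm"): "E",
              ("E", "dismiss"): "C"}

o, seen = "C", []
for _ in range(9):
    seen.append(o)
    o = transition[(o, policy[o])]
print(seen)  # ['C', 'D', 'E', 'C', 'D', 'E', 'C', 'D', 'E']
```

Any policy that also conditioned on the tapped-so-far set would break the cycle at the second visit to C.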

GELab-Zero-4B-preview on the Duolingo quiz — at each step it sees only the current screenshot.
Loop-case definition: $o_t$ (screenshot) $\not\Rightarrow s_t$ (the solved set).
A separate loop-case example, distinct from the Ant videos — a live GELab-Zero rollout (sped up) entering the C → D → E → C → D → E → … cycle.

9. Takeaway


References

Acknowledgement

This research was made possible by the TPU Research Cloud (TRC) program. We are grateful to the TRC Team and Google for providing the TPU compute that made these experiments feasible.

Sub-project of Project Weasel, maintained by Kang Minkyu.