Architecting High-Efficiency Embodied Intelligence via OneVL Latent Representations

Embodied intelligence research is working to reconcile large Vision-Language Models (VLMs) with the real-time latency requirements of autonomous driving and robotic control. Traditional approaches rely on explicit autoregressive (AR) Chain-of-Thought (CoT) reasoning, in which a model verbalizes its reasoning steps before arriving at a decision. This makes the reasoning human-interpretable, but it is computationally expensive: every reasoning token must be generated sequentially, and the resulting inference time is often incompatible with high-speed physical environments.

OneVL, a framework developed by the Xiaomi Embodied Intelligence Team, departs from this sequential paradigm. Using latent CoT, it compresses the reasoning process into a small bottleneck of visual and language latent tokens. This avoids the latency penalty of explicit verbalization while, in the benchmark results reported below, outperforming larger models in predictive accuracy.

The Structural Paradigm of OneVL and Latent Bottlenecking

At the core of OneVL's architectural innovation is the deliberate implementation of a tight information bottleneck. Rather than allowing the model to drift into verbose, redundant verbalizations, the architecture constrains the reasoning capacity through a specific allocation of latent tokens.

The model utilizes a precise configuration of 35 visual latent tokens and 20 language latent tokens. This specific numerical constraint serves a dual purpose: it forces the model to distill the complex, high-dimensional causal structure of a driving scene into a compact representation, and it actively discourages the phenomenon of rote memorization. By limiting the capacity of the latent space, the model is compelled to develop generalizable representations that capture the underlying physical and semantic logic of the environment, rather than simply mapping input pixels to output trajectories via memorized patterns.
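The token budget described above can be written down as a small configuration object. This is only an illustrative sketch: the 35/20 split comes from the text, while the class name, the placeholder token strings, and the visual-first ordering are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentBudget:
    """Latent token allocation; the 35/20 split is taken from the article."""
    visual_tokens: int = 35
    language_tokens: int = 20

    @property
    def total(self) -> int:
        # Total size of the reasoning bottleneck.
        return self.visual_tokens + self.language_tokens


def build_latent_slots(budget):
    """Return placeholder token names for the latent slots.

    The token names and the visual-first ordering are hypothetical.
    """
    visual = [f"<vis_latent_{i}>" for i in range(budget.visual_tokens)]
    language = [f"<lang_latent_{i}>" for i in range(budget.language_tokens)]
    return visual + language


budget = LatentBudget()
slots = build_latent_slots(budget)
print(budget.total, slots[0], slots[-1])
```

Keeping the budget in one frozen object makes the bottleneck size an explicit, fixed hyperparameter rather than something that can drift between training and inference code.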

To maintain the utility of this compressed information, OneVL employs two distinct auxiliary decoders that ground the latent space in meaningful modalities:

  • The language auxiliary decoder recovers human-readable Chain-of-Thought (CoT) text from the language latent states. This provides the semantic intent necessary for human oversight, covering scene interpretation, object analysis, and specific driving decisions.
  • The visual auxiliary decoder acts as a world model. It predicts future-frame visual tokens at increments of +0.5s and +1.0s. This component is critical because it forces the latent space to capture physical scene dynamics, providing a causal compression target that linguistic descriptions alone cannot facilitate.
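To make the visual decoder's prediction targets concrete, the sketch below maps the +0.5s and +1.0s horizons to frame indices in a recorded clip. The horizons come from the text; the function name and the 10 fps camera rate in the example are assumptions, not details of OneVL.

```python
def future_frame_indices(current_index, fps, horizons_s=(0.5, 1.0)):
    """Map prediction horizons in seconds to frame indices.

    `fps` is a hypothetical camera frame rate; the article does not state one.
    """
    return [current_index + round(h * fps) for h in horizons_s]


# Example: a clip sampled at an assumed 10 frames per second.
targets = future_frame_indices(current_index=100, fps=10)
print(targets)
```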

During the inference phase, the operational efficiency is realized through the discarding of these auxiliary decoders. All latent tokens are prefilled in a single parallel pass, which allows the model to match the latency of simple answer-only AR predictions, effectively eliminating the "reasoning tax" traditionally associated with CoT.
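A rough way to see why prefilling latent tokens removes the "reasoning tax" is to count sequential forward passes. The sketch below is a simplified cost model, not a measurement: it assumes one decode step per generated token and treats the prompt, including any latent tokens, as a single parallel prefill pass. The token counts in the example are hypothetical.

```python
def explicit_cot_passes(cot_tokens, answer_tokens):
    """Explicit CoT: every reasoning and answer token is decoded one at a time."""
    return {"prefill_passes": 1, "decode_steps": cot_tokens + answer_tokens}


def latent_cot_passes(latent_tokens, answer_tokens):
    """Latent CoT: latent tokens join the prompt and are prefilled in parallel,
    so they add no sequential decode steps."""
    return {"prefill_passes": 1, "decode_steps": answer_tokens}


def answer_only_passes(answer_tokens):
    """Answer-only AR baseline with no reasoning tokens at all."""
    return {"prefill_passes": 1, "decode_steps": answer_tokens}


# Hypothetical sizes: 200 CoT tokens, 20 answer tokens, 55 latent tokens.
explicit = explicit_cot_passes(cot_tokens=200, answer_tokens=20)
latent = latent_cot_passes(latent_tokens=55, answer_tokens=20)
baseline = answer_only_passes(answer_tokens=20)
print(explicit, latent, baseline)
```

Under this model the latent variant has the same sequential cost as the answer-only baseline, which is consistent with the near-identical latencies reported in the benchmark tables.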

Comparative Performance Analysis across Autonomous Driving Benchmarks

OneVL has been evaluated on four benchmarks: NAVSIM, ROADWork, Impromptu, and Alpamayo-R1. Results for the first three are reported below. The 4B-parameter OneVL model is reported to outperform models with significantly higher parameter counts, such as the 8B-parameter baselines on NAVSIM.

NAVSIM Benchmark Evaluation

On the NAVSIM benchmark, which tests the model's ability to handle complex navigation tasks, OneVL achieves a PDM-score of 88.84. This is the highest score among the compared methods, ahead of the 8B-parameter AdaThinkDrive (86.20) and LaST-VLA (87.30).

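| Method        | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability  |
|---------------|------------|-------------|---------------|-------------------|
| AdaThinkDrive | 8B         | 86.20       | -             | Language          |
| LaST-VLA      | 8B         | 87.30       | -             | -                 |
| AR Answer     | 4B         | 87.47       | 4.49          | -                 |
| AR CoT+Answer | 4B         | 88.29       | 6.58          | Language          |
| COCONUT       | 4B         | 84.84       | 5.93          | -                 |
| CODI          | 4B         | 83.92       | 8.62          | -                 |
| SIM-CoT       | 4B         | 84.21       | 10.86         | Language          |
| OneVL         | 4B         | 88.84       | 4.46          | Vision + Language |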

The latency figures show the efficiency of the latent approach. OneVL's prefill inference takes 4.46s, essentially identical to the 4.49s of the AR Answer baseline, while achieving a higher PDM-score. Compared with the explicit AR CoT+Answer method, which needs 6.58s to complete its reasoning cycle, OneVL's latency is about 32% lower.
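The 32% figure can be reproduced from the table with a short calculation. The sketch below uses the NAVSIM latencies quoted above; the function name is illustrative.

```python
def latency_reduction(baseline_s, candidate_s):
    """Fraction by which the candidate's latency is lower than the baseline's."""
    return (baseline_s - candidate_s) / baseline_s


# NAVSIM latencies in seconds, as reported in the table.
navsim = {"AR Answer": 4.49, "AR CoT+Answer": 6.58, "OneVL": 4.46}

reduction = latency_reduction(navsim["AR CoT+Answer"], navsim["OneVL"])
speedup = navsim["AR CoT+Answer"] / navsim["OneVL"]
print(f"{reduction:.1%} lower latency, {speedup:.2f}x speedup")
```

Note that a 32% reduction in latency corresponds to a speedup of roughly 1.48x, so the two ways of stating the result should not be confused.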

ROADWork Benchmark Evaluation

The ROADWork benchmark evaluates trajectory prediction by spatial error, measured in pixels. OneVL records the lowest ADE and FDE among the compared methods.

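| Method        | ADE (pixel) ↓ | FDE (pixel) ↓ | Latency (s) ↓ | Interpretability  |
|---------------|---------------|---------------|---------------|-------------------|
| YNet          | 22.68         | 80.78         | -             | -                 |
| AR Answer     | 15.98         | 40.29         | 4.74          | -                 |
| AR CoT+Answer | 13.18         | 29.98         | 10.74         | Language          |
| COCONUT       | 15.44         | 38.60         | 6.06          | -                 |
| CODI          | 16.45         | 44.28         | 6.73          | -                 |
| SIM-CoT       | 16.49         | 44.32         | 6.19          | Language          |
| OneVL         | 12.49         | 28.80         | 4.71          | Vision + Language |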

OneVL achieves an Average Displacement Error (ADE) of 12.49 pixels and a Final Displacement Error (FDE) of 28.80 pixels. Relative to the previous SOTA, YNet, which recorded an ADE of 22.68 and an FDE of 80.78, this is roughly a 45% reduction in ADE and a 64% reduction in FDE. Furthermore, the model is more than twice as fast as the explicit AR CoT+Answer baseline, which takes 10.74s against OneVL's 4.71s.
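For readers unfamiliar with the two metrics, the sketch below computes them using their standard definitions: ADE is the mean Euclidean distance between predicted and ground-truth waypoints, and FDE is the distance at the final waypoint. The benchmark's exact evaluation protocol (horizon length, number of waypoints) is not given in the text, so the trajectories here are made-up values.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) waypoints."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    ade = sum(dists) / len(dists)  # average over all waypoints
    fde = dists[-1]                # error at the last waypoint only
    return ade, fde


# Toy example with hypothetical coordinates.
pred = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)
```

Because prediction errors usually grow with the horizon, FDE is typically larger than ADE, as in the ROADWork numbers above.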

Impromptu Benchmark Evaluation

The Impromptu benchmark evaluates trajectory accuracy with spatial error measured in meters rather than pixels.

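| Method        | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability  |
|---------------|-----------|-----------|---------------|-------------------|
| Cosmos-Reason | 2.86      | 7.42      | -             | Language          |
| AR Answer     | 3.27      | 3.06      | 9.59          | -                 |
| AR CoT+Answer | 2.99      | 3.51      | 8.54          | Language          |
| COCONUT       | 3.29      | 3.76      | 9.48          | -                 |
| CODI          | 3.22      | 3.85      | 9.25          | -                 |
| SIM-CoT       | 3.40      | 3.78      | 9.85          | Language          |
| OneVL         | 2.62      | 3.23      | 7.53          | Vision + Language |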

In this environment, OneVL achieves an ADE of 1.34 and an FDE of 3.70. This performance surpasses both the dedicated Impromptu VLA (1.60 / 4.28) and the explicit AR CoT (1.42 / 3.96), demonstrating the model's robustness in unpredictable scenarios.

The Three-Stage Optimization Pipeline

Training a model like OneVL is a complex orchestration problem. Because the main VLM, the language auxiliary decoder, and the visual auxiliary decoder all operate with fundamentally different learning objectives, a simple end-to-end training pass is insufficient. The Xiaomi team developed a principled three-stage pipeline to ensure these components align without losing the efficiency of the latent bottleneck.

  1. Stage One: End-to-End Trajectory Training
    The main VLM is trained end-to-end on trajectory prediction tasks. During this phase, latent tokens are embedded within each training sample. This stage is vital for allowing the model to develop the initial latent representations and establish the necessary information routing pathways that will later be utilized by the decoders.

  2. Stage Two: Auxiliary Decoder Alignment
    Once the main model has established a stable latent space, it is frozen and the training focus shifts to the auxiliary decoders. The language decoder is trained to accurately decode CoT text from the latent states, while the visual decoder is trained to predict future frames. This ensures that the latent tokens actually contain the semantic and physical information required for the decoders to function.

  3. Stage Three: Joint Fine-Tuning
    In the final stage, all three components—the main VLM, the language decoder, and the visual decoder—are jointly fine-tuned. This is the most critical phase. Gradients from both the language and visual decoders flow back into the main VLM. This creates a "virtuous cycle" where the requirement to satisfy both linguistic and visual predictive objectives further tightens and refines the information bottleneck.

The necessity of this pipeline is underscored by empirical evidence: skipping this multi-stage alignment causes a catastrophic performance drop of 21.71 points. Additionally, the visual decoder is identified as the component providing the single largest performance gain (+0.87), highlighting the importance of the "world model" aspect of the architecture.
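The three stages above can be summarized as a schedule of which components receive gradients and which objectives are active. This is a schematic sketch, not the team's training code: the component and loss names are invented for illustration, the decoders are assumed to be untouched in stage one, and keeping the trajectory loss active in stage three is an assumption.

```python
COMPONENTS = ("main_vlm", "language_decoder", "visual_decoder")

# Hypothetical schedule; names are illustrative labels only.
STAGES = [
    {   # Stage one: main VLM learns trajectories with latent tokens in place.
        "name": "trajectory_training",
        "trainable": {"main_vlm"},
        "losses": {"trajectory"},
    },
    {   # Stage two: main VLM frozen, decoders learn to read the latents.
        "name": "decoder_alignment",
        "trainable": {"language_decoder", "visual_decoder"},
        "losses": {"cot_text", "future_frames"},
    },
    {   # Stage three: everything trains; decoder gradients reach the VLM.
        # Keeping the trajectory loss here is an assumption.
        "name": "joint_finetuning",
        "trainable": set(COMPONENTS),
        "losses": {"trajectory", "cot_text", "future_frames"},
    },
]


def frozen_components(stage):
    """Components that receive no gradient updates in the given stage."""
    return set(COMPONENTS) - stage["trainable"]


for stage in STAGES:
    print(stage["name"], "frozen:", sorted(frozen_components(stage)))
```

Expressing the pipeline as data makes the key property of stage two easy to check: the main VLM is the only frozen component, so the decoders must adapt to the latent space rather than reshape it.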

Real-World Deployment and the MLP Variant

While the standard OneVL architecture is efficient enough for research and high-level simulation, real-world deployment on edge computing devices (such as those found in autonomous vehicles) requires even lower latency.

To address this, a specialized Multi-Layer Perceptron (MLP) variant has been developed. By appending a compact MLP head directly onto the Qwen3-VL backbone, the model can bypass the complex decoding processes entirely during real-time operation. This allows OneVL to predict trajectories in a single feed-forward pass.
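The idea of a compact MLP head can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the feature dimension, hidden width, number of waypoints, random initialization, and the use of a fixed vector in place of the backbone's output. The article does not specify the head's actual design.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Create a two-layer MLP as a list of (weights, biases) pairs."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out

    return [layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)]


def mlp_forward(params, features):
    """Single feed-forward pass: linear, ReLU, linear."""
    x = features
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x


def to_waypoints(flat):
    """Group a flat output vector into (x, y) waypoint pairs."""
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


NUM_WAYPOINTS = 8                      # hypothetical trajectory length
params = make_mlp(in_dim=16, hidden_dim=32, out_dim=2 * NUM_WAYPOINTS)
features = [0.5] * 16                  # stand-in for pooled backbone features
waypoints = to_waypoints(mlp_forward(params, features))
print(len(waypoints), waypoints[0])
```

The point of the design is that the whole trajectory comes out of one forward pass, with no token-by-token decoding loop.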

The latency improvement for this variant is substantial:

  • The MLP variant achieves a latency of 0.24s.
  • This corresponds to an operating frequency of roughly 4.17 Hz (1 / 0.24s).
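The frequency follows directly from the latency as its reciprocal, assuming one prediction is issued per completed forward pass with no pipelining. The exact value for 0.24s is 4.1666... Hz, which truncates to 4.16 and rounds to 4.17.

```python
def control_rate_hz(latency_s):
    """Predictions per second when each one takes `latency_s` seconds."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_s


rate = control_rate_hz(0.24)
print(f"{rate:.2f} Hz")
```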

This capability allows OneVL to leverage the high-quality, multimodal latent supervision learned during the three-stage training process while delivering the ultra-low latency required for real-time physical control.

Analytical Conclusion

OneVL is a significant step toward efficient embodied intelligence. By implementing a latent Chain-of-Thought mechanism, the Xiaomi Embodied Intelligence Team addresses one of the most pressing trade-offs in the field: the tension between reasoning depth and inference speed.

OneVL's contribution lies not only in compressing reasoning into 55 latent tokens (35 visual and 20 language), but in using auxiliary decoders to ensure that this compression does not discard critical causal information. That a 4B-parameter model outperforms 8B-parameter models on NAVSIM, and leads the ROADWork comparison, suggests that parameter count is not a substitute for architectural design. Furthermore, the three-stage training pipeline provides a repeatable framework for aligning disparate learning objectives (semantic, physical, and predictive) into a single, unified latent representation. As autonomous systems move from controlled environments to the unpredictable physical world, the ability to perform high-fidelity, multimodal reasoning at sub-second latency will be a defining characteristic of successful embodied agents. OneVL offers a blueprint for this future, indicating that reasoning can be both deep and fast.
