Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

Zhan, Yifan; Chen, Zhengqing; Wang, Qingjie; He, Zhuo; Niu, Muyao; Guo, Xiaoyang; Yin, Wei; Ren, Weiqiang; Zhang, Qian; Zheng, Yinqiang

CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

ECCV 2026

Yifan Zhan^1* Zhengqing Chen^2*‡ Qingjie Wang^2* Zhuo He³ Muyao Niu¹ Xiaoyang Guo² Wei Yin² Weiqiang Ren² Qian Zhang² Yinqiang Zheng^1†

¹The University of Tokyo ²Horizon Robotics ³University of Glasgow

^*Equal contribution ^‡Project lead ^†Corresponding author

Paper Code arXiv

Abstract

A major challenge in autonomous driving is the "long tail" of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average collision rate of 3s increases by 173%.

+17%

FVD Improvement
Identity Editing

−30%

Rotation Error
Action Control

−47%

Translation Error
Action Control

+173%

Collision Rate
Stress Testing

Method Overview

Overview of the CompoSIA framework. During training (top), structure, identity, and ego-action signals are explicitly decomposed and injected into a Flow Matching–based DiT backbone in a disentangled yet compositional manner. During sampling (bottom), different combinations of conditions enable controllable driving video generation, supporting multiple editing applications.

Color Legend

Structure Spatial layout & motion

Identity Appearance & type

Action Ego vehicle behavior

Compositional Control

Left car skids Ego sudden stop

Front car starts Object inserted Ego starts

Left car cuts in Ego stop

Front car starts ID changed Ego starts

Left car cuts in Ego sudden stop

Front car starts ID changed

Right car cuts in Ego sudden stop

Front truck skids toward ego ID changed

Front truck cuts in ID changed

Structure Control

Right car cuts in

Front vehicle removed

Identity Control

Right bus to this ID

Left car to this ID

Action Control

Change to the left line

Go straight

Change to the left line

Ablations

(Left) Without action conditioning, the ego motion becomes unstable. (Right) Without structure conditioning, surrounding vehicles fail to maintain consistent motion and spatial alignment.

Ablations on identity conditioning. Earlier stopping (T_id = 0.6) increases generation freedom but reduces similarity to the reference image, whereas later stopping (T_id = 0.2) enforces stronger identity preservation but increasingly anchors the generation to the reference image.

Planning Stress Testing

Video

Conditioning frames (past 6 frames)

BEV

Future prediction (next 6 frames)

Scenario 1

Before Editing (GT)

After Editing

Scenario 2

Before Editing (GT)

After Editing

BibTeX

@article{zhan2026composing,
  title={Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation},
  author={Zhan, Yifan and Chen, Zhengqing and Wang, Qingjie and He, Zhuo and Niu, Muyao and Guo, Xiaoyang and Yin, Wei and Ren, Weiqiang and Zhang, Qian and Zheng, Yinqiang},
  journal={arXiv preprint arXiv:2603.12864},
  year={2026}}