* Work done during an internship at Microsoft Research Asia. † Corresponding authors.
Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
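The "diagnose-and-act" loop described above can be illustrated with a deliberately simplified toy. This is only a sketch under strong assumptions: objects are axis-aligned 2D footprints, and a geometric overlap check stands in for the model's vision-grounded diagnosis. The names `Box`, `diagnose`, `act`, and `refine` are hypothetical and do not come from the SceneReVis code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical 2D footprint: center (x, y), width w along x, depth d along y."""
    name: str
    x: float
    y: float
    w: float
    d: float


def penetration(a: Box, b: Box) -> tuple[float, float]:
    """Overlap depth along x and y; both are positive only when footprints overlap."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox, oy


def diagnose(scene: list[Box], tol: float = 1e-9) -> list[tuple[Box, Box, float, float]]:
    """Toy stand-in for the diagnosis step: list every colliding pair."""
    issues = []
    for i, a in enumerate(scene):
        for b in scene[i + 1:]:
            ox, oy = penetration(a, b)
            if ox > tol and oy > tol:
                issues.append((a, b, ox, oy))
    return issues


def act(issue: tuple[Box, Box, float, float]) -> None:
    """Toy action: push the second object out along the axis of least penetration."""
    a, b, ox, oy = issue
    if ox <= oy:
        b.x += ox if b.x >= a.x else -ox
    else:
        b.y += oy if b.y >= a.y else -oy


def refine(scene: list[Box], max_turns: int = 10) -> int:
    """Alternate diagnose and act until no conflict remains; return the turns used."""
    for turn in range(max_turns):
        issues = diagnose(scene)
        if not issues:
            return turn
        act(issues[0])  # resolve one conflict per turn, then re-diagnose
    return max_turns


scene = [Box("bed", 0.0, 0.0, 2.0, 1.6), Box("nightstand", 0.9, 0.0, 0.5, 0.5)]
turns = refine(scene)
```

In this toy the nightstand is pushed clear of the bed in a single turn. The point is the control flow: every action is followed by a fresh diagnosis instead of the layout being emitted in one pass.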
The SceneReVis learning framework consists of: (a) Data Construction via reverse engineering, (b) Cold Start via Supervised Fine-Tuning, (c) Legend, and (d) Agentic RL with GRPO for self-reflective reasoning.
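For readers unfamiliar with GRPO, its core step is a group-relative advantage: several rollouts are sampled for the same prompt, and each rollout's reward is normalized by the group's mean and standard deviation. The sketch below shows only that normalization. The reward values are made up, the function name is hypothetical, and the choice of population standard deviation with a small `eps` is an assumption, not a detail taken from this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for four rollouts of the same scene prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean get a positive advantage and are reinforced; those below the mean get a negative one. When all rollouts in a group score the same, every advantage is zero and the group contributes no learning signal.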
Quantitative evaluation on standard room types. Averaged over bedrooms and living rooms, SceneReVis lowers the out-of-boundary rate to 2.0 and the collision rate to 4.5, and achieves the highest average visual-semantic score (8.8).
| Method | Bedroom OBR↓ | Bedroom CNR↓ | Bedroom Ra.↑ | Bedroom Spa.↑ | Bedroom Ac.↑ | Bedroom Avg.↑ | Living Room OBR↓ | Living Room CNR↓ | Living Room Ra.↑ | Living Room Spa.↑ | Living Room Ac.↑ | Living Room Avg.↑ | Avg. (Bed + Living) OBR↓ | Avg. (Bed + Living) CNR↓ | Avg. (Bed + Living) Ra.↑ | Avg. (Bed + Living) Spa.↑ | Avg. (Bed + Living) Ac.↑ | Avg. (Bed + Living) Avg.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DiffuScene | 46.2 | 33.8 | 8.2 | 6.2 | 6.9 | 7.1 | 28.0 | 41.5 | 7.3 | 7.5 | 8.2 | 7.7 | 37.1 | 37.7 | 7.8 | 6.9 | 7.6 | 7.4 |
| Respace | 14.7 | 41.9 | 6.8 | 6.1 | 7.0 | 6.6 | 11.5 | 40.3 | 6.9 | 6.6 | 7.3 | 6.9 | 13.1 | 41.1 | 6.9 | 6.4 | 7.2 | 6.8 |
| LayoutGPT | 34.3 | 45.9 | 7.8 | 5.6 | 7.6 | 7.0 | 15.0 | 35.7 | 7.5 | 7.3 | 8.7 | 7.8 | 24.7 | 40.8 | 7.7 | 6.5 | 8.2 | 7.5 |
| I-Design | 15.5 | 18.2 | 8.3 | 6.6 | 8.1 | 7.7 | 17.8 | 14.1 | 8.3 | 6.5 | 8.6 | 7.8 | 16.7 | 16.2 | 8.3 | 6.6 | 8.4 | 7.8 |
| Holodeck | 15.4 | 22.2 | 7.1 | 8.1 | 8.6 | 7.9 | 10.0 | 3.1 | 5.9 | 8.2 | 8.8 | 7.6 | 12.7 | 12.7 | 6.5 | 8.2 | 8.7 | 7.8 |
| LayoutVLM | 11.5 | 44.9 | 7.0 | 7.6 | 8.0 | 7.5 | 14.3 | 28.7 | 6.8 | 7.9 | 8.7 | 7.8 | 12.9 | 36.8 | 6.9 | 7.8 | 8.4 | 7.7 |
| SceneReVis (Ours) | 2.8 | 4.6 | 9.3 | 8.0 | 8.9 | 8.7 | 1.2 | 4.4 | 9.5 | 8.0 | 8.8 | 8.8 | 2.0 | 4.5 | 9.4 | 8.0 | 8.9 | 8.8 |
Metrics. Physical: OBR (Out-of-Boundary Rate), CNR (Collision Rate). Visual-Semantic: Ra. (Rationality), Spa. (Spatial), Ac. (Accuracy), Avg. Arrows: ↓ lower is better, ↑ higher is better.
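To make the two physical metrics concrete, here is a minimal sketch of how rates of this kind could be computed. The exact definitions behind the reported numbers are not given on this page, so everything here is an assumption: objects and the room are axis-aligned 2D rectangles `(x_min, y_min, x_max, y_max)`, each rate is the percentage of objects involved in a violation, and the function names are hypothetical.

```python
Rect = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlaps(a: Rect, b: Rect) -> bool:
    """True when two rectangles share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def out_of_boundary_rate(objects: list[Rect], room: Rect) -> float:
    """Percentage of objects extending past the room rectangle (assumed definition)."""
    outside = [
        o for o in objects
        if o[0] < room[0] or o[1] < room[1] or o[2] > room[2] or o[3] > room[3]
    ]
    return 100.0 * len(outside) / len(objects)


def collision_rate(objects: list[Rect]) -> float:
    """Percentage of objects overlapping at least one other object (assumed definition)."""
    colliding: set[int] = set()
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if overlaps(objects[i], objects[j]):
                colliding.update((i, j))
    return 100.0 * len(colliding) / len(objects)


room = (0.0, 0.0, 4.0, 3.0)
objects = [
    (0.0, 0.0, 2.0, 1.6),  # bed
    (1.8, 0.2, 2.3, 0.7),  # nightstand, overlaps the bed
    (3.5, 2.0, 4.5, 3.0),  # wardrobe, sticks out past the x = 4 wall
    (0.5, 2.0, 1.5, 2.8),  # desk, no violation
]

obr = out_of_boundary_rate(objects, room)  # 25.0 for this toy layout
cnr = collision_rate(objects)              # 50.0 for this toy layout
```

Under these assumed definitions, one colliding pair counts both of its objects, which is why a single overlap among four objects gives 50.0 and not 25.0.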
Evaluation on long-tail room types unseen during training, demonstrating SceneReVis’s strong generalization capability.
| Method | Dining Room OBR↓ | Dining Room CNR↓ | Dining Room Ra.↑ | Dining Room Spa.↑ | Dining Room Ac.↑ | Dining Room Avg.↑ | Study Room OBR↓ | Study Room CNR↓ | Study Room Ra.↑ | Study Room Spa.↑ | Study Room Ac.↑ | Study Room Avg.↑ | Avg. (Dining + Study) OBR↓ | Avg. (Dining + Study) CNR↓ | Avg. (Dining + Study) Ra.↑ | Avg. (Dining + Study) Spa.↑ | Avg. (Dining + Study) Ac.↑ | Avg. (Dining + Study) Avg.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LayoutGPT | 6.2 | 26.2 | 6.9 | 7.7 | 8.8 | 7.8 | 16.6 | 52.1 | 8.5 | 7.2 | 8.3 | 8.0 | 11.4 | 39.2 | 7.7 | 7.5 | 8.6 | 7.9 |
| I-Design | 17.6 | 0.7 | 7.8 | 5.7 | 8.8 | 7.4 | 16.3 | 4.5 | 8.2 | 6.9 | 8.9 | 8.0 | 17.0 | 2.6 | 8.0 | 6.3 | 8.9 | 7.7 |
| Holodeck | 11.9 | 3.3 | 6.4 | 8.0 | 8.5 | 7.6 | 13.7 | 5.1 | 6.7 | 8.4 | 8.9 | 8.0 | 12.8 | 4.2 | 6.6 | 8.2 | 8.7 | 7.8 |
| LayoutVLM | 19.2 | 27.0 | 6.5 | 7.8 | 8.8 | 7.7 | 12.3 | 36.6 | 6.9 | 7.6 | 8.6 | 7.7 | 15.8 | 31.8 | 6.7 | 7.7 | 8.7 | 7.7 |
| SceneReVis (Ours) | 0.1 | 2.8 | 8.1 | 8.3 | 8.6 | 8.3 | 0.5 | 2.1 | 8.2 | 8.3 | 8.9 | 8.5 | 0.3 | 2.5 | 8.2 | 8.3 | 8.8 | 8.4 |
SceneReVis excels at fixing chaotic layouts and completing missing furniture, achieving the best overall quality across all three conditions.
| Method | Cond 1 (Chaotic & Missing) OBR↓ | Cond 1 (Chaotic & Missing) CNR↓ | Cond 1 (Chaotic & Missing) Ra.↑ | Cond 1 (Chaotic & Missing) Spa.↑ | Cond 1 (Chaotic & Missing) Ac.↑ | Cond 1 (Chaotic & Missing) Avg.↑ | Cond 2 (Chaotic Only) OBR↓ | Cond 2 (Chaotic Only) CNR↓ | Cond 2 (Chaotic Only) Ra.↑ | Cond 2 (Chaotic Only) Spa.↑ | Cond 2 (Chaotic Only) Ac.↑ | Cond 2 (Chaotic Only) Avg.↑ | Cond 3 (Missing Only) OBR↓ | Cond 3 (Missing Only) CNR↓ | Cond 3 (Missing Only) Ra.↑ | Cond 3 (Missing Only) Spa.↑ | Cond 3 (Missing Only) Ac.↑ | Cond 3 (Missing Only) Avg.↑ | Overall Average OBR↓ | Overall Average CNR↓ | Overall Average Ra.↑ | Overall Average Spa.↑ | Overall Average Ac.↑ | Overall Average Avg.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DiffuScene | 50.8 | 30.3 | 7.5 | 7.3 | 8.5 | 7.8 | 62.8 | 24.1 | 7.4 | 7.0 | 8.7 | 7.7 | 27.8 | 30.2 | 7.3 | 7.0 | 7.7 | 7.3 | 47.1 | 28.2 | 7.4 | 7.1 | 8.3 | 7.6 |
| LayoutVLM | 10.0 | 39.7 | 6.7 | 6.3 | 8.7 | 7.2 | 8.1 | 63.1 | 7.2 | 6.6 | 8.3 | 7.4 | 6.5 | 31.0 | 6.5 | 6.4 | 8.7 | 7.2 | 8.2 | 44.6 | 6.8 | 6.4 | 8.6 | 7.3 |
| SceneReVis (Ours) | 1.1 | 1.8 | 8.8 | 8.1 | 8.9 | 8.6 | 2.7 | 1.3 | 7.7 | 7.9 | 8.5 | 8.0 | 1.1 | 1.0 | 7.9 | 7.8 | 8.7 | 8.1 | 1.6 | 1.4 | 8.1 | 7.9 | 8.7 | 8.2 |
SceneReVis produces physically plausible layouts with no collisions, while baseline methods exhibit various spatial violations.
Even on unseen room types, SceneReVis maintains high spatial accuracy and semantic coherence.
Side-by-side comparisons further highlight SceneReVis’s superior generation quality over existing methods.
SceneReVis successfully repairs chaotic layouts and fills missing furniture across all three conditions.
Additional results across diverse room types — gym, entertainment room, office, and more — showcasing the versatility of SceneReVis.
@article{zhao2026scenerevis,
title={SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL},
author={Yang Zhao and Shizhao Sun and Meisheng Zhang and Yingdong Shi and Xubo Yang and Jiang Bian},
journal={arXiv preprint arXiv:2602.09432},
year={2026}
}