SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Yang Zhao1,2 Shizhao Sun2 Meisheng Zhang2,3 Yingdong Shi2,4 Xubo Yang1 Jiang Bian2
1Shanghai Jiao Tong University    2Microsoft Research Asia    3Peking University    4ShanghaiTech University

* Work done during an internship at Microsoft Research Asia.   † Corresponding authors.

Teaser figure comparing generation paradigms

Why Self-Reflection Matters

(a) One-Pass Generation lacks intermediate reasoning, leading to severe physical violations. (b) Post-Processing Generation tends to become trapped in local optima. (c) SceneReVis employs a self-reflection paradigm to ensure physical plausibility and aesthetic coherence.

Abstract

Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
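The "diagnose-and-act" loop can be sketched as follows. This is a minimal illustration assuming a 2D footprint representation, where each object is (x, y, width, depth) centered at (x, y); the `Scene`, `diagnose`, and `revise` names are hypothetical stand-ins for the framework's vision-grounded components, not the released SceneReVis API.

```python
# Minimal sketch of the "diagnose-and-act" self-reflection loop. All names
# here are illustrative, not the released SceneReVis API.
from dataclasses import dataclass


@dataclass
class Scene:
    objects: dict  # name -> (x, y, width, depth), axis-aligned footprint


def diagnose(scene):
    """Detect pairwise footprint overlaps (a stand-in for the framework's
    vision-grounded conflict detection)."""
    conflicts = []
    items = list(scene.objects.items())
    for i, (name_a, (ax, ay, aw, ad)) in enumerate(items):
        for name_b, (bx, by, bw, bd) in items[i + 1:]:
            if abs(ax - bx) < (aw + bw) / 2 and abs(ay - by) < (ad + bd) / 2:
                conflicts.append((name_a, name_b))
    return conflicts


def revise(scene, conflicts):
    """Toy 'act' step: shift the second object of each conflicting pair
    sideways by its own width."""
    for _, name_b in conflicts:
        x, y, w, d = scene.objects[name_b]
        scene.objects[name_b] = (x + w, y, w, d)


def self_reflect(scene, max_turns=5):
    """Iterate diagnose-and-act until no conflicts remain or the turn
    budget is exhausted."""
    for _ in range(max_turns):
        conflicts = diagnose(scene)
        if not conflicts:
            break
        revise(scene, conflicts)
    return scene
```

For example, a bed at (0, 0) and a desk at (1, 0), each with a 2 × 2 footprint, overlap on the first diagnosis; one revision pass separates them and the loop terminates.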

Method

SceneReVis training pipeline overview

Training Pipeline Overview

The SceneReVis learning framework consists of: (a) data construction via reverse engineering, (b) a cold start via Supervised Fine-Tuning, and (c) Agentic RL with GRPO for self-reflective reasoning.
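The GRPO signal in the Agentic RL stage can be sketched as follows: for each prompt, a group of candidate scenes is sampled and scored, and each rollout's advantage is its reward normalized by the group's mean and standard deviation, which removes the need for a learned critic. The `scene_reward` weighting below is an illustrative assumption, not the paper's reward design.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its group (GRPO's critic-free baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def scene_reward(cnr, obr, visual_avg, w_phys=0.5):
    """Illustrative scalar reward for one generated scene: reward
    visual-semantic quality, penalize collision (cnr) and out-of-boundary
    (obr) rates. The weighting is an assumption, not the paper's."""
    return visual_avg - w_phys * (cnr + obr)
```

Rollouts that score above their group's mean receive positive advantages, pushing the policy toward conflict-free layouts within each group of samples.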

Quantitative Results

Bedroom & Living Room Comparison

Quantitative evaluation on standard room types. SceneReVis significantly reduces collision and out-of-boundary rates while achieving the highest visual quality scores.

Bedroom

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 46.2 | 33.8 | 8.2 | 6.2 | 6.9 | 7.1 |
| Respace | 14.7 | 41.9 | 6.8 | 6.1 | 7.0 | 6.6 |
| LayoutGPT | 34.3 | 45.9 | 7.8 | 5.6 | 7.6 | 7.0 |
| I-Design | 15.5 | 18.2 | 8.3 | 6.6 | 8.1 | 7.7 |
| Holodeck | 15.4 | 22.2 | 7.1 | 8.1 | 8.6 | 7.9 |
| LayoutVLM | 11.5 | 44.9 | 7.0 | 7.6 | 8.0 | 7.5 |
| SceneReVis (Ours) | 2.8 | 4.6 | 9.3 | 8.0 | 8.9 | 8.7 |

Living Room

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 28.0 | 41.5 | 7.3 | 7.5 | 8.2 | 7.7 |
| Respace | 11.5 | 40.3 | 6.9 | 6.6 | 7.3 | 6.9 |
| LayoutGPT | 15.0 | 35.7 | 7.5 | 7.3 | 8.7 | 7.8 |
| I-Design | 17.8 | 14.1 | 8.3 | 6.5 | 8.6 | 7.8 |
| Holodeck | 10.0 | 3.1 | 5.9 | 8.2 | 8.8 | 7.6 |
| LayoutVLM | 14.3 | 28.7 | 6.8 | 7.9 | 8.7 | 7.8 |
| SceneReVis (Ours) | 1.2 | 4.4 | 9.5 | 8.0 | 8.8 | 8.8 |

Avg. (Bedroom + Living Room)

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 37.1 | 37.7 | 7.8 | 6.9 | 7.6 | 7.4 |
| Respace | 13.1 | 41.1 | 6.9 | 6.4 | 7.2 | 6.8 |
| LayoutGPT | 24.7 | 40.8 | 7.7 | 6.5 | 8.2 | 7.5 |
| I-Design | 16.7 | 16.2 | 8.3 | 6.6 | 8.4 | 7.8 |
| Holodeck | 12.7 | 12.7 | 6.5 | 8.2 | 8.7 | 7.8 |
| LayoutVLM | 12.9 | 36.8 | 6.9 | 7.8 | 8.4 | 7.7 |
| SceneReVis (Ours) | 2.0 | 4.5 | 9.4 | 8.0 | 8.9 | 8.8 |

Physical metrics (lower is better): OBR = Out-of-Boundary Rate, CNR = Collision Rate. Visual-Semantic metrics (higher is better): Ra. = Rationality, Spa. = Spatial, Ac. = Accuracy, Avg. = their average.
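As a rough sketch of how such physical metrics can be computed from 2D footprints (the paper's exact protocol, e.g. oriented or full 3D boxes, may differ), each object below is (x, y, width, depth) centered at (x, y) in a rectangular room:

```python
# Hedged sketch of OBR/CNR computation over axis-aligned 2D footprints;
# illustrative only, not the paper's evaluation code.

def aabb(obj):
    """Axis-aligned footprint (xmin, ymin, xmax, ymax) from center + size."""
    x, y, w, d = obj
    return (x - w / 2, y - d / 2, x + w / 2, y + d / 2)


def overlaps(a, b):
    """True if the two footprints intersect with positive area."""
    ax0, ay0, ax1, ay1 = aabb(a)
    bx0, by0, bx1, by1 = aabb(b)
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def collision_rate(objects):
    """CNR: percentage of objects involved in at least one overlap."""
    hit = set()
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if overlaps(objects[i], objects[j]):
                hit.update((i, j))
    return 100.0 * len(hit) / len(objects) if objects else 0.0


def out_of_boundary_rate(objects, room_w, room_d):
    """OBR: percentage of objects whose footprint leaves the room
    rectangle [0, room_w] x [0, room_d]."""
    out = 0
    for o in objects:
        x0, y0, x1, y1 = aabb(o)
        if x0 < 0 or y0 < 0 or x1 > room_w or y1 > room_d:
            out += 1
    return 100.0 * out / len(objects) if objects else 0.0
```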

Long-tail Generalization: Dining Room & Study Room

Evaluation on long-tail room types unseen during training, demonstrating SceneReVis’s strong generalization capability.

Dining Room

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| LayoutGPT | 6.2 | 26.2 | 6.9 | 7.7 | 8.8 | 7.8 |
| I-Design | 17.6 | 0.7 | 7.8 | 5.7 | 8.8 | 7.4 |
| Holodeck | 11.9 | 3.3 | 6.4 | 8.0 | 8.5 | 7.6 |
| LayoutVLM | 19.2 | 27.0 | 6.5 | 7.8 | 8.8 | 7.7 |
| SceneReVis (Ours) | 0.1 | 2.8 | 8.1 | 8.3 | 8.6 | 8.3 |

Study Room

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| LayoutGPT | 16.6 | 52.1 | 8.5 | 7.2 | 8.3 | 8.0 |
| I-Design | 16.3 | 4.5 | 8.2 | 6.9 | 8.9 | 8.0 |
| Holodeck | 13.7 | 5.1 | 6.7 | 8.4 | 8.9 | 8.0 |
| LayoutVLM | 12.3 | 36.6 | 6.9 | 7.6 | 8.6 | 7.7 |
| SceneReVis (Ours) | 0.5 | 2.1 | 8.2 | 8.3 | 8.9 | 8.5 |

Avg. (Dining + Study)

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| LayoutGPT | 11.4 | 39.2 | 7.7 | 7.5 | 8.6 | 7.9 |
| I-Design | 17.0 | 2.6 | 8.0 | 6.3 | 8.9 | 7.7 |
| Holodeck | 12.8 | 4.2 | 6.6 | 8.2 | 8.7 | 7.8 |
| LayoutVLM | 15.8 | 31.8 | 6.7 | 7.7 | 8.7 | 7.7 |
| SceneReVis (Ours) | 0.3 | 2.5 | 8.2 | 8.3 | 8.8 | 8.4 |


Goal-Oriented Scene Optimization

SceneReVis excels at fixing chaotic layouts and completing missing furniture, achieving the best overall quality across all three conditions.

Cond 1: Chaotic & Missing

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 50.8 | 30.3 | 7.5 | 7.3 | 8.5 | 7.8 |
| LayoutVLM | 10.0 | 39.7 | 6.7 | 6.3 | 8.7 | 7.2 |
| SceneReVis (Ours) | 1.1 | 1.8 | 8.8 | 8.1 | 8.9 | 8.6 |

Cond 2: Chaotic Only

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 62.8 | 24.1 | 7.4 | 7.0 | 8.7 | 7.7 |
| LayoutVLM | 8.1 | 63.1 | 7.2 | 6.6 | 8.3 | 7.4 |
| SceneReVis (Ours) | 2.7 | 1.3 | 7.7 | 7.9 | 8.5 | 8.0 |

Cond 3: Missing Only

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 27.8 | 30.2 | 7.3 | 7.0 | 7.7 | 7.3 |
| LayoutVLM | 6.5 | 31.0 | 6.5 | 6.4 | 8.7 | 7.2 |
| SceneReVis (Ours) | 1.1 | 1.0 | 7.9 | 7.8 | 8.7 | 8.1 |

Overall Average

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 47.1 | 28.2 | 7.4 | 7.1 | 8.3 | 7.6 |
| LayoutVLM | 8.2 | 44.6 | 6.8 | 6.4 | 8.6 | 7.3 |
| SceneReVis (Ours) | 1.6 | 1.4 | 8.1 | 7.9 | 8.7 | 8.2 |


Qualitative Results

Citation

@article{zhao2026scenerevis,
  title={SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL},
  author={Yang Zhao and Shizhao Sun and Meisheng Zhang and Yingdong Shi and Xubo Yang and Jiang Bian},
  journal={arXiv preprint arXiv:2602.09432},
  year={2026}
}