SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Yang Zhao1,2 Shizhao Sun2 Meisheng Zhang2,3 Yingdong Shi2,4 Xubo Yang1 Jiang Bian2
1Shanghai Jiao Tong University    2Microsoft Research Asia    3Peking University    4ShanghaiTech University

* Work done during an internship at Microsoft Research Asia.   † Corresponding authors.

Teaser figure comparing generation paradigms

Why Self-Reflection Matters

(a) One-Pass Generation lacks intermediate reasoning, leading to severe physical violations. (b) Post-Processing Generation tends to become trapped in local optima. (c) SceneReVis employs a self-reflection paradigm to ensure physical plausibility and aesthetic coherence.

Abstract

Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
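The "diagnose-and-act" loop can be sketched as follows. This is a minimal illustration assuming a 2D footprint representation, where each object is (x, y, width, depth) centered at (x, y); the `Scene`, `diagnose`, and `revise` names are hypothetical stand-ins for the framework's vision-grounded components, not the released SceneReVis API.

```python
# Minimal sketch of the "diagnose-and-act" self-reflection loop. All names
# here are illustrative, not the released SceneReVis API.
from dataclasses import dataclass


@dataclass
class Scene:
    objects: dict  # name -> (x, y, width, depth), axis-aligned footprint


def diagnose(scene):
    """Detect pairwise footprint overlaps (a stand-in for the framework's
    vision-grounded conflict detection)."""
    conflicts = []
    items = list(scene.objects.items())
    for i, (name_a, (ax, ay, aw, ad)) in enumerate(items):
        for name_b, (bx, by, bw, bd) in items[i + 1:]:
            if abs(ax - bx) < (aw + bw) / 2 and abs(ay - by) < (ad + bd) / 2:
                conflicts.append((name_a, name_b))
    return conflicts


def revise(scene, conflicts):
    """Toy 'act' step: shift the second object of each conflicting pair
    sideways by its own width."""
    for _, name_b in conflicts:
        x, y, w, d = scene.objects[name_b]
        scene.objects[name_b] = (x + w, y, w, d)


def self_reflect(scene, max_turns=5):
    """Iterate diagnose-and-act until no conflicts remain or the turn
    budget is exhausted."""
    for _ in range(max_turns):
        conflicts = diagnose(scene)
        if not conflicts:
            break
        revise(scene, conflicts)
    return scene
```

For example, a bed at (0, 0) and a desk at (1, 0), each with a 2 × 2 footprint, overlap on the first diagnosis; one revision pass separates them and the loop terminates.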

Method

SceneReVis training pipeline overview

Training Pipeline Overview

The SceneReVis learning framework consists of: (a) data construction via reverse engineering, (b) a cold start via Supervised Fine-Tuning, and (c) Agentic RL with GRPO for self-reflective reasoning.
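The GRPO signal in the Agentic RL stage can be sketched as follows: for each prompt, a group of candidate scenes is sampled and scored, and each rollout's advantage is its reward normalized by the group's mean and standard deviation, which removes the need for a learned critic. The `scene_reward` weighting below is an illustrative assumption, not the paper's reward design.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its group (GRPO's critic-free baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def scene_reward(cnr, obr, visual_avg, w_phys=0.5):
    """Illustrative scalar reward for one generated scene: reward
    visual-semantic quality, penalize collision (cnr) and out-of-boundary
    (obr) rates. The weighting is an assumption, not the paper's."""
    return visual_avg - w_phys * (cnr + obr)
```

Rollouts that score above their group's mean receive positive advantages, pushing the policy toward conflict-free layouts within each group of samples.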

Quantitative Results

Bedroom & Living Room Comparison

Quantitative evaluation on standard room types. SceneReVis significantly reduces collision and out-of-boundary rates while achieving the highest visual quality scores.

Bedroom

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 46.2 | 33.8 | 8.2 | 6.2 | 6.9 | 7.1 |
| Respace | 14.7 | 41.9 | 6.8 | 6.1 | 7.0 | 6.6 |
| LayoutGPT | 34.3 | 45.9 | 7.8 | 5.6 | 7.6 | 7.0 |
| I-Design | 15.5 | 18.2 | 8.3 | 6.6 | 8.1 | 7.7 |
| Holodeck | 15.4 | 22.2 | 7.1 | 8.1 | 8.6 | 7.9 |
| LayoutVLM | 11.5 | 44.9 | 7.0 | 7.6 | 8.0 | 7.5 |
| SceneReVis (Ours) | 2.8 | 4.6 | 9.3 | 8.0 | 8.9 | 8.7 |

Living Room

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 28.0 | 41.5 | 7.3 | 7.5 | 8.2 | 7.7 |
| Respace | 11.5 | 40.3 | 6.9 | 6.6 | 7.3 | 6.9 |
| LayoutGPT | 15.0 | 35.7 | 7.5 | 7.3 | 8.7 | 7.8 |
| I-Design | 17.8 | 14.1 | 8.3 | 6.5 | 8.6 | 7.8 |
| Holodeck | 10.0 | 3.1 | 5.9 | 8.2 | 8.8 | 7.6 |
| LayoutVLM | 14.3 | 28.7 | 6.8 | 7.9 | 8.7 | 7.8 |
| SceneReVis (Ours) | 1.2 | 4.4 | 9.5 | 8.0 | 8.8 | 8.8 |

Avg. (Bedroom + Living Room)

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 37.1 | 37.7 | 7.8 | 6.9 | 7.6 | 7.4 |
| Respace | 13.1 | 41.1 | 6.9 | 6.4 | 7.2 | 6.8 |
| LayoutGPT | 24.7 | 40.8 | 7.7 | 6.5 | 8.2 | 7.5 |
| I-Design | 16.7 | 16.2 | 8.3 | 6.6 | 8.4 | 7.8 |
| Holodeck | 12.7 | 12.7 | 6.5 | 8.2 | 8.7 | 7.8 |
| LayoutVLM | 12.9 | 36.8 | 6.9 | 7.8 | 8.4 | 7.7 |
| SceneReVis (Ours) | 2.0 | 4.5 | 9.4 | 8.0 | 8.9 | 8.8 |

Physical metrics (lower is better): OBR = Out-of-Boundary Rate, CNR = Collision Rate. Visual-Semantic metrics (higher is better): Ra. = Rationality, Spa. = Spatial, Ac. = Accuracy, Avg. = their average.
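As a rough sketch of how such physical metrics can be computed from 2D footprints (the paper's exact protocol, e.g. oriented or full 3D boxes, may differ), each object below is (x, y, width, depth) centered at (x, y) in a rectangular room:

```python
# Hedged sketch of OBR/CNR computation over axis-aligned 2D footprints;
# illustrative only, not the paper's evaluation code.

def aabb(obj):
    """Axis-aligned footprint (xmin, ymin, xmax, ymax) from center + size."""
    x, y, w, d = obj
    return (x - w / 2, y - d / 2, x + w / 2, y + d / 2)


def overlaps(a, b):
    """True if the two footprints intersect with positive area."""
    ax0, ay0, ax1, ay1 = aabb(a)
    bx0, by0, bx1, by1 = aabb(b)
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def collision_rate(objects):
    """CNR: percentage of objects involved in at least one overlap."""
    hit = set()
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if overlaps(objects[i], objects[j]):
                hit.update((i, j))
    return 100.0 * len(hit) / len(objects) if objects else 0.0


def out_of_boundary_rate(objects, room_w, room_d):
    """OBR: percentage of objects whose footprint leaves the room
    rectangle [0, room_w] x [0, room_d]."""
    out = 0
    for o in objects:
        x0, y0, x1, y1 = aabb(o)
        if x0 < 0 or y0 < 0 or x1 > room_w or y1 > room_d:
            out += 1
    return 100.0 * out / len(objects) if objects else 0.0
```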

Long-tail Generalization: Dining Room & Study Room

Evaluation on long-tail room types unseen during training, demonstrating SceneReVis’s strong generalization capability.

Dining Room

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| LayoutGPT | 6.2 | 26.2 | 6.9 | 7.7 | 8.8 | 7.8 |
| I-Design | 17.6 | 0.7 | 7.8 | 5.7 | 8.8 | 7.4 |
| Holodeck | 11.9 | 3.3 | 6.4 | 8.0 | 8.5 | 7.6 |
| LayoutVLM | 19.2 | 27.0 | 6.5 | 7.8 | 8.8 | 7.7 |
| SceneReVis (Ours) | 0.1 | 2.8 | 8.1 | 8.3 | 8.6 | 8.3 |

Study Room

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| LayoutGPT | 16.6 | 52.1 | 8.5 | 7.2 | 8.3 | 8.0 |
| I-Design | 16.3 | 4.5 | 8.2 | 6.9 | 8.9 | 8.0 |
| Holodeck | 13.7 | 5.1 | 6.7 | 8.4 | 8.9 | 8.0 |
| LayoutVLM | 12.3 | 36.6 | 6.9 | 7.6 | 8.6 | 7.7 |
| SceneReVis (Ours) | 0.5 | 2.1 | 8.2 | 8.3 | 8.9 | 8.5 |

Avg. (Dining + Study)

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| LayoutGPT | 11.4 | 39.2 | 7.7 | 7.5 | 8.6 | 7.9 |
| I-Design | 17.0 | 2.6 | 8.0 | 6.3 | 8.9 | 7.7 |
| Holodeck | 12.8 | 4.2 | 6.6 | 8.2 | 8.7 | 7.8 |
| LayoutVLM | 15.8 | 31.8 | 6.7 | 7.7 | 8.7 | 7.7 |
| SceneReVis (Ours) | 0.3 | 2.5 | 8.2 | 8.3 | 8.8 | 8.4 |


Goal-Oriented Scene Optimization

SceneReVis excels at fixing chaotic layouts and completing missing furniture, achieving the best overall quality across all three conditions.

Cond 1: Chaotic & Missing

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 50.8 | 30.3 | 7.5 | 7.3 | 8.5 | 7.8 |
| LayoutVLM | 10.0 | 39.7 | 6.7 | 6.3 | 8.7 | 7.2 |
| SceneReVis (Ours) | 1.1 | 1.8 | 8.8 | 8.1 | 8.9 | 8.6 |

Cond 2: Chaotic Only

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 62.8 | 24.1 | 7.4 | 7.0 | 8.7 | 7.7 |
| LayoutVLM | 8.1 | 63.1 | 7.2 | 6.6 | 8.3 | 7.4 |
| SceneReVis (Ours) | 2.7 | 1.3 | 7.7 | 7.9 | 8.5 | 8.0 |

Cond 3: Missing Only

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 27.8 | 30.2 | 7.3 | 7.0 | 7.7 | 7.3 |
| LayoutVLM | 6.5 | 31.0 | 6.5 | 6.4 | 8.7 | 7.2 |
| SceneReVis (Ours) | 1.1 | 1.0 | 7.9 | 7.8 | 8.7 | 8.1 |

Overall Average

| Method | OBR↓ | CNR↓ | Ra.↑ | Spa.↑ | Ac.↑ | Avg.↑ |
|---|---|---|---|---|---|---|
| DiffuScene | 47.1 | 28.2 | 7.4 | 7.1 | 8.3 | 7.6 |
| LayoutVLM | 8.2 | 44.6 | 6.8 | 6.4 | 8.6 | 7.3 |
| SceneReVis (Ours) | 1.6 | 1.4 | 8.1 | 7.9 | 8.7 | 8.2 |


Qualitative Results

Citation

@article{zhao2026scenerevis,
  title={SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL},
  author={Yang Zhao and Shizhao Sun and Meisheng Zhang and Yingdong Shi and Xubo Yang and Jiang Bian},
  journal={arXiv preprint arXiv:2602.09432},
  year={2026}
}