AnimateScene: Camera-controllable Animation in Any Scene

Abstract

3D scene reconstruction and 4D human animation have seen rapid progress and broad adoption in recent years. However, seamlessly integrating reconstructed scenes with 4D human animation and producing visually engaging results remains challenging. One key difficulty lies in placing the human at the correct location and scale within the scene while avoiding unrealistic interpenetration. Another challenge is that the human and the background may exhibit different lighting and style, leading to unrealistic composites. In addition, appealing character motion videos are often accompanied by camera movements, which means that the viewpoints need to be reconstructed along a specified trajectory. We present AnimateScene, which addresses the above issues in a unified framework. First, we design an accurate placement module that automatically determines a plausible 3D position for the human and prevents any interpenetration within the scene during motion. Second, we propose a training‑free style alignment method that adapts the 4D human representation to match the background’s lighting and style, achieving coherent visual integration. Finally, we design a joint post‑reconstruction method for both the 4D human and the 3D scene that allows camera trajectories to be inserted, enabling the final rendered video to feature visually appealing camera movements. Extensive experiments show that AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence across various camera and action combinations.

Method


AnimateScene takes as input a single scene image, a single human image, an accompanying motion clip, and a user-defined camera path. It first aligns the human’s appearance with the scene. Next, it reconstructs a 4D Gaussian avatar alongside a sparse 3D Gaussian scene. Depth cues are then used to lift the avatar to a collision-free 3D location and scale. The system finally refines the fused human–scene field along the camera trajectory, inpainting any newly exposed regions. The result is a video in which actor, environment, and viewpoint move in seamless geometric and stylistic harmony.
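
To make the depth-based placement step more concrete, here is a minimal sketch of one way such lifting could work. It is not the AnimateScene implementation. It assumes a pinhole camera with known intrinsics, a metric depth map of the scene image, and a user-chosen foot pixel; the function names (`backproject`, `place_human`, `penetrates`), the target height, and the tolerance are all hypothetical. The collision test is a simple visibility proxy: a human point counts as penetrating if it lies behind the scene surface seen at its pixel.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a given depth to a camera-space 3D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def place_human(depth_map, foot_uv, avatar_height, target_height, intrinsics):
    """Return a 3D foot position and a uniform scale for the avatar.

    Assumption: depth_map[v][u] is metric depth, and the avatar should end up
    target_height tall (same units as the depth map).
    """
    fx, fy, cx, cy = intrinsics
    u, v = foot_uv
    z = depth_map[v][u]  # scene depth at the chosen foot pixel
    position = backproject(u, v, z, fx, fy, cx, cy)
    scale = target_height / avatar_height
    return position, scale


def penetrates(points, depth_map, intrinsics, tol=0.05):
    """True if any camera-space point lies behind the visible scene surface."""
    fx, fy, cx, cy = intrinsics
    h, w = len(depth_map), len(depth_map[0])
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera, so not comparable against the depth map
        u = round(x * fx / z + cx)  # project back to pixel coordinates
        v = round(y * fy / z + cy)
        if 0 <= u < w and 0 <= v < h and z > depth_map[v][u] + tol:
            return True
    return False


# Toy scene: a flat wall 5 units in front of a 100x100 camera.
depth = [[5.0] * 100 for _ in range(100)]
K = (100.0, 100.0, 50.0, 50.0)  # hypothetical fx, fy, cx, cy
position, scale = place_human(depth, (60, 80), avatar_height=2.0,
                              target_height=1.7, intrinsics=K)
```

In this sketch, a candidate placement would be rejected (or nudged toward the camera) whenever `penetrates` returns `True` for the posed avatar's points in any frame of the motion clip. A real system would likely test against the reconstructed 3D scene rather than a single depth map.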

Qualitative Visualizations

More qualitative visualizations from AnimateScene are shown below. The first column is the input human image, the second column is the input scene image, and the third column is the output rendered video.