SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

1Zhejiang University,  2Kuaishou Technology,  3Tsinghua University,  4CUHK

TL;DR: We propose SynCamMaster, an efficient method that lifts pre-trained text-to-video models to open-domain multi-camera video generation from diverse viewpoints.


Demos

Text prompts:
Row 1: A hungry man enthusiastically devouring a steaming plate of spaghetti.
Row 2: A chef is expertly chopping onions in a well-equipped kitchen.
Row 3: A young and beautiful girl dressed in a pink dress, playing a grand piano.
Row 4: An elephant wearing a colorful birthday hat is walking along the sandy beach.

Cameras with 30° Difference in Azimuth


Cameras with a Difference in Distance


Cameras with 15° Difference in Elevation


Cameras with 20° Difference in Azimuth and 10° Difference in Elevation
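
The viewpoint configurations above can be described by a simple spherical look-at parameterization. The sketch below is purely illustrative and not taken from the released dataset or code: the function name, the z-up world convention, the OpenCV-style [R | t] layout, and the chosen elevation/distance values are all assumptions. It places cameras by azimuth, elevation, and distance, and converts a pair of them into the relative extrinsic that a pose-conditioned model consumes.

```python
import numpy as np

def look_at_extrinsic(azimuth_deg, elevation_deg, distance, target=np.zeros(3)):
    """3x4 world-to-camera extrinsic [R | t] for a camera on a sphere around
    `target`, looking at it (z-up world, OpenCV-style camera axes)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    center = target + distance * np.array(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)]
    )
    forward = target - center
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    R = np.stack([right, -up, forward])   # rows: camera x, y, z axes in world coords
    t = -R @ center
    return np.concatenate([R, t[:, None]], axis=1)

# Two cameras differing by 30° in azimuth, as in the first demo row above
# (the 15° elevation and distance of 3 are arbitrary example values).
cam1 = look_at_extrinsic(0.0, 15.0, 3.0)
cam2 = look_at_extrinsic(30.0, 15.0, 3.0)

# Relative pose of cam2 expressed in cam1's frame: [R2 R1^T | t2 - R2 R1^T t1].
R1, t1 = cam1[:, :3], cam1[:, 3]
R2, t2 = cam2[:, :3], cam2[:, 3]
relative = np.concatenate([R2 @ R1.T, (t2 - R2 @ R1.T @ t1)[:, None]], axis=1)
```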




More Results

Abstract

Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we also design a progressive training scheme that leverages multi-camera images and monocular videos as a supplement to Unreal Engine-rendered multi-camera videos. This comprehensive approach significantly benefits our model. Furthermore, our method enables intriguing extensions, such as re-rendering a video from multiple novel viewpoints.

Method

To synthesize multi-camera synchronized videos on top of a pre-trained text-to-video model, we introduce two new components: a camera encoder, which projects the relative camera extrinsic parameters into an embedding space, and an inter-view synchronization module, plugged into each Transformer block, which modulates inter-view features under the guidance of the inter-camera relationships. Only the new components are trainable, while the pre-trained text-to-video model remains frozen.
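Below is a minimal PyTorch sketch of these two components, assuming per-view 3×4 relative extrinsics and spatial features of shape (batch, views, tokens, dim). The class names, the MLP encoder, the cross-view attention design, the zero-initialized output projection, and all shapes are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    """Projects per-view relative camera extrinsics (3x4, flattened to 12
    values) into the hidden dimension of the video Transformer."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: (batch, views, 3, 4), relative to a reference camera
        b, v = extrinsics.shape[:2]
        return self.mlp(extrinsics.reshape(b, v, 12))       # (batch, views, dim)

class InterViewSync(nn.Module):
    """Cross-view attention inserted alongside each pre-trained Transformer
    block; attends across the view axis so features of all viewpoints are
    modulated under the guidance of the camera embeddings."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # zero-initialized projection so the frozen backbone's behavior is
        # unchanged at the start of training
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, x: torch.Tensor, cam_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim); cam_emb: (batch, views, dim)
        b, v, n, d = x.shape
        h = self.norm(x + cam_emb[:, :, None, :])            # inject camera pose
        h = h.permute(0, 2, 1, 3).reshape(b * n, v, d)       # attend across views
        h, _ = self.attn(h, h, h)
        h = h.reshape(b, n, v, d).permute(0, 2, 1, 3)
        return x + self.proj_out(h)                          # residual update

# usage sketch: one sync module per Transformer block, base model frozen
cam_enc = CameraEncoder(dim=1024)
sync = InterViewSync(dim=1024)
x = torch.randn(2, 4, 256, 1024)   # 2 samples, 4 views, 256 spatial tokens
pose = torch.randn(2, 4, 3, 4)     # per-view relative extrinsics
x = sync(x, cam_enc(pose))
```

In training, only these added modules would receive gradients, with the base text-to-video parameters kept at requires_grad=False, matching the frozen-backbone setup described above.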


Comparisons

Awesome Related Works:
3DTrajMaster: control multiple entity motions in 3D space (6DoF) for text-to-video generation.
StyleMaster: enable artistic video generation and translation with a reference style image.
GCD: synthesize large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
CVD: multi-view video generation with multiple camera trajectories.
SV4D: multi-view consistent dynamic 3D content generation.

Reference:
[1] Blattmann, Andreas, et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv preprint arXiv:2311.15127 (2023).
[2] Wang, Zhouxia, et al. "MotionCtrl: A Unified and Flexible Motion Controller for Video Generation." ACM SIGGRAPH 2024 Conference Papers, 2024.
[3] He, Hao, et al. "CameraCtrl: Enabling Camera Control for Text-to-Video Generation." arXiv preprint arXiv:2404.02101 (2024).
[4] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

Acknowledgments:
We thank Jinwen Cao, Yisong Guo, Haowen Ji, Jichao Wang, and Yi Wang from Kuaishou Technology for their invaluable help in constructing the SynCamVideo-Dataset. We thank Guanjun Wu and Jiangnan Ye for their help in running 4DGS.