Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints with 6-DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we also design a progressive training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos, which substantially improves the model's performance. Furthermore, our method enables intriguing extensions, such as re-rendering a video from multiple novel viewpoints.
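The abstract only names the progressive training scheme; as a rough, purely illustrative sketch of how a mixture over Unreal Engine-rendered multi-camera videos, multi-camera images, and monocular videos might be scheduled across stages (all stage names and sampling weights below are our assumptions, not the authors' configuration):

```python
# Hypothetical sketch of progressive data mixing: later stages blend in
# multi-camera images and monocular videos to supplement the scarce
# UE-rendered multi-camera videos. Weights are made up for illustration.
import random

STAGE_MIX = {
    "stage1": [("ue_multicam_video", 1.0)],
    "stage2": [("ue_multicam_video", 0.6), ("multicam_image", 0.4)],
    "stage3": [("ue_multicam_video", 0.5), ("multicam_image", 0.3),
               ("monocular_video", 0.2)],
}

def sample_source(stage: str) -> str:
    """Pick the data source for the next batch according to the stage mix."""
    sources, weights = zip(*STAGE_MIX[stage])
    return random.choices(sources, weights=weights, k=1)[0]

for step in range(5):
    print(step, sample_source("stage3"))
```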
To synthesize multi-camera synchronized videos from the pre-trained text-to-video model, two new components are introduced: a camera encoder, which projects the relative camera extrinsic parameters into an embedding space, and an inter-view synchronization module, plugged into each Transformer block, which modulates inter-view features under the guidance of the inter-camera relationships. Only the new components are trainable; the pre-trained text-to-video model remains frozen.
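A minimal PyTorch sketch of these two components, assuming a Transformer backbone with features of shape [batch, views, tokens, dim]; the pose parameterization (flattened 3x4 relative extrinsics), layer sizes, and the cross-view attention design are our assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    """Projects relative camera extrinsics (3x4, flattened to 12-d) into feature space."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: [batch, views, 3, 4], expressed relative to a reference view
        return self.mlp(extrinsics.flatten(-2))          # [batch, views, dim]

class InterViewSync(nn.Module):
    """Attends across views at each spatial token, modulated by camera embeddings."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        # x: [batch, views, tokens, dim]; cam: [batch, views, dim]
        b, v, t, d = x.shape
        h = x + cam[:, :, None, :]                       # inject camera pose per view
        h = h.transpose(1, 2).reshape(b * t, v, d)       # attend over the view axis
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h))
        out = out.reshape(b, t, v, d).transpose(1, 2)
        return x + out                                   # residual; backbone stays frozen

x = torch.randn(2, 4, 16, 64)                            # 2 samples, 4 views
cam = torch.randn(2, 4, 3, 4)
sync = InterViewSync(64)
print(sync(x, CameraEncoder(64)(cam)).shape)             # torch.Size([2, 4, 16, 64])
```

The residual connection lets the module act as a plug-and-play addition: with its output near zero it reduces to the frozen backbone, so training only needs to learn the cross-view correction.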
To the best of our knowledge, multi-view real-world video generation has not been explored in previous work. We therefore establish baseline approaches by first extracting the first frame of each view generated by SynCamMaster and then feeding it into 1) an image-to-video (I2V) generation method, SVD-XT [1], and 2) state-of-the-art single-video camera-control approaches, MotionCtrl [2] (based on SVD) and CameraCtrl [3] (based on SVD-XT). Furthermore, we train an additional I2V generation model on the same T2V model used by SynCamMaster, denoted 'I2V-Ours'.
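A hedged sketch of this baseline protocol, where `syncam_generate` and `i2v_generate` are placeholder callables standing in for SynCamMaster and an I2V model (not real APIs):

```python
import torch

def build_baseline_videos(syncam_generate, i2v_generate, prompt: str, num_views: int = 4):
    # videos: [views, frames, channels, height, width]
    videos = syncam_generate(prompt, num_views=num_views)
    first_frames = videos[:, 0]                  # one reference image per view
    # Each baseline animates its reference frame independently, so it has
    # no mechanism to keep the views synchronized over time.
    return torch.stack([i2v_generate(img) for img in first_frames])
```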
Figure: Qualitative results shown side by side for Viewpoint 1 and Viewpoint 2, for the prompts "An elephant wearing a colorful birthday hat is walking along the sandy beach." and "A blue bus drives across the iconic Tower Bridge in London."
We use SynCamMaster to synthesize the reference images for the baseline methods, since they cannot generate videos from multiple viewpoints on their own.