We propose SemanticGen, a novel framework that generates videos in a high-level semantic space before refining details in the VAE latent space. Our key insight is that, given the substantial redundancy inherent in videos, generation should first occur in a compact semantic space for global planning, with high-frequency details added afterwards, rather than directly modeling vast collections of low-level video tokens.
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution that addresses these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space converges faster than generation in the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
(a) We train a semantic generator to fit the distribution of compressed semantic representations produced by off-the-shelf semantic encoders. (b) We optimize a latent diffusion model to denoise video VAE latents conditioned on their semantic representations.
(c) During inference, we combine the semantic generator and the VAE latent generator to achieve high-quality T2V generation.
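For concreteness, the snippet below sketches the two-stage inference loop under simplifying assumptions: `TinyDenoiser` and the Euler sampler are illustrative stand-ins for the actual diffusion models and sampler, text conditioning is omitted, and the 8- and 16-channel dimensions are placeholders rather than the released configuration.

```python
# Minimal sketch of two-stage inference (illustrative stand-ins, not the released code).
# Stage 1: sample compact semantic features; Stage 2: sample VAE latents
# conditioned on them, which would then be decoded to pixels by the video VAE.
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion model; predicts an update from (x_t, t, cond)."""

    def __init__(self, dim, cond_dim=0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x, t, cond=None):
        feats = [x, t.expand(x.shape[0], 1)]
        if cond is not None:
            feats.append(cond)
        return self.net(torch.cat(feats, dim=-1))


@torch.no_grad()
def euler_sample(denoiser, shape, cond=None, steps=50):
    """Plain Euler sampler over a rectified-flow-style schedule (illustrative)."""
    x = torch.randn(shape)
    for i in range(steps):
        t = torch.tensor([1.0 - i / steps])
        v = denoiser(x, t, cond)  # predicted update direction at time t
        x = x - v / steps         # one Euler step toward t = 0
    return x


# Hypothetical dimensions: 8-channel compressed semantic tokens, 16-channel VAE latents.
semantic_gen = TinyDenoiser(dim=8)
latent_gen = TinyDenoiser(dim=16, cond_dim=8)

# Stage 1: generate compact semantic features (the global layout of the video).
semantic = euler_sample(semantic_gen, shape=(1024, 8))

# Stage 2: generate VAE latents conditioned on the semantic plan.
latents = euler_sample(latent_gen, shape=(1024, 16), cond=semantic)
```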
To verify that the compressed semantic representation captures the video’s high-level semantics and effectively guides generation, we extract semantic features from a reference video and inject them into the VAE latent generator. The generated video, shown below, preserves the spatial layout and motion patterns of the reference video while differing in fine details. This demonstrates that the compressed semantic representations encode high-level information, such as structure and dynamics, while discarding low-level attributes like texture and color.
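The snippet below sketches this injection check. The convolutional "vision tower", the single linear compressor, and all tensor shapes are stand-ins for the actual frozen semantic encoder and compression MLP described in the next section.

```python
# Sketch of the semantic-injection check (stand-in modules, not the released code).
# Instead of sampling semantic features in stage 1, we encode a reference video,
# compress its features, and feed them to the stage-2 VAE latent generator.
import torch
import torch.nn as nn

# Stand-in frozen vision tower: patchifies the clip into 2048-dim tokens.
frozen_encoder = nn.Conv3d(3, 2048, kernel_size=(2, 14, 14), stride=(2, 14, 14))
compress = nn.Linear(2048, 8)  # lightweight compression to 8 channels (illustrative)

reference = torch.randn(1, 3, 8, 112, 112)  # (B, C, T, H, W) reference clip
with torch.no_grad():
    feats = frozen_encoder(reference)        # (1, 2048, 4, 8, 8) patch features
    feats = feats.flatten(2).transpose(1, 2) # (1, 256, 2048) token sequence
    semantic = compress(feats)               # (1, 256, 8) compressed semantics

# `semantic` now replaces the stage-1 sample: conditioning the VAE latent
# generator on it reproduces the reference video's layout and motion while
# fine texture and color are resampled.
```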
We propose to compress the semantic representation space using a lightweight MLP for efficient training. We use the vision tower of Qwen2.5VL-3B-Instruct [1] as the semantic encoder, whose vanilla semantic representation has a dimension of 2048. We first train three VAE latent generators, each for 10K steps, using: (1) no MLP, (2) an MLP with 64 output channels, and (3) an MLP with 8 output channels. Based on these models, we further train three corresponding semantic generation models for 50K steps. During inference, we first use the semantic generator to produce the video semantic representation, which then serves as a conditioning input for the VAE latent generator to map it into the VAE latent space. As shown below, the visual quality of the generated videos improves as the dimensionality decreases, with fewer broken frames and artifacts. This indicates that compressing the pre-trained semantic representation space to a lower dimension accelerates the convergence of the semantic generator.
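The compression step can be pictured with the sketch below. Only the 2048-dimensional input and the 64- and 8-channel outputs come from the setup above; the single-hidden-layer design and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

SEM_DIM = 2048  # channel dim of the Qwen2.5VL-3B-Instruct vision-tower features


class SemanticCompressor(nn.Module):
    """Lightweight MLP mapping 2048-dim semantic tokens to a compact space."""

    def __init__(self, out_channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(SEM_DIM, 512), nn.SiLU(), nn.Linear(512, out_channels)
        )

    def forward(self, tokens):    # tokens: (B, N, 2048)
        return self.proj(tokens)  # (B, N, out_channels)


# The three ablation settings: raw 2048-dim features, 64 channels, 8 channels.
variants = {
    "no_mlp": nn.Identity(),
    "mlp_64": SemanticCompressor(64),
    "mlp_8": SemanticCompressor(8),
}

tokens = torch.randn(2, 256, SEM_DIM)  # dummy batch of semantic tokens
for name, compressor in variants.items():
    # This is the target space the semantic generator must learn to fit.
    print(name, compressor(tokens).shape)
```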
In this paper, we propose to first learn compact semantic representations and then map them into the VAE latent space. A natural question arises: does leveraging semantic representations truly benefit video generation? In other words, what happens if we adopt the same two-stage pipeline but learn compact VAE latents instead of semantic representations [2]? To investigate this, we keep the SemanticGen framework unchanged except for replacing the semantic encoder with a VAE encoder, training a first-stage generator to model compressed VAE latents rather than semantic features. Both first-stage generators, the semantic variant and the compressed-VAE variant, are trained from scratch for 10K steps, and the results are shown below. We observe that modeling in the VAE space leads to significantly slower convergence: its generated results contain only coarse color patches. In contrast, the model trained in the semantic space already produces reasonable videos under the same number of training steps. This demonstrates that the proposed SemanticGen framework effectively accelerates the convergence of diffusion-based video generation models.
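The sketch below illustrates how this ablation isolates the choice of feature space: the first-stage generator and its training step are held fixed, and only the source of the compressed targets switches between a (stand-in) semantic encoder and a (stand-in) VAE encoder. The flow-matching-style objective is an assumption for illustration, not necessarily the exact training loss used here.

```python
import torch
import torch.nn as nn

# Stand-ins: a frozen semantic encoder vs. a frozen VAE encoder, each followed
# by a compression layer so the first-stage generator sees 8-channel targets
# of identical shape in both ablation settings.
semantic_path = nn.Linear(2048, 8)  # compressed semantic features (stand-in)
vae_path = nn.Linear(16, 8)         # compressed VAE latents (stand-in)

denoiser = nn.Sequential(nn.Linear(8 + 1, 256), nn.SiLU(), nn.Linear(256, 8))
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)


def training_step(target_tokens):
    """One illustrative flow-matching step on compressed targets of shape (B, N, 8)."""
    noise = torch.randn_like(target_tokens)
    t = torch.rand(target_tokens.shape[0], 1, 1)
    x_t = (1 - t) * target_tokens + t * noise  # linear interpolation path
    velocity = noise - target_tokens           # flow-matching regression target
    pred = denoiser(torch.cat([x_t, t.expand(-1, x_t.shape[1], 1)], dim=-1))
    loss = nn.functional.mse_loss(pred, velocity)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Only this switch differs between the two ablation settings; the generator,
# optimizer, data, and number of steps stay identical.
use_semantic = True
features = torch.randn(2, 256, 2048) if use_semantic else torch.randn(2, 256, 16)
targets = (semantic_path if use_semantic else vae_path)(features).detach()
print(training_step(targets))
```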
We compare the proposed SemanticGen with state-of-the-art T2V methods. For short video generation, we use Wan2.1-T2V-14B [3] and HunyuanVideo [4] as baselines. For long video generation, we use the open-source models SkyReels-V2 [5], Self-Forcing [6], and LongLive [7] as baselines. To provide a reliable assessment of the proposed paradigm, we also include strong baselines that continue training the base model with the standard diffusion loss and no semantic modeling, keeping the data and the number of training steps identical; these are denoted Base-CT and Base-Swin-CT.