We propose SemanticGen, a novel framework that generates videos in a high-level semantic space before refining details in the VAE latent space. Our key insight is that, given the substantial redundancy inherent in videos, generation should first occur in a compact semantic space for global planning, with high-frequency details added afterwards, rather than directly modeling vast collections of low-level video tokens.
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution that addresses these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space converges faster than generation in the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
(a) We train a semantic generator to fit the distribution of compressed semantic representations produced by off-the-shelf semantic encoders. (b) We optimize a latent diffusion model to denoise video VAE latents conditioned on their semantic representations.
(c) During inference, we combine the semantic generator and the VAE latent generator to achieve high-quality T2V generation.
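For concreteness, the snippet below sketches the two-stage inference loop under simplifying assumptions: `TinyDenoiser` and the Euler sampler are illustrative stand-ins for the actual diffusion models and sampler, text conditioning is omitted, and the 8- and 16-channel dimensions are placeholders rather than the released configuration.

```python
# Minimal sketch of two-stage inference (illustrative stand-ins, not the released code).
# Stage 1: sample compact semantic features; Stage 2: sample VAE latents
# conditioned on them, which would then be decoded to pixels by the video VAE.
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion model; predicts an update from (x_t, t, cond)."""

    def __init__(self, dim, cond_dim=0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x, t, cond=None):
        feats = [x, t.expand(x.shape[0], 1)]
        if cond is not None:
            feats.append(cond)
        return self.net(torch.cat(feats, dim=-1))


@torch.no_grad()
def euler_sample(denoiser, shape, cond=None, steps=50):
    """Plain Euler sampler over a rectified-flow-style schedule (illustrative)."""
    x = torch.randn(shape)
    for i in range(steps):
        t = torch.tensor([1.0 - i / steps])
        v = denoiser(x, t, cond)  # predicted update direction at time t
        x = x - v / steps         # one Euler step toward t = 0
    return x


# Hypothetical dimensions: 8-channel compressed semantic tokens, 16-channel VAE latents.
semantic_gen = TinyDenoiser(dim=8)
latent_gen = TinyDenoiser(dim=16, cond_dim=8)

# Stage 1: generate compact semantic features (the global layout of the video).
semantic = euler_sample(semantic_gen, shape=(1024, 8))

# Stage 2: generate VAE latents conditioned on the semantic plan.
latents = euler_sample(latent_gen, shape=(1024, 16), cond=semantic)
```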
To verify that the compressed semantic representation captures the video’s high-level semantics and effectively guides generation, we extract semantic features from a reference video and inject them into the VAE latent generator. The generated video, shown below, preserves the spatial layout and motion patterns of the reference video while differing in fine details. This demonstrates that the compressed semantic representations encode high-level information, such as structure and dynamics, while discarding low-level attributes like texture and color.
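The snippet below sketches this injection check. The convolutional "vision tower", the single linear compressor, and all tensor shapes are stand-ins for the actual frozen semantic encoder and compression MLP described in the next section.

```python
# Sketch of the semantic-injection check (stand-in modules, not the released code).
# Instead of sampling semantic features in stage 1, we encode a reference video,
# compress its features, and feed them to the stage-2 VAE latent generator.
import torch
import torch.nn as nn

# Stand-in frozen vision tower: patchifies the clip into 2048-dim tokens.
frozen_encoder = nn.Conv3d(3, 2048, kernel_size=(2, 14, 14), stride=(2, 14, 14))
compress = nn.Linear(2048, 8)  # lightweight compression to 8 channels (illustrative)

reference = torch.randn(1, 3, 8, 112, 112)  # (B, C, T, H, W) reference clip
with torch.no_grad():
    feats = frozen_encoder(reference)        # (1, 2048, 4, 8, 8) patch features
    feats = feats.flatten(2).transpose(1, 2) # (1, 256, 2048) token sequence
    semantic = compress(feats)               # (1, 256, 8) compressed semantics

# `semantic` now replaces the stage-1 sample: conditioning the VAE latent
# generator on it reproduces the reference video's layout and motion while
# fine texture and color are resampled.
```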
We propose to compress the semantic representation space using a lightweight MLP for efficient training. We use the vision tower of Qwen2.5VL-3B-Instruct [1] as the semantic encoder, whose vanilla semantic representation has a dimension of 2048. We first train three VAE latent generators, each for 10K steps, using: (1) no MLP, (2) an MLP with 64 output channels, and (3) an MLP with 8 output channels. Based on these models, we further train three corresponding semantic generation models for 50K steps. During inference, we first use the semantic generator to produce the video semantic representation, which then serves as a conditioning input for the VAE latent generator to map it into the VAE latent space. As shown below, the visual quality of the generated videos improves as the dimensionality decreases, with fewer broken frames and artifacts. This indicates that compressing the pre-trained semantic representation space to a lower dimension accelerates the convergence of the semantic generator.
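The compression step can be pictured with the sketch below. Only the 2048-dimensional input and the 64- and 8-channel outputs come from the setup above; the single-hidden-layer design and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

SEM_DIM = 2048  # channel dim of the Qwen2.5VL-3B-Instruct vision-tower features


class SemanticCompressor(nn.Module):
    """Lightweight MLP mapping 2048-dim semantic tokens to a compact space."""

    def __init__(self, out_channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(SEM_DIM, 512), nn.SiLU(), nn.Linear(512, out_channels)
        )

    def forward(self, tokens):    # tokens: (B, N, 2048)
        return self.proj(tokens)  # (B, N, out_channels)


# The three ablation settings: raw 2048-dim features, 64 channels, 8 channels.
variants = {
    "no_mlp": nn.Identity(),
    "mlp_64": SemanticCompressor(64),
    "mlp_8": SemanticCompressor(8),
}

tokens = torch.randn(2, 256, SEM_DIM)  # dummy batch of semantic tokens
for name, compressor in variants.items():
    # This is the target space the semantic generator must learn to fit.
    print(name, compressor(tokens).shape)
```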
In this paper, we propose to first learn compact semantic representations and then map them into the VAE latent space. A natural question arises: does leveraging semantic representations truly benefit video generation? In other words, what happens if we adopt the same two-stage pipeline but learn compact VAE latents instead of semantic representations [2]? To investigate this, we keep the SemanticGen framework unchanged except for replacing the semantic encoder with a VAE encoder, training a first-stage generator to model compressed VAE latents rather than semantic features. Both first-stage generators, the semantic variant and the compressed-VAE variant, are trained from scratch for 10K steps, and the results are shown below. We observe that modeling in the VAE space leads to significantly slower convergence: its generated results contain only coarse color patches. In contrast, the model trained in the semantic space already produces reasonable videos under the same number of training steps. This demonstrates that the proposed SemanticGen framework effectively accelerates the convergence of diffusion-based video generation models.
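The sketch below illustrates how this ablation isolates the choice of feature space: the first-stage generator and its training step are held fixed, and only the source of the compressed targets switches between a (stand-in) semantic encoder and a (stand-in) VAE encoder. The flow-matching-style objective is an assumption for illustration, not necessarily the exact training loss used here.

```python
import torch
import torch.nn as nn

# Stand-ins: a frozen semantic encoder vs. a frozen VAE encoder, each followed
# by a compression layer so the first-stage generator sees 8-channel targets
# of identical shape in both ablation settings.
semantic_path = nn.Linear(2048, 8)  # compressed semantic features (stand-in)
vae_path = nn.Linear(16, 8)         # compressed VAE latents (stand-in)

denoiser = nn.Sequential(nn.Linear(8 + 1, 256), nn.SiLU(), nn.Linear(256, 8))
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)


def training_step(target_tokens):
    """One illustrative flow-matching step on compressed targets of shape (B, N, 8)."""
    noise = torch.randn_like(target_tokens)
    t = torch.rand(target_tokens.shape[0], 1, 1)
    x_t = (1 - t) * target_tokens + t * noise  # linear interpolation path
    velocity = noise - target_tokens           # flow-matching regression target
    pred = denoiser(torch.cat([x_t, t.expand(-1, x_t.shape[1], 1)], dim=-1))
    loss = nn.functional.mse_loss(pred, velocity)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Only this switch differs between the two ablation settings; the generator,
# optimizer, data, and number of steps stay identical.
use_semantic = True
features = torch.randn(2, 256, 2048) if use_semantic else torch.randn(2, 256, 16)
targets = (semantic_path if use_semantic else vae_path)(features).detach()
print(training_step(targets))
```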
We compare the proposed SemanticGen with state-of-the-art T2V methods. For short video generation, we use Wan2.1-T2V-14B [3] and HunyuanVideo [4] as baselines. For long video generation, we use the open-source models SkyReels-V2 [5], Self-Forcing [6], and LongLive [7] as baselines. To provide a reliable assessment of the proposed paradigm, we also include strong baselines that continue training the base model with the standard diffusion loss and no semantic modeling, keeping the data and the number of training steps identical; these are denoted Base-CT and Base-Swin-CT.