SemanticGen: Video Generation in Semantic Space

¹Zhejiang University, ²Kling Team, Kuaishou Technology, ³CUHK, ⁴DLUT, ⁵HUST

Quick Overview

Abstract

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it converges slowly and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, which addresses these limitations by generating videos in the semantic space. Our main insight is that, because videos are inherently redundant, generation should begin in a compact, high-level semantic space for global planning and then add high-frequency details, rather than directly modeling a vast set of low-level video tokens with bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space converges faster than generation in the VAE latent space. Our method also remains effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
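To make the cost argument concrete, the following back-of-the-envelope sketch counts the query-key pairs that full bi-directional self-attention scores, which grows with the square of the token count. The grid sizes and the helper names `num_tokens` and `attention_pairs` are hypothetical and chosen only for easy arithmetic; they are not the token counts used by SemanticGen.

```python
def num_tokens(frames: int, height: int, width: int) -> int:
    """Tokens in a video represented as a frames x height x width grid."""
    return frames * height * width


def attention_pairs(tokens: int) -> int:
    """Query-key pairs scored by full bi-directional self-attention."""
    return tokens * tokens


# Hypothetical grid sizes, chosen only to make the arithmetic easy to follow.
vae_tokens = num_tokens(frames=16, height=32, width=32)      # 16384 tokens
semantic_tokens = num_tokens(frames=16, height=8, width=8)   # 1024 tokens

# A 16x smaller token set means 256x fewer attention pairs.
token_ratio = vae_tokens // semantic_tokens
pair_ratio = attention_pairs(vae_tokens) // attention_pairs(semantic_tokens)

# Doubling the clip length doubles the tokens but quadruples the pairs.
long_vae_tokens = num_tokens(frames=32, height=32, width=32)
growth = attention_pairs(long_vae_tokens) // attention_pairs(vae_tokens)

print(token_ratio, pair_ratio, growth)  # prints: 16 256 4
```

Under these assumed sizes, a 16-fold reduction in tokens gives a 256-fold reduction in attention pairs, which is the kind of saving that makes a compact planning space attractive for long videos.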

Method

SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output.
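The control flow of the two stages can be sketched with toy stand-ins. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_semantic_step` and `toy_latent_step` are hypothetical placeholders for the two trained diffusion models, the feature sizes are arbitrary, and the final VAE decoding to pixels is omitted.

```python
import random


def sample(denoise_step, size, steps, rng, cond=None):
    """Toy sampler: start from Gaussian noise and refine it step by step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in range(steps, 0, -1):
        x = denoise_step(x, t, cond)
    return x


def toy_semantic_step(x, t, cond):
    # Hypothetical stand-in for the stage-1 model: pull the noise halfway
    # toward a fixed pattern that plays the role of a "global layout".
    # A real model would also use the timestep t; this toy ignores it.
    target = [1.0 if i % 2 == 0 else -1.0 for i in range(len(x))]
    return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]


def toy_latent_step(x, t, cond):
    # Hypothetical stand-in for the stage-2 model: each latent element is
    # pulled toward the semantic feature that covers it (the conditioning).
    ratio = len(x) // len(cond)
    return [xi + 0.5 * (cond[i // ratio] - xi) for i, xi in enumerate(x)]


def generate(seed=0, semantic_size=8, latent_size=64, steps=10):
    """Stage 1 produces compact features; stage 2 is conditioned on them."""
    assert latent_size % semantic_size == 0, "toy conditioning needs an exact ratio"
    rng = random.Random(seed)
    semantic = sample(toy_semantic_step, semantic_size, steps, rng)
    latents = sample(toy_latent_step, latent_size, steps, rng, cond=semantic)
    return semantic, latents  # a VAE decoder would map latents to pixels


semantic, latents = generate()
print(len(semantic), len(latents))  # prints: 8 64
```

The point of the sketch is the dependency structure: the larger latent set is never sampled on its own, only conditioned on the smaller semantic set generated first.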



Demos

Long Video Generation

A realistic video showing a young woman walking along a lakeside path while using her phone, and a man in a modern office talking on his phone, captured through a mix of wide and close-up shots that emphasize their gestures, and the contrast between the natural, sunlit outdoor setting and the clean, softly lit indoor environment.
A realistic video showing an older military officer and a younger man reviewing and discussing documents while walking and pausing in a dimly lit institutional hallway, captured through alternating close-ups and medium shots that highlight their serious facial expressions, and the tense atmosphere of their interaction.
A realistic video showing an older woman with light skin and silver hair in a peaceful garden filled with flowers and greenery, wearing a sunhat and a long floral dress as she gently tends to her plants, captured with a mix of medium shots and close-ups that highlight her calm expression and the vibrant, sunlit surroundings.
A realistic video set at dawn on a tranquil beach, showing a woman of medium skin tone with a calm, contemplative expression strolling barefoot along the shoreline, captured through occasional close-ups that emphasize the soft morning light, the quiet rhythm of the waves, and the relaxed, introspective mood of her solitary walk.
A realistic video that begins with a close-up of a single lily pad floating on a still pond and slowly expands to wider shots as mist rises and then dissipates, ending on a broad view of the entire pond and its reflections, captured to emphasize the gradual change in light, atmosphere, and tranquil mood.
A realistic video showing a young man and woman engaged in an intense conversation on a cobblestone path in a park, captured through alternating close-ups and medium shots that highlight their shifting expressions, tense body language, and the intimate atmosphere of their interaction within the quiet outdoor setting.
A realistic video showing a man and a woman engaged in an intense conversation indoors, captured through alternating close-up and medium shots that highlight their serious expressions. The man is wearing a dark suit jacket over a light blue collared shirt and a dark tie, while the woman is wearing a simple white sweater, both framed in a softly lit, neutral interior setting.
Two women in professional attire engage in a serious conversation in a modern office setting. The first is an East Asian woman, with short, dark brown hair styled in a bob with bangs. She is wearing a dusty rose-colored turtleneck sweater under a matching blazer and small, sparkling stud earrings. The second subject is another East Asian woman, with long, dark brown hair tied back in a low ponytail.
A realistic video showing two men engaged in a serious conversation in a dimly lit, high-tech office at night, captured through alternating close-ups and medium shots that emphasize their focused expressions and the moody, professional atmosphere. Subtle blue light from electronic displays creates reflections on glass surfaces, adding a tense, modern tone to their interaction.
A young East Asian man and woman have an emotional conversation in a dimly lit indoor setting, with alternating close-up shots that emphasize their expressions.
Two middle-aged men, one in a black jacket and the other in a dark suit, engage in a serious conversation within a modern, dimly lit interior space.
A middle-aged African American woman and man engage in an emotional conversation indoors, while two children playfully enter the room.
A skier in a red jacket moves along a snow-covered mountain slope, against a backdrop of majestic mountains and a golden sunset.
A vast, snow-capped mountain range with a turquoise glacial lake and winding rivers is captured in a panoramic view as the camera slowly pulls back to reveal the expansive landscape.
A cozy cottage in a dense forest at night is surrounded by a campfire, glowing lanterns, and a flowing stream, with leaves falling and fireflies twinkling.
A serene, dark mountain landscape is reflected in calm water under a night sky illuminated by the vibrant green and purple aurora borealis.
A middle-aged man and a young woman engage in a serious conversation in an office setting, captured through alternating close-up shots.
A young woman with reddish-brown hair stands in a mountainous landscape during sunset, gazing thoughtfully as the wind blows her hair.


Short Video Generation

A bee hovers above a flower, its wings beating rapidly in the soft light of the morning.
A golden feather floats gently on a windless, glass-like lake, its edges glowing softly as it drifts across the water’s surface.
A giant flower blooms in the center of a frozen lake, its petals glowing with a soft, ethereal light.
A man in a dark suit walks down a dusty street in a small town at dusk, with a mountainous landscape in the background.
A skier navigates a snowy slope at sunset, making sharp turns and kicking up snow.
Two individuals in Spider-Man costumes sit side by side on a rooftop, gazing at a cityscape at dusk.
Two boxers engage in a fierce match inside a dimly lit cage surrounded by an enthusiastic audience.
A car drives through a neon-lit cyberpunk city street, passing by a motel with a red neon sign.
A man in a black tuxedo raises a glass in a celebratory toast against a backdrop of sparkling lights and reflective surfaces.
A traditional wooden structure is perched on a cliffside in a serene mountainous landscape at sunset.
A skier in a red jacket moves along a snow-covered mountain slope, against a backdrop of majestic mountains and a golden sunset.
A sleek, silver sports car drives along a wet beach, leaving a trail of sand and water spray behind it.
A person is riding a bike along a winding mountain road under a clear blue sky.
A playful golden retriever playing in a lush green field.
Two Asian men on a rooftop overlooking a city, where an older man in a black jacket points a gun at a younger man in a suit.

Visualization: Video Generation from Semantic Representations of a Reference Video

Ablation on Semantic Space Compression

Ablation on Generation in Semantic Space vs. VAE Space

Comparisons

Methods compared: HunyuanVideo, Wan2.1-T2V-14B, Base-CT, SemanticGen
Prompt: A snowflake lands on a dark windowsill, its intricate shape briefly visible before melting away. The camera captures the fleeting beauty as the snowflake melts.

Methods compared: HunyuanVideo, Wan2.1-T2V-14B, Base-CT, SemanticGen
Prompt: A man in a suit and hat walks down a dusty street, with a mountain in the background, then turns his head left as he approaches the camera.

Methods compared: MAGI-1, SkyReels-V2, Self-Forcing, LongLive, HoloCine, Base-Swin-CT, SemanticGen
Prompt: The video showcases a mountain ridge with swirling clouds, as sunlight breaks through, captured in a wide shot with high dynamic range and crisp lighting.

Methods compared: MAGI-1, SkyReels-V2, Self-Forcing, LongLive, HoloCine, Base-Swin-CT, SemanticGen
Prompt: A group of people engages in a serious conversation in a modern home, captured through alternating close-ups and medium shots that emphasize their expressions.


References:
[1] Bai, Shuai, et al. "Qwen2.5-VL Technical Report." arXiv preprint arXiv:2502.13923 (2025).
[2] Ouyang, Wenqi, et al. "TokensGen: Harnessing Condensed Tokens for Long Video Generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025.
[3] Team Wan, et al. "Wan: Open and Advanced Large-Scale Video Generative Models." arXiv preprint arXiv:2503.20314 (2025).
[4] Kong, Weijie, et al. "HunyuanVideo: A Systematic Framework for Large Video Generative Models." arXiv preprint arXiv:2412.03603 (2024).
[5] Chen, Guibin, et al. "SkyReels-V2: Infinite-Length Film Generative Model." arXiv preprint arXiv:2504.13074 (2025).
[6] Huang, Xun, et al. "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion." arXiv preprint arXiv:2506.08009 (2025).
[7] Yang, Shuai, et al. "LongLive: Real-Time Interactive Long Video Generation." arXiv preprint arXiv:2509.22622 (2025).