NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

Xiaokun Feng^1,2,3, Haiming Yu³, Meiqi Wu^3,4*, Shiyu Hu⁵, Jintao Chen^3,6,
Chen Zhu^3,, Jiahong Wu³, Xiangxiang Chu³, Kaiqi Huang^1,2,

¹ School of Artificial Intelligence, UCAS    ² CASIA    ³ AMAP, Alibaba Group
⁴ School of Computer Science and Technology, UCAS
⁵ School of Physical and Mathematical Sciences, NTU    ⁶ PKU

Paper GitHub

Huggingface

Abstract

With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to greater content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (\eg, VBench). To comprehensively assess the Narrative expression capabilities of Long Video generation models, we propose NarrLV - a novel benchmark inspired by film narrative theory. (i) First, we introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models that underpin them. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

I. Framework of our NarrLV

(a) Our prompt suite is inspired by film narrative theory and identifies three key factors influencing Temporal Narrative Atom (TNA) transitions. Based on these, we construct a prompt generation pipeline capable of producing evaluation prompts with flexibly adjustable TNA counts. (b) Our evaluation models include long video generation models and the foundation models they often rely on. (c) Based on the progressive expression of narrative content, we conduct evaluations from three dimensions, employing an MLLM-based question generation and answering framework for calculations. Our metric is well-aligned with human preferences.

II. Evaluation Results

Our evaluation model encompasses existing long video generation models as well as the foundational generation models they typically rely on:

Note: We will continuously update the evaluation results of the latest long video generation models.

III. Extensible TNA-Driven Prompt Suite

The number of Temporal Narrative Atoms (TNA) serves as a quantitative measure of narrative richness. Unlike existing representative benchmarks that concentrate on prompts with only a small number of TNAs in a narrow range, our innovative prompt suite can flexibly expand narrative content richness, thereby enabling a thorough assessment of the full narrative capabilities of long video generation models. Its word cloud clearly shows that words like "suddenly," "next," and "finally," which pertain to the progression of narrative content, hold significant weight, aligning with our narrative-centric evaluation objectives.

IV. Progressive Narrative-Expressive Evaluation Metric

To systematically evaluate the narrative quality of long video generation, we introduce three core metrics: Narrative Element Fidelity (R_fid), Narrative Unit Coverage (R_cov), and Narrative Unit Coherence (R_coh). Among these, R_fid focuses on the generation performance of narrative elements represented by scenes and objects. R_cov and R_coh emphasize the generation quality of narrative units composed of narrative elements. In addition, we utilize a recently popular MLLM-based question generation and answer framework for quantitative evaluation.

V. Video Demonstrations

For the three factors that we consider crucial to narrative richness (i.e., scene attribute change, object attribute change, and object action change), we present the corresponding evaluation prompts and generated video cases for each specified TNA level.

Scene Attribute Change

Prompt: (#TNA=1) In a picturesque setting, surrounded by verdant hills and vibrant flowers, a plush bear stands among the blossoms.

Model: CogVideoX1.5-5B

Prompt: (#TNA=2) At a bustling harbor area, rows of colorful shipping containers are neatly aligned, and the sky is overcast. As the scene progresses, the clouds part, revealing a bright blue sky and bathing the containers in warm sunlight.

Model: HunyuanVideo

Prompt: (#TNA=3) In the vast desert, a majestic pyramid towers in the scorching noon sun, its golden stones shimmering brightly. As the sun sets, the sky turns a deep orange. Finally, night falls, and the pyramid is bathed in the soft glow of moonlight.

Model: FreeNoise

Prompt: (#TNA=4) In the vibrant and colorful farm simulation game interface, the screen first shows a sunny day with green fields and crops. Suddenly, the weather changes to a rainy scene, with droplets covering the screen and the fields turning darker due to wet soil. Then, snow starts to fall, blanketing the entire landscape in white with the game interface elements subtly adjusting to the snowy environment. Finally, the snow melts away as the sun comes back out, restoring the lush green fields and brightening the entire interface once again.

Model: FreeLong

Prompt: (#TNA=5) In a stage setting, a spotlight illuminates a microphone at the center of the stage. Suddenly, the lights dim, casting the entire room in shadow. Then, red neon lights brighten the scene, adding a dramatic atmosphere. Next, the lights transition to a cool blue, shifting the ambiance to a more serene state. Finally, the space is bathed in soft natural sunlight, bringing a tranquil glow to the environment.

Model: Open-Sora-Plan

Prompt: (#TNA=6) In a bustling futuristic cityscape, neon lights flash across towering skyscrapers, casting vibrant hues onto the streets below. Suddenly, a thick fog rolls in, muting the bright colors and shrouding the scene in mist. Then, the fog clears, revealing a clear night sky dotted with stars, making the entire city appear more serene. Shortly after, an unexpected electrical blackout plunges the city into darkness, creating an eerie atmosphere. The power is restored, but the neon lights now flicker erratically, causing a chaotic visual effect. Finally, dawn breaks, and the first light of day washes over the city, reinstating calm and clarity.

Model: Wan2.1

Object Attribute Change

Prompt: (#TNA=1) In a war-torn landscape filled with billowing smoke, Iron Man, clad in shining red and gold armor, stands on the ground.

Model: Wan2.1

Prompt: (#TNA=2) Inside a tranquil temple interior, a Buddha statue made of white marble sits serenely on an ornate pedestal. Gradually, the statue's material shifts from white marble to shimmering gold.

Model: Open-Sora-Plan

Prompt: (#TNA=3) In a modern laboratory, a square silver metal plate rests quietly on a workbench. Gradually, the color of the metal plate changes from silver to gold. Finally, the square gold metal plate transforms into a circular gold metal plate.

Model: FreeLong

Prompt: (#TNA=4) The scene shows a torch, with a faint yellow light emanating from the flame above it. Gradually, the light of the flame changes from yellow to blue. Then, the light of the flame transitions from blue to red. Finally, the light of the flame changes from blue to green.

Model: CogVideoX1.5-5B

Prompt: (#TNA=5) In an animated historical scene, an old wooden bridge stands over a calm river. Gradually, the wooden bridge's material changes to stone. Next, the stone bridge transforms into an iron bridge. Then, the iron bridge changes into a golden bridge. Finally, the golden bridge turns into a crystal bridge.

Model: FIFO-Diffusion

Prompt: (#TNA=6) At a lively beach party, a person is dancing gleefully near the shoreline. Initially, the person's face is adorned with a bright, cheerful expression. Suddenly, the expression on the face changes to surprise. Next, the face transitions to a look of confusion. Then, the expression shifts to one of contemplation. After that, the face changes to a look of determination. Finally, the expression evolves into one of sheer joy.

Model: HunyuanVideo

Object Action Change

Prompt: (#TNA=1) In a vibrant and colorful children's cartoon hallway, a little girl joyfully hops and skips.

Model: Open-Sora-Plan

Prompt: (#TNA=2) In a warmly lit music performance room, a young man sits at a piano, focusing intently on the sheet music before him. After a moment, he begins to play the piano.

Model: HunyuanVideo

Prompt: (#TNA=3) In a tidy laboratory, a flask filled with liquid is placed on the lab bench, and a scientist stands in front of the bench. Next, the scientist picks up the flask to observe it. Then, the scientist begins to shake the flask.

Model: FreeNoise

Prompt: (#TNA=4) On an esports competition stage, a player wearing headphones is seated intently in front of a computer setup. Initially, the player rapidly types on the keyboard. Next, the player suddenly stands up. Then, the player removes the headphones. Finally, the player raises both arms in victory.

Model: Wan2.1

Prompt: (#TNA=5) In a futuristic video game cityscape, towering skyscrapers and bustling vehicles create a dynamic scene, with an advanced hovercar swiftly moving along the street. Suddenly, the hovercar stops in front of a massive skyscraper. It then ascends vertically along the building's façade. Next, the vehicle hovers in the air, transforming into a flying drone. Afterwards, the drone begins to circle around the skyscraper.

Model: FIFO-Diffusion

Prompt: (#TNA=6) In a reading room, there is a bookshelf against the wall filled with various books and newspapers. A young student stands next to the bookshelf, contemplating which book to choose. Then, the student pulls a book off the shelf. After that, the student opens the book and begins to read. Next, the student closes the book and puts it back on the shelf. Following that, the student takes another newspaper from the shelf. Finally, the student sits down to read the newspaper.

Model: FreeLong