✨Video-T1: Test-Time Scaling for Video Generation✨


Video-T1: We present the generation results and performance improvements of video generation under test-time scaling (TTS) settings. Videos generated with TTS are of higher quality and more consistent with the prompt than those generated without TTS.

📢 News

  • 2025.3.24 🤗🤗🤗 We release Video-T1: Test-time Scaling for Video Generation

🎉 Results

Results Visualization

Results of Test-Time Scaling for Video Generation. As the number of samples in the search space increases by scaling test-time computation, the models' performance exhibits consistent improvement.

🌟 Pipeline

Pipeline Visualization

Pipeline of Test-Time Scaling for Video Generation. Top: Random Linear Search for TTS video generation randomly samples Gaussian noises, prompts the video generator to produce a sequence of video clips through step-by-step denoising in a linear manner, and selects the clip with the highest score from the test verifiers. Bottom: Tree of Frames (ToF) Search for TTS video generation divides the video generation process into three stages: (a) the first stage performs image-level alignment that influences the later frames; (b) the second stage applies dynamic prompts in the test verifiers V to focus on motion stability and physical plausibility, providing feedback that guides the heuristic search process; (c) the last stage assesses the overall quality of the video and selects the video with the highest alignment with the text prompt.
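To make the Random Linear Search branch concrete, here is a minimal sketch of the idea, assuming an illustrative generator callable generate_video(prompt, noise) and a verifier callable score(video, prompt); both names (and the noise shape) are placeholders, not the repository's actual API.

import torch

def random_linear_search(prompt, generate_video, score, num_samples=8, seed=0):
    """Sample several Gaussian noises, denoise each into a full clip,
    and keep the clip the test-time verifier scores highest."""
    gen = torch.Generator().manual_seed(seed)
    best_video, best_score = None, float("-inf")
    for _ in range(num_samples):
        noise = torch.randn(1, 16, 8, 32, 32, generator=gen)  # placeholder latent shape
        video = generate_video(prompt, noise)   # step-by-step (linear) denoising
        s = score(video, prompt)                # test verifier score
        if s > best_score:
            best_video, best_score = video, s
    return best_video, best_score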

🔧 Installation

Dependencies:

git clone https://github.com/liuff19/Video-T1.git
cd VideoT1
conda create -n videot1 python==3.10
conda activate videot1
pip install -r requirements.txt
git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]"

Model Checkpoints:

You need to download the following models:

  • Pyramid-Flow model checkpoint (for video generation)
  • VisionReward-Video model checkpoint (for video reward guidance)
  • (Optional) Image-CoT-Generation model checkpoint (for ImageCoT)
  • (Optional) DeepSeek-R1-Distill-Llama-8B (Or other LLMs) model checkpoint (for hierarchical prompts)
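These checkpoints are hosted on Hugging Face, so one convenient option is huggingface_hub.snapshot_download. A minimal sketch follows; the repository IDs and local directories are examples only and should be checked against the official model cards.

from huggingface_hub import snapshot_download

# Repository IDs below are illustrative; confirm them on the official model cards.
snapshot_download("rain1011/pyramid-flow-sd3", local_dir="ckpts/pyramid-flow")        # video generation
snapshot_download("THUDM/VisionReward-Video", local_dir="ckpts/visionreward-video")   # video reward guidance
# Optional components:
# snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", local_dir="ckpts/deepseek-r1-8b")  # hierarchical prompts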

💻 Inference

1. Quick start

cd VideoT1
# Modify videot1.py to assign checkpoints correctly.
python videot1.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard

2. Inference Code

For inference, please refer to videot1.py for usage.

# Import Pipeline and Base Model
import torch
from pyramid_flow.pyramid_dit import PyramidDiTForVideoGeneration
from pipeline.videot1_pipeline import VideoT1Generator

# Initialize Pyramid-Flow Model
pyramid_model = init_pyramid_model(model_path, device, model_variant)

# Initialize VisionReward Model
reward_model, tokenizer = init_vr_model(vr_path, device)

# Initialize VideoT1 Generator
generator = VideoT1Generator(
    pyramid_model,
    device,
    dtype=torch.bfloat16,
    image_selector_path=imgcot_path,
    result_path=result_path,
    lm_path=lm_path,
)

# Use the generator to generate videos with the TTS strategy
# (generation arguments courtesy of Pyramid-Flow)
best_video = generator.videot1_gen(
    prompt=prompt,
    num_inference_steps=[20, 20, 20],        # Inference steps for image branch at each level
    video_num_inference_steps=[20, 20, 20],  # Inference steps for video branch at each level
    height=height,
    width=width,
    num_frames=temp,
    guidance_scale=7.0,
    video_guidance_scale=5.0,
    save_memory=True,
    inference_multigpu=True,
    video_branching_factors=video_branch,
    image_branching_factors=img_branch,
    reward_stages=reward_stages,
    hierarchical_prompts=True,
    result_path=result_path,
    intermediate_path=intermed_path,
    video_name=video_name,
    **reward_params
)

3. Multi-GPU Inference

Save GPU memory by loading different models on different GPUs to avoid OOM problems.

Example: Load the reward model on GPU 0, Pyramid-Flow on GPU 1, the Image-CoT model on GPU 2, and the LLM on GPU 3:

# Load models on different GPUs
python videot1_multigpu.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard --reward_device_id 0 --base_device_id 1 --imgcot_device_id 2 --lm_device_id 3

Please refer to videot1_multigpu.py for multi-GPU inference.
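The underlying idea is simply per-model device placement. A minimal, illustrative PyTorch sketch of the pattern is shown below, using toy stand-in modules rather than the repository's loaders; each placement corresponds to one of the CLI flags of videot1_multigpu.py.

import torch.nn as nn

# Toy stand-ins for the real models, only to illustrate per-GPU placement.
reward_model  = nn.Linear(8, 1)   # stands in for VisionReward-Video
pyramid_model = nn.Linear(8, 8)   # stands in for Pyramid-Flow
imgcot_model  = nn.Linear(8, 8)   # stands in for the Image-CoT selector

# Pin each component to its own GPU so no single card has to hold everything.
reward_model.to("cuda:0")    # --reward_device_id 0
pyramid_model.to("cuda:1")   # --base_device_id 1
imgcot_model.to("cuda:2")    # --imgcot_device_id 2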

4. Usage Tips

  1. reward_stages: Choose three indices for reward-model pruning. If the tree's current depth is one of these three indices, all video clips at that depth are fed into the reward model for judging.

  2. variant: We recommend 768p for better quality; choose from 384 or 768 (same as Pyramid-Flow).

  3. img_branch: A list of integers, each corresponding to the number of images at the beginning of the ImageCoT process at that depth.

  4. video_branch: A list of integers, each corresponding to the number of next frames generated at that depth.
    Namely, if img_branch is an array A[] and video_branch is an array B[], then at depth i each branch starts from A[i]×B[i] initial images, and B[i] next latent frames become the children of each branch (see the sketch below).
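A small worked example of this branching arithmetic (the list values are illustrative only):

# Illustrative values for the branching lists described above.
img_branch   = [3, 2, 2]   # A[i]: images at the start of ImageCoT at depth i
video_branch = [2, 2, 1]   # B[i]: next latent frames generated per branch at depth i

for depth, (a, b) in enumerate(zip(img_branch, video_branch)):
    # At depth i each branch starts from A[i] * B[i] initial images,
    # and B[i] next latent frames become that branch's children.
    print(f"depth {depth}: {a * b} initial images per branch, {b} children per branch")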

🚀 TODO

We will release the dataset for test-time scaling on CogVideoX-5B.

Acknowledgement

We are thankful for the following great works that we built on when implementing Video-T1, and for Yixin's great figure design:

Pyramid-Flow
NOVA
VisionReward
VideoLLaMA3
CogVideoX
OpenSora
Image-Generation-CoT

📚 Citation

@misc{liu2025videot1testtimescalingvideo,
      title={Video-T1: Test-Time Scaling for Video Generation},
      author={Fangfu Liu and Hanyang Wang and Yimo Cai and Kaiyan Zhang and Xiaohang Zhan and Yueqi Duan},
      year={2025},
      eprint={2503.18942},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18942},
}
