Video-T1: We present the generative quality and performance improvements of video generation under test-time scaling (TTS). Videos generated with TTS are of higher quality and more consistent with the prompt than those generated without it.
2025.3.24: 🤗🤗🤗 We release Video-T1: Test-time Scaling for Video Generation.
Results of Test-Time Scaling for Video Generation. As the number of samples in the search space increases with additional test-time computation, model performance improves consistently.
Pipeline of Test-Time Scaling for Video Generation. Top: Random Linear Search for TTS video generation randomly samples Gaussian noises, prompts the video generator to produce a sequence of video clips through step-by-step denoising in a linear manner, and selects the clip with the highest score from the test verifiers. Bottom: Tree of Frames (ToF) Search for TTS video generation divides the generation process into three stages: (a) the first stage performs image-level alignment, which influences the later frames; (b) the second stage applies dynamic prompts in the test verifiers V, focusing on motion stability and physical plausibility, to provide feedback that guides the heuristic search; (c) the last stage assesses the overall quality of the video and selects the one with the highest alignment with the text prompt.
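For intuition, here is a minimal sketch of the Random Linear Search baseline. The names `generate_video` and `verifier_score` are hypothetical stand-ins for the actual generator and verifier calls in this repo, not real functions from it:

```python
def random_linear_search(prompt, num_samples, generate_video, verifier_score):
    """Sample several noises, denoise each into a full video, keep the best-scoring one."""
    best_video, best_score = None, float("-inf")
    for _ in range(num_samples):
        video = generate_video(prompt)          # full step-by-step denoising from fresh Gaussian noise
        score = verifier_score(video, prompt)   # test verifier rates quality and prompt alignment
        if score > best_score:
            best_video, best_score = video, score
    return best_video
```

ToF Search replaces these independent linear rollouts with a tree whose branches are pruned by verifier feedback at intermediate depths, so computation concentrates on promising frames.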
```bash
git clone https://github.com/liuff19/Video-T1.git
cd VideoT1
conda create -n videot1 python==3.10
conda activate videot1
pip install -r requirements.txt
git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]"
```
You need to download the following model checkpoints (a download sketch follows the list):
- Pyramid-Flow model checkpoint (for video generation)
- VisionReward-Video model checkpoint (for video reward guidance)
- (Optional) Image-CoT-Generation model checkpoint (for ImageCoT)
- (Optional) DeepSeek-R1-Distill-Llama-8B (or another LLM) model checkpoint (for hierarchical prompts)
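For convenience, a hedged download sketch using `huggingface-cli` is below. The repo IDs are our best guesses for the published checkpoints; verify them against the links in this README before downloading, and fetch the Image-CoT checkpoint from its own project page.

```bash
# Hypothetical download sketch; confirm each repo ID on the Hugging Face Hub first.
huggingface-cli download rain1011/pyramid-flow-sd3 --local-dir ckpts/pyramid-flow
huggingface-cli download THUDM/VisionReward-Video --local-dir ckpts/visionreward-video
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B --local-dir ckpts/deepseek-r1-llama-8b
```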
```bash
cd VideoT1
# Modify videot1.py to assign checkpoint paths correctly.
python videot1.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard
```
For inference, please refer to videot1.py for usage.
```python
import torch

# Import the pipeline and base model
from pyramid_flow.pyramid_dit import PyramidDiTForVideoGeneration
from pipeline.videot1_pipeline import VideoT1Generator

# Initialize the Pyramid-Flow model
pyramid_model = init_pyramid_model(model_path, device, model_variant)

# Initialize the VisionReward model
reward_model, tokenizer = init_vr_model(vr_path, device)

# Initialize the VideoT1 generator
generator = VideoT1Generator(
    pyramid_model,
    device,
    dtype=torch.bfloat16,
    image_selector_path=imgcot_path,
    result_path=result_path,
    lm_path=lm_path,
)

# Courtesy of Pyramid-Flow
# Use the generator to generate videos with the TTS strategy
best_video = generator.videot1_gen(
    prompt=prompt,
    num_inference_steps=[20, 20, 20],        # inference steps for the image branch at each level
    video_num_inference_steps=[20, 20, 20],  # inference steps for the video branch at each level
    height=height,
    width=width,
    num_frames=temp,
    guidance_scale=7.0,
    video_guidance_scale=5.0,
    save_memory=True,
    inference_multigpu=True,
    video_branching_factors=video_branch,
    image_branching_factors=img_branch,
    reward_stages=reward_stages,
    hierarchical_prompts=True,
    result_path=result_path,
    intermediate_path=intermed_path,
    video_name=video_name,
    **reward_params
)
```
Save GPU memory by loading different models on different GPUs to avoid OOM errors.
Example: load the reward model on GPU 0, Pyramid-Flow on GPU 1, the Image-CoT model on GPU 2, and the LLM on GPU 3.
```bash
# Load models on different GPUs
python videot1_multigpu.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard --reward_device_id 0 --base_device_id 1 --imgcot_device_id 2 --lm_device_id 3
```
Please refer to videot1_multigpu.py for multi-GPU inference.
reward_stages: Three tree depths at which reward-model pruning is applied. When the current depth of the tree is one of these indices, all video clips at that depth are fed to the reward model for scoring.
variant: Resolution variant; choose from 384 or 768 (same as Pyramid-Flow). We recommend 768 for better quality.
img_branch: A list of integers; each entry is the number of images that start the ImageCoT process at that depth.
video_branch: A list of integers; each entry is the number of next frames generated at that depth.
Namely, if img_branch is [i_0, i_1, ...] and video_branch is [v_0, v_1, ...], then at depth k each branch starts with i_k initial images, and v_k next latent frames become the children of each branch (see the configuration sketch below).
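For example, a minimal configuration sketch. The values below are illustrative, not recommended settings; the variable names match the arguments of videot1_gen shown above:

```python
# Illustrative branching configuration; tune these for your compute budget.
img_branch = [3, 2, 2]     # depth 0: 3 initial images per branch; depths 1-2: 2 each
video_branch = [4, 3, 2]   # depth 0: 4 candidate next frames per branch, then 3, then 2
reward_stages = [0, 1, 2]  # depths where every clip is scored (and pruned) by the reward model
```

Larger branching factors widen the search space and typically improve the selected video, at the cost of proportionally more test-time computation.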
We will release the dataset for test-time scaling on CogVideoX-5B.
We are grateful to the following great works, which we referred to when implementing Video-T1, and to Yixin for the figure design:
Pyramid-Flow
NOVA
VisionReward
VideoLLaMA3
CogVideoX
OpenSora
Image-Generation-CoT
```bibtex
@misc{liu2025videot1testtimescalingvideo,
  title={Video-T1: Test-Time Scaling for Video Generation},
  author={Fangfu Liu and Hanyang Wang and Yimo Cai and Kaiyan Zhang and Xiaohang Zhan and Yueqi Duan},
  year={2025},
  eprint={2503.18942},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.18942},
}
```