Video-T1: We present the generation quality and performance improvements of video generation under test-time scaling (TTS). Videos generated with TTS are of higher quality and more consistent with the prompt than those generated without it.
2025.3.24: 🤗🤗🤗 We release Video-T1: Test-Time Scaling for Video Generation.
Results of Test-Time Scaling for Video Generation. As the number of samples in the search space grows with increased test-time computation (TTS), model performance improves consistently.
Pipeline of Test-Time Scaling for Video Generation. Top: Random Linear Search for TTS video generation randomly samples Gaussian noises, prompts the video generator to produce a sequence of video clips through step-by-step denoising in a linear manner, and selects the candidate with the highest score from the test verifiers. Bottom: Tree-of-Frames (ToF) Search for TTS video generation divides the video generation process into three stages: (a) the first stage performs image-level alignment that influences the later frames; (b) the second stage applies dynamic prompts in the test verifiers V, focusing on motion stability and physical plausibility to provide feedback that guides the heuristic search process; (c) the last stage assesses the overall quality of the video and selects the video most aligned with the text prompt.
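To make the ToF search concrete, here is a minimal sketch of the heuristic beam search described above, assuming a `generate_children` callable that extends a partial clip and a `score` callable wrapping the test verifiers; these names, along with `stage_prompts` and `beam_width`, are illustrative stand-ins and not the repository's API.

```python
# Minimal ToF-style search sketch (illustrative only; not the Video-T1 API).
from dataclasses import dataclass, field

@dataclass
class Node:
    frames: list = field(default_factory=list)  # latent frames generated so far
    score: float = 0.0                          # verifier score of the partial clip

def tof_search(stage_prompts, branching, beam_width, generate_children, score):
    """Expand branching[d] children per surviving node at depth d, score each
    partial clip with the stage-specific verifier prompt, and keep only the
    top `beam_width` candidates before descending to the next depth."""
    beam = [Node()]
    for depth, num_children in enumerate(branching):
        candidates = []
        for node in beam:
            for child_frames in generate_children(node.frames, num_children):
                candidates.append(
                    Node(frames=child_frames,
                         score=score(child_frames, stage_prompts[depth]))
                )
        # Prune low-scoring branches early, as in stages (a)-(b) of the figure.
        beam = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam_width]
    # Stage (c): return the full-length clip best aligned with the text prompt.
    return max(beam, key=lambda n: n.score)
```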
```bash
git clone https://github.com/liuff19/Video-T1.git
cd VideoT1
conda create -n videot1 python==3.10
conda activate videot1
pip install -r requirements.txt
git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]"
```
You need to download the following models (a download sketch follows the list):
- Pyramid-Flow model checkpoint (for video generation)
- VisionReward-Video model checkpoint (for video reward guidance)
- (Optional) Image-CoT-Generation model checkpoint (for ImageCoT)
- (Optional) DeepSeek-R1-Distill-Llama-8B (Or other LLMs) model checkpoint (for hierarchical prompts)
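A minimal download sketch using huggingface_hub is below; the repo IDs and target directories are assumptions for illustration, so check each project's model card for the canonical checkpoint locations.

```python
# Hedged download sketch; the repo IDs below are assumed, verify on Hugging Face.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="rain1011/pyramid-flow-sd3",        # Pyramid-Flow (assumed ID)
                  local_dir="checkpoints/pyramid-flow")
snapshot_download(repo_id="THUDM/VisionReward-Video",         # VisionReward-Video (assumed ID)
                  local_dir="checkpoints/visionreward-video")
snapshot_download(repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # optional LLM (assumed ID)
                  local_dir="checkpoints/deepseek-r1-llama-8b")
```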
```bash
cd VideoT1
# Modify videot1.py to assign checkpoints correctly.
python videot1.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard
```
For inference, please refer to videot1.py for usage.
```python
# Import pipeline and base model
from pyramid_flow.pyramid_dit import PyramidDiTForVideoGeneration
from pipeline.videot1_pipeline import VideoT1Generator

# Initialize Pyramid-Flow model
pyramid_model = init_pyramid_model(model_path, device, model_variant)

# Initialize VisionReward model
reward_model, tokenizer = init_vr_model(vr_path, device)

# Initialize VideoT1 generator
generator = VideoT1Generator(
    pyramid_model,
    device,
    dtype=torch.bfloat16,
    image_selector_path=imgcot_path,
    result_path=result_path,
    lm_path=lm_path,
)

# Use the generator to generate videos with the TTS strategy
# (generation arguments courtesy of Pyramid-Flow)
best_video = generator.videot1_gen(
    prompt=prompt,
    num_inference_steps=[20, 20, 20],        # inference steps for the image branch at each level
    video_num_inference_steps=[20, 20, 20],  # inference steps for the video branch at each level
    height=height,
    width=width,
    num_frames=temp,
    guidance_scale=7.0,
    video_guidance_scale=5.0,
    save_memory=True,
    inference_multigpu=True,
    video_branching_factors=video_branch,
    image_branching_factors=img_branch,
    reward_stages=reward_stages,
    hierarchical_prompts=True,
    result_path=result_path,
    intermediate_path=intermed_path,
    video_name=video_name,
    **reward_params
)
```
Save GPU memory by loading different models on different GPUs to avoid out-of-memory (OOM) errors.
Example: load the reward model on GPU 0, Pyramid-Flow on GPU 1, the Image-CoT model on GPU 2, and the LLM on GPU 3:
```bash
# Load models on different GPUs
python videot1_multigpu.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard --reward_device_id 0 --base_device_id 1 --imgcot_device_id 2 --lm_device_id 3
```
Please refer to videot1_multigpu.py for multi-GPU inference.
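Conceptually, the multi-GPU script amounts to pinning each model to its own device at load time. Below is a hedged sketch reusing the init helpers from the snippet above; the device-placement details are assumptions, not the exact videot1_multigpu.py code.

```python
# Sketch of per-model device placement to avoid OOM (assumed, simplified).
import torch

reward_device = torch.device("cuda:0")   # --reward_device_id 0
base_device = torch.device("cuda:1")     # --base_device_id 1

# Each model lives on its own GPU, so no single device holds all the weights.
reward_model, tokenizer = init_vr_model(vr_path, reward_device)
pyramid_model = init_pyramid_model(model_path, base_device, model_variant)
```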
reward_stages: Choose three tree depths for reward-model pruning. If the tree's current depth is one of these indices, all video clips at that depth are fed into the reward model for scoring.
variant: Choose from 384 and 768 (same as Pyramid-Flow); we recommend 768 for better quality.
img_branch: A list of integers, each corresponding to the number of images generated at the beginning of the ImageCoT process at that depth.
video_branch: A list of integers, each corresponding to the number of next frames generated at that depth.
Namely, if img_branch is [i_0, i_1, …] and video_branch is [v_0, v_1, …], then at depth k each branch starts from i_k initial images, and each branch is expanded with v_k next latent frames as its children (see the example below).
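For instance, the following illustrative settings (example values only, not recommendations from the authors) produce a three-level search tree:

```python
# Example values for the arguments documented above (illustrative only).
img_branch = [3, 2, 2]      # ImageCoT images spawned per branch at depths 0, 1, 2
video_branch = [3, 2, 2]    # next-frame children generated per branch at each depth
reward_stages = [0, 1, 2]   # depths at which all clips are scored by the reward model
variant = 768               # choose 384 or 768; 768 recommended for quality
```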
We will release the dataset for test-time scaling on CogVideoX-5B.
We are grateful to the following great works referenced while implementing Video-T1, and to Yixin for the figure design:
Pyramid-Flow
NOVA
VisionReward
VideoLLaMA3
CogVideoX
OpenSora
Image-Generation-CoT
```bibtex
@misc{liu2025videot1testtimescalingvideo,
      title={Video-T1: Test-Time Scaling for Video Generation},
      author={Fangfu Liu and Hanyang Wang and Yimo Cai and Kaiyan Zhang and Xiaohang Zhan and Yueqi Duan},
      year={2025},
      eprint={2503.18942},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18942},
}
```