The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

Bingjie Gao1,2, Xinyu Gao1,2, Xiaoxue Wu3,2, Yujie Zhou1,2,
Yu Qiao2,✉, Li Niu1,✉, Xinyuan Chen2,✉, Yaohui Wang2,✉

1 Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 Fudan University

Abstract

Text-to-video (T2V) generative models trained on large-scale datasets have made remarkable advancements. However, T2V models are sensitive to input prompts, and poorly designed prompts can even degrade the generated results. Previous studies use large language models (LLMs) to directly rewrite user-provided prompts toward the distribution of training prompts, but they lack refined guidance that accounts for both prompt vocabulary and the specific sentence format. To this end, we introduce a retrieval-augmented prompt optimization (RAPO) framework. To address potential inaccuracies and ambiguous details produced by LLMs, RAPO optimizes the naive prompt along two branches and selects the better result for T2V generation. In the first branch, the user-provided prompt is augmented with relevant and diverse modifiers retrieved from a constructed relation graph, and then refactored into the format of training prompts by a fine-tuned LLM. In the second branch, the naive prompt is directly rewritten by a frozen LLM following a well-designed instruction. Extensive experiments demonstrate that our method effectively enhances both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts.
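
The retrieval of modifiers from a relation graph mentioned above can be pictured with a minimal sketch. The graph construction (word co-occurrence in training prompts), the scoring rule, and all names (build_relation_graph, retrieve_modifiers, top_k) are illustrative assumptions for exposition, not the released implementation.

# Hypothetical sketch: retrieving modifiers related to a naive prompt
# from a co-occurrence relation graph built over training prompts.
from collections import defaultdict
from itertools import combinations

def build_relation_graph(training_prompts):
    """Connect words that co-occur in the same training prompt (assumed graph form)."""
    graph = defaultdict(lambda: defaultdict(int))
    for prompt in training_prompts:
        words = set(prompt.lower().split())
        for u, v in combinations(sorted(words), 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph

def retrieve_modifiers(graph, naive_prompt, top_k=5):
    """Return the modifiers most strongly connected to words of the naive prompt.
    A real system would restrict candidates to a curated modifier vocabulary."""
    scores = defaultdict(int)
    query_words = set(naive_prompt.lower().split())
    for word in query_words:
        for neighbor, weight in graph.get(word, {}).items():
            if neighbor not in query_words:
                scores[neighbor] += weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [modifier for modifier, _ in ranked[:top_k]]

# Example usage with toy training prompts
prompts = ["a dog running on the beach at sunset", "a dog playing in slow motion"]
graph = build_relation_graph(prompts)
print(retrieve_modifiers(graph, "a dog running", top_k=3))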

Method


The naive prompt is optimized by two branches. In the first branch, it is enriched by the word augmentation module, which relies on a constructed relation graph and a frozen large language model (LLM). The augmented prompt is then refactored into a specific format by a fine-tuned LLM in the sentence refactoring module. In the second branch, the naive prompt is directly rewritten by a frozen LLM. Finally, the prompt selection module chooses the better of the two branches' results as the input for the T2V model, as sketched below.
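
To make the data flow of the two branches concrete, the following sketch wires the modules together in plain Python, reusing retrieve_modifiers from the earlier sketch. The LLM handles (frozen_llm, finetuned_llm, selector_llm), the instruction strings, and the selection criterion are placeholders chosen for illustration; they show one plausible reading of the pipeline, not the authors' actual code.

# Hypothetical end-to-end sketch of the RAPO data flow. The three LLM handles
# are assumed to be callables mapping a text instruction to a text response.

def augment_words(naive_prompt, relation_graph, frozen_llm):
    """Branch 1, step 1: merge retrieved modifiers into the naive prompt."""
    modifiers = retrieve_modifiers(relation_graph, naive_prompt, top_k=5)
    instruction = (
        "Enrich the following prompt with these modifiers where they fit "
        f"naturally: {', '.join(modifiers)}.\nPrompt: {naive_prompt}"
    )
    return frozen_llm(instruction)

def refactor_sentence(augmented_prompt, finetuned_llm):
    """Branch 1, step 2: rewrite into the format of the training prompts."""
    return finetuned_llm(augmented_prompt)

def rewrite_directly(naive_prompt, frozen_llm, rewrite_instruction):
    """Branch 2: a single-shot rewrite with a well-designed instruction."""
    return frozen_llm(f"{rewrite_instruction}\nPrompt: {naive_prompt}")

def select_prompt(candidate_a, candidate_b, naive_prompt, selector_llm):
    """Prompt selection: keep the candidate judged better for T2V generation."""
    query = (
        "Given the user intent below, answer A or B for the better video prompt.\n"
        f"Intent: {naive_prompt}\nA: {candidate_a}\nB: {candidate_b}"
    )
    return candidate_a if selector_llm(query).strip().startswith("A") else candidate_b

def rapo(naive_prompt, relation_graph, frozen_llm, finetuned_llm, selector_llm,
         rewrite_instruction):
    """Run both branches and return the selected prompt for the T2V model."""
    branch1 = refactor_sentence(
        augment_words(naive_prompt, relation_graph, frozen_llm), finetuned_llm)
    branch2 = rewrite_directly(naive_prompt, frozen_llm, rewrite_instruction)
    return select_prompt(branch1, branch2, naive_prompt, selector_llm)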

Results

Qualitative comparisons using LaVie with initial prompts (top) and optimized prompts from RAPO (bottom) are shown below.