RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling


Abstract. Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

Model Overview



RAPO++ integrates three progressive stages to enhance video quality without altering the base model:
1. Stage 1, Retrieval-Augmented Prompt Optimization (RAPO): User prompts are enriched using a relation graph that retrieves semantically related modifiers from training data. A fine-tuned LLM then refactors the enriched prompts to match the training-data style, and a discriminator selects the best candidate.
2. Stage 2, Sample-Specific Prompt Optimization (SSPO): During inference, prompts are iteratively refined in a closed feedback loop that evaluates generated videos with VLM verifiers and task-specific metrics (e.g., optical-flow or object-count checks). The feedback guides prompt rewriting toward better semantic alignment, temporal coherence, and physical realism.
3. Stage 3, LLM Fine-Tuning: Optimized prompt pairs collected from Stage 2 are used to fine-tune the rewriter LLM, internalizing optimization patterns and enabling high-quality prompt generation with less test-time computation.
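
The control flow of the three stages can be illustrated with a small, self-contained Python sketch. This is not the code from the RAPO repository: every name below (RELATION_GRAPH, stage1_rewrite, sspo_loop, collect_pair, and the toy generator, verifiers, and rewriter) is hypothetical. The video model, VLM verifiers, and rewriter LLM are replaced by trivial string-based stand-ins, and averaging the verifier scores with a fixed stopping threshold is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical toy relation graph: scene word -> modifiers from training prompts.
RELATION_GRAPH: Dict[str, List[str]] = {
    "waterfall": ["with turbulent splashes", "with mist rising"],
    "swimmer": ["with smooth arm strokes", "in gentle waves"],
}

def stage1_rewrite(prompt: str, graph: Dict[str, List[str]], k: int = 2) -> str:
    """Stage 1 (toy): append up to k modifiers retrieved for words in the prompt."""
    mods: List[str] = []
    for word in prompt.lower().replace(".", "").split():
        mods.extend(graph.get(word, []))
    base = prompt.rstrip(". ")
    return base + "".join(", " + m for m in mods[:k]) + "."

@dataclass
class SSPOResult:
    prompt: str                     # best prompt found
    score: float                    # its aggregated verifier score
    history: List[Tuple[str, float]] = field(default_factory=list)

def sspo_loop(prompt: str,
              generate: Callable[[str], str],
              verifiers: Dict[str, Callable[[str, str], float]],
              rewrite: Callable[[str, Dict[str, float]], str],
              max_iters: int = 3,
              target: float = 0.9) -> SSPOResult:
    """Stage 2 (toy): generate, score with every verifier, rewrite, repeat."""
    best_prompt, best_score = prompt, float("-inf")
    history: List[Tuple[str, float]] = []
    current = prompt
    for _ in range(max_iters):
        video = generate(current)
        scores = {name: fn(current, video) for name, fn in verifiers.items()}
        score = sum(scores.values()) / len(scores)  # assumed aggregation: mean
        history.append((current, score))
        if score > best_score:
            best_prompt, best_score = current, score
        if score >= target:                          # assumed stopping rule
            break
        current = rewrite(current, scores)
    return SSPOResult(best_prompt, best_score, history)

def collect_pair(user_prompt: str, result: SSPOResult) -> Dict[str, str]:
    """Stage 3 (toy): one (input, target) pair for fine-tuning the rewriter."""
    return {"input": user_prompt, "target": result.prompt}

# --- Stand-ins for the real T2V model, verifiers, and rewriter LLM ---
def fake_generate(prompt: str) -> str:
    return prompt  # the "video" is just the prompt text

def detail_verifier(prompt: str, video: str) -> float:
    return min(1.0, len(video.split()) / 12)

def motion_verifier(prompt: str, video: str) -> float:
    return 1.0 if "smooth" in video.lower() else 0.0

HINTS = {"detail": "in natural lighting with fine surface detail",
         "motion": "with smooth, continuous motion"}

def feedback_rewrite(prompt: str, scores: Dict[str, float]) -> str:
    weakest = min(scores, key=scores.get)  # address the lowest-scoring signal
    return prompt.rstrip(". ") + ", " + HINTS[weakest] + "."

if __name__ == "__main__":
    user_prompt = "Hand shaking salt shaker."
    start = stage1_rewrite(user_prompt, RELATION_GRAPH)
    result = sspo_loop(start, fake_generate,
                       {"detail": detail_verifier, "motion": motion_verifier},
                       feedback_rewrite)
    for step, (p, s) in enumerate(result.history, 1):
        print(f"iter {step}: score={s:.2f} prompt={p}")
    print(collect_pair(user_prompt, result))
```

In a real setup, fake_generate would call the T2V model, the verifiers would be VLM-based or metric-based checks such as the optical-flow and object-count signals mentioned above, and feedback_rewrite would be the rewriter LLM conditioned on the feedback. The pairs returned by collect_pair correspond to the optimized prompt pairs that Stage 3 uses for fine-tuning.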

Physics-aware Video Generation

This section shows physics-aware video generation results obtained with WanX2.1 as the base model.

Naive prompt (RAPO++ prompt)
Pouring milk into still tea. (Milk gracefully cascading into coffee, forming subtle, continuous wave-like ripples that blend harmoniously over time. The ripples grow gradually, maintaining a smooth and natural transition, as the milk gently merges with the coffee, mimicking real-world fluid dynamics.)
Cloth banner hanging from wooden twig facing the wind. (A cloth banner gently hangs from a wooden twig, exhibiting smooth, natural swings influenced by air currents and gravity. The banner displays subtle ripples and waves, maintaining a balanced equilibrium due to tension and gravity. The movements are realistic, with gentle oscillations in both horizontal and vertical directions, mimicking the interplay of wind and gravitational forces. The banner’s motion should smoothly transition without sudden reversals, and the amplitude and frequency of the swings should remain consistent with the described equilibrium state.)
Hand shaking salt shaker. (Hand gently shakes a salt shaker over a flat surface, initiating a smooth flow of salt grains that gradually accelerate as they overcome static friction. The grains scatter lightly, moving upward slightly before falling due to gravity, creating a natural, continuous motion pattern. The acceleration is subtle but noticeable, enhancing the realism of the grain flow.)
Peeler peels an apple. (A sharp peeler gently glides along the curved surface of the apple, systematically removing thin layers of skin. The peeler follows the fruit's contours with precision, applying minimal force to avoid tearing, resulting in smooth, even peeling.)
An electric beater whips cream in a bowl. (An electric beater gently and continuously whips cream in a bowl, incorporating air and thickening smoothly. The beater moves with a consistent rhythm, lifting cream from the bottom to the top and back, creating a gentle swirling motion with subtle, natural variations in direction and speed. The cream’s surface shows gentle frothing and a uniform texture, with occasional small pockets of foam forming and dissipating, reflecting the incorporation of air bubbles. The video captures the cream's thickening process realistically, with the beater’s motion causing slight splashes and droplets to rise and settle back into the mixture, enhancing the natural look of the whipping process.)
A waterfall cascades over jagged rocks. (A realistic waterfall cascades over jagged rocks, with water flowing smoothly downward and creating turbulent splashes as it hits the surfaces. The splashes disperse in all directions, with droplets flying upwards and sideways, mimicking the effects of gravity and fluid dynamics. Water surface tension and rock impacts are accurately portrayed, enhancing the sense of realism. The video captures natural turbulence and dispersion, with droplets forming arcs and plumes as they are pushed by air currents and gravity, ensuring a lifelike visual experience.)
A coffee pot pours a morning cup of joe. (A coffee pot smoothly pours steaming coffee into a cup, forming a steady, narrow stream guided by surface tension and viscosity. The coffee gently cascades into the cup without splashing, maintaining a consistent speed and direction due to gravity. The stream naturally wavers slightly, reflecting real-world variations in fluid dynamics, and gradually slows as it fills the cup, creating a realistic and soothing visual experience.)
A swimmer splashing in the sea water. (A swimmer moves gracefully in the sea, executing smooth arm strokes and coordinated leg kicks. The water reacts realistically, generating gentle waves and small splashes that naturally dissipate. The swimmer’s movements reflect the resistance of the water, with subtle yet noticeable ripples and splashes that enhance the sense of realism, especially during powerful kicks that create larger waves and splashes.)