Introduction
Text to Video models transform written descriptions of scenes into video. Effective prompts describe what appears in the frame and how those elements move through the scene using direct, clear language.
| Prompt: A raccoon in a plain room in zero gravity trying to steal the garbage from a silver trash can. The garbage floats out in zero gravity. Handheld documentary film style. Natural camera shake. Raw indie film aesthetic. Natural lighting. Unpolished, authentic look. Low budget realism. Observational feel. |
This guide builds on knowledge outlined in our Introduction to Prompting guide by introducing concepts specific to Text to Video, and is currently optimized for the newest Gen-4.5 model.
After completing this guide, you will understand how to create Text to Video prompts that produce videos matching your creative intent.
Core prompt elements
Effective Text to Video prompts contain at least two essential elements:
- Visual descriptions: what we see, where it is, and how it looks
- Motion descriptions: how the scene moves and behaves
These two elements may encompass multiple components of video:
| Visual components | Motion components |
|
|
Do I need to include every component in my prompt?
No. Omitting a component gives the model creative freedom to decide that part of your video. We recommend starting with a simple prompt that focuses on the most critical visual and motion components, then adding detail to refine as needed.
Should I use keywords or natural language?
Both work, but natural language usually gives you more control. When you write in full sentences, you provide context that helps guide exactly how elements appear in your scene.
Keywords are useful for setting a general direction—think of them as suggestions. The model will include what you've specified, but interpretation and incorporation may vary between generations.
Prompt structure & organization
You don’t need to follow a strict formula to generate great results. Structure and order are far less important than clearly conveying an idea and reducing ambiguity.
However, establishing an organization method can assist with effectively conveying ideas and make future iteration easier. We recommend trying this structure if you’re new to generative media:
[Camera] shot of [a subject/object] [action] in [environment]. [Supporting component descriptions]
Examples of prompts that follow a similar structure:
| Medium shot of a cowboy perched on a horse in a dusty environment. The horse rears violently, its body twisting, causing the cowboy to lose his seat and begin to fall off to the left. Backlit, western epic, cinematic, high contrast, golden hour, dusty, warm amber, deep orange, rich brown, atmospheric, dramatic backlighting, rim light, silhouette, soft glow, high contrast shadows. |
| POV from the perspective of a bear approaching a hollowed out tree filled with honeycomb. The bear paws reach out from the bottom of the frame and grab a handful of honey. Handheld documentary film style. Natural camera shake. Raw indie film aesthetic. Natural lighting. Unpolished, authentic look. Low budget realism. Observational feel. |
| A handheld low angle tracking shot, with low contrast and fast-paced motion, follows a skilled astronaut skateboarder on a moon landscape. Their movements blur against the soft glow of the dark lunar environment. Film grain, low contrast, black and white. |
For more prompt examples and their outputs, please see our Camera Terms, Prompts, & Examples.
How does using an organization structure impact iterating?
Using a consistent prompt structure makes iteration much faster and easier. Rather than hunting through scattered details to find what you need to change, you'll know exactly where each component lives.
This means you can quickly adjust specific elements without accidentally disrupting other parts of your prompt or missing details buried in the middle of a paragraph.
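One way to picture this is to keep each component in its own named field and assemble the prompt from them. The sketch below is only an illustration of the suggested structure: `VideoPrompt` and `render` are hypothetical names invented here, not part of any product API, and the output is just a string you would paste into the prompt field.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VideoPrompt:
    """Hypothetical container mirroring the suggested prompt structure."""

    camera: str
    subject: str
    action: str
    environment: str
    supporting: tuple[str, ...] = ()

    def render(self) -> str:
        # "[Camera] shot of [a subject/object] [action] in [environment]."
        text = f"{self.camera} shot of {self.subject} {self.action} in {self.environment}."
        if self.supporting:
            # Each supporting description becomes its own short sentence.
            text += " " + " ".join(s.rstrip(".") + "." for s in self.supporting)
        return text


first = VideoPrompt(
    camera="Medium",
    subject="a cowboy",
    action="perched on a horse",
    environment="a dusty environment",
    supporting=("Golden hour", "Dramatic backlighting"),
)

# Iterate on one component without touching the rest.
second = replace(first, camera="Low angle")

print(first.render())
print(second.render())
```

Because every component has a fixed place, changing the camera between generations is a one-field edit, and the rest of the prompt is guaranteed to stay the same.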
How long should a Text to Video prompt be?
There's no ideal prompt length—focus on clarity rather than word count. Overly long prompts can create conflicting requests and may limit the creative freedom that produces surprising, dynamic shots.
However, Text to Video prompts are typically longer than Image to Video prompts because they need to describe both the visual elements and the motion in detail.
Is the beginning of my prompt prioritized?
No. The order in which elements are introduced in a prompt does not matter.
Advanced techniques
Timestamp prompting
Timestamp prompting helps control when specific actions occur by specifying timestamps for each action. Though it may not be perfectly precise, it effectively guides the general sequence and timing.
| A needle-felted orange and white Corgi character wearing a yellow, green, and orange sweater stands in a grocery store aisle, initially facing away from the lens. The Corgi abruptly turns its head to face the camera, triggering a rapid crash zoom directly into its shiny black bead eyes as it squints suspiciously and its woolly brow furrows deeply. The background features blurred shelves stocked with colorful red and blue products under bright, linear fluorescent ceiling lights. The lighting is soft and diffuse, highlighting the fuzzy, fibrous texture of the felted wool against the bokeh of the supermarket. [00:00 through 00:02] looking away, then turns towards camera |
As seen in the example above, we recommend pairing timestamp prompts with a natural language prompt for the most control.
For accuracy, consider how long an action would reasonably take to complete. For example, if you're generating a video of someone walking across a room, setting timestamps 00:00 through 00:00.5 wouldn't give enough time for the action to unfold naturally—a more realistic timeframe might be 00:00 to 00:03.
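If you assemble prompts programmatically, a small helper can format timestamp lines consistently and flag actions that have too little time to play out. This is a minimal sketch under stated assumptions: the function names, the sample scene description, and the `min_seconds` threshold are all hypothetical choices made for illustration, not model requirements. Only the `[MM:SS through MM:SS]` format is taken from the example above.

```python
def to_timestamp(seconds: int) -> str:
    """Format whole seconds as MM:SS, matching the style used in this guide."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def timestamp_line(start: int, end: int, action: str, min_seconds: int = 1) -> str:
    """Build one '[MM:SS through MM:SS] action' line.

    min_seconds is a hypothetical sanity threshold, not a model requirement.
    """
    if end - start < min_seconds:
        raise ValueError(f"{action!r} has only {end - start}s to unfold")
    return f"[{to_timestamp(start)} through {to_timestamp(end)}] {action}"


def build_prompt(description: str, beats: list[tuple[int, int, str]]) -> str:
    """Pair a natural language description with one timestamp line per beat."""
    lines = [description]
    lines += [timestamp_line(start, end, action) for start, end, action in beats]
    return "\n".join(lines)


# Illustrative scene: walking across a room is given three seconds.
prompt = build_prompt(
    "A person walks across a sunlit living room. Static wide shot.",
    [(0, 3, "walks from the doorway to the window")],
)
print(prompt)
```

The check only catches obviously short windows. Whether a given window is realistic for a specific action still needs your own judgment.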
FAQ
Does Text to Video have advantages over Image to Video?
Text to Video allows you to generate footage directly without first creating an image. This works well when exact character or scene consistency isn't the priority—like creating B-roll, stock effects, or background plates.
Because Text to Video isn't constrained to a specific starting scene (like Image to Video is), it can handle complex motion sequences more effectively.
How do I ensure that a visual or motion component occurs in my shot?
If a component isn't present or executed in an initial generation, try iterating your prompt to reinforce it through natural language. Iteration is a normal part of the process when working with generative media, much like the drafting phases of other creative processes.
For example, if prompting "High angle of a koi fish pond" did not produce a high angle in our first generation, we would reinforce the angle by iterating with "High angle looking down at a koi fish pond."