The “Prompt Skeleton” You Can Reuse Forever: A 6-Field Template for Predictable AI Videos in Veo3Gen (Creators + Small Teams) (as of 2026-05-29)
A reusable 6-field AI video prompt template for Veo3Gen teams: Subject → Action → Scene → Camera → Lighting → Style, plus QA checklist + examples.
Why most creators’ prompts feel random (and why a fixed structure helps)
If your AI video outputs feel inconsistent, it’s often not because you “don’t know the right magic words.” It’s because your briefs are missing a stable structure.
An AI video prompt is simply a text instruction that guides a model to generate specific video content. (https://ltx.studio/blog/ai-video-prompt-guide) And for both text-to-video and image-to-video generation, a well-crafted prompt is what dictates what ends up on screen. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
The problem: most prompts are written like a stream-of-consciousness idea dump:
- subject + vibe, but no action
- action + scene, but no camera logic
- “cinematic” repeated five times, but no lighting or style anchor
A reusable “prompt skeleton” fixes this by making prompts predictable, reviewable, and team-friendly. You can standardize how you brief shots across creators, editors, and marketers—then iterate field-by-field instead of rerolling blindly.
The 6-field Prompt Skeleton (copy/paste template)
FlexClip summarizes a practical text-to-video structure as: Subject + Action + Scene + (Camera Movement + Lighting + Style). (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Here’s a Veo3Gen-friendly version you can reuse forever:
[SUBJECT]: (who/what is the focus; key descriptors)
[ACTION]: (what the subject is doing; single primary action)
[SCENE]: (where it happens; foreground/background elements)
[CAMERA]: (shot type + angle + movement; pacing)
[LIGHTING]: (time of day + quality + direction; mood)
[STYLE]: (visual aesthetic + tone + grade; realism level)
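For teams that keep briefs in scripts or shared tooling, the skeleton maps naturally onto a small record type. Below is a minimal Python sketch, assuming you store each field as plain text. `PromptSkeleton` and `render()` are hypothetical names for illustration, not part of any Veo3Gen SDK. The sketch only enforces the fixed field order and rejects empty fields.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PromptSkeleton:
    """Hypothetical container for the 6-field prompt skeleton."""
    subject: str
    action: str
    scene: str
    camera: str
    lighting: str
    style: str

    def render(self) -> str:
        # Emit one "[FIELD]: value" line per field, in declaration order
        # (Subject -> Action -> Scene -> Camera -> Lighting -> Style).
        lines = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if not value:
                raise ValueError(f"field '{f.name}' is empty")
            lines.append(f"[{f.name.upper()}]: {value}")
        return "\n".join(lines)


brief = PromptSkeleton(
    subject="A matte black stainless steel water bottle with a minimal white logo",
    action="A hand unscrews the lid and takes a sip",
    scene="Clean kitchen counter, light stone surface, soft blurred background",
    camera="Close-up product framing, slow push-in",
    lighting="Soft morning window light from the side, gentle shadows",
    style="Realistic, crisp commercial look, neutral color grade",
)
print(brief.render())
```

Keeping the six fields as separate values, rather than one long string, is what lets a reviewer point at a single field and change only that one.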
If you’re collaborating, treat these fields like a mini creative brief. When someone asks, “What should I change to get a tighter product shot?” you’ll know to adjust Camera and Lighting, not randomly rewrite the whole prompt.
How to fill each field (without overstuffing)
Subject: pick one clear primary focus
FlexClip defines Subject as the person/animal/object that’s the focus of the video. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Guidelines:
- Choose one primary subject (especially for short-form ads).
- Add only the descriptors that matter (material, color, age range, wardrobe, etc.).
Example Subject lines:
- “A matte black stainless steel water bottle with a minimal logo”
- “A tired remote worker in a hoodie, mid-30s”
Action: the core of the prompt
FlexClip calls Action the core because it drives the storyline. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Make the action concrete and filmable:
- “unscrews the lid and takes a sip”
- “pours iced coffee, condensation forming on the glass”
Scene: state the setting and what’s in frame
FlexClip defines Scene as where the action happens, including foreground/background elements. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
A useful Scene line includes:
- location + surface
- 1–3 key props
- background vibe (clean studio, messy kitchen, city street)
Camera: choose movement that supports the action
FlexClip describes camera movement as shot type/angle/movement that adds narrative and visual appeal—and movements can be combined. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
LTX Studio's guide adds that professional-level prompts can specify camera angles, lighting conditions, motion patterns, composition details, and stylistic choices. (https://ltx.studio/blog/ai-video-prompt-guide)
Keep it simple and aligned:
- If the action is “opening a box,” use a top-down or close-up with a slow push-in.
- If the action is “walking into sunlight,” use a tracking move.
Lighting: lock the mood early
Lighting can significantly impact mood and depth; FlexClip lists examples like warm light, morning light, spotlight, and backlighting. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Pick one primary lighting idea:
- “soft morning window light, gentle shadows”
- “dramatic backlight with rim highlights”
Style: define the aesthetic and tone
FlexClip describes Style as what sets the tone and mood; it can cover both the visual style and the emotional tone. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
This is where you decide:
- realism vs. illustration
- grade (warm, cool, neutral)
- energy (calm, punchy, moody)
Text-to-Video vs Image-to-Video: what changes in the skeleton
Text-to-video prompts benefit from the full 6 fields because you’re creating everything from scratch.
Image-to-video is different: the image already “locks” many details (subject design, wardrobe, composition), so you usually emphasize motion.
Image-to-Video: use motion-first structures
FlexClip gives an image-to-video single-action structure:
Subject + Action + Background + Background Movement + Camera Movement. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
And it describes Background Movement as subtle or dynamic environmental shifts that bring the scene to life. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Eachlabs also recommends being clear about what you want to see—subject, what they’re doing, the setting, and overall mood. (https://www.eachlabs.ai/blog/image-to-video-prompt-guide-best-practices-for-realistic-results)
How to adapt the 6 fields for image-to-video:
- Subject: keep short (“same subject as reference image”).
- Scene: often becomes Background (“keep background consistent”).
- Add Background Movement explicitly.
- Lighting/Style: only include if you want a change; otherwise keep minimal to avoid fighting the reference.
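The adaptation rules above can be captured in a small helper. This is a sketch under assumptions: `image_to_video_prompt` is a hypothetical function, the bracketed labels simply mirror FlexClip's Subject + Action + Background + Background Movement + Camera Movement order, and the default strings are the placeholder phrasings from the list. Lighting and Style are appended only when you pass a change.

```python
from typing import Optional


def image_to_video_prompt(
    action: str,
    background_movement: str,
    camera_movement: str,
    subject: str = "same subject as reference image",
    background: str = "keep background consistent",
    lighting_change: Optional[str] = None,
    style_change: Optional[str] = None,
) -> str:
    """Assemble a motion-first image-to-video prompt (hypothetical helper).

    Order follows Subject + Action + Background + Background Movement +
    Camera Movement; lighting/style lines are added only when a change is wanted.
    """
    parts = [
        f"[SUBJECT]: {subject}",
        f"[ACTION]: {action}",
        f"[BACKGROUND]: {background}",
        f"[BACKGROUND MOVEMENT]: {background_movement}",
        f"[CAMERA]: {camera_movement}",
    ]
    # Omitted by default so the text does not fight the reference image.
    if lighting_change:
        parts.append(f"[LIGHTING]: {lighting_change}")
    if style_change:
        parts.append(f"[STYLE]: {style_change}")
    return "\n".join(parts)


print(image_to_video_prompt(
    action="tilts slightly toward the camera",
    background_movement="curtains sway gently in the window behind",
    camera_movement="slow push-in",
))
```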
Multi-action image-to-video: sequence carefully
FlexClip also lists multi-action structures like:
- Subject 1 + Action 1 + Action 2 (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
- Subject 1 + Action 1 + Subject 2 + Action 2 ... (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
When you add actions, keep them chronological and physically compatible with the same camera setup.
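One way to keep multi-action prompts chronological is to write the beats as an ordered list and join them with an explicit “then”. The helper below is hypothetical and purely textual: it does not check physical compatibility, so the shared camera setup still needs a human pass.

```python
from typing import List, Tuple


def sequence_actions(beats: List[Tuple[str, str]]) -> str:
    """Join (subject, action) beats in chronological order (hypothetical helper).

    A repeated subject is not restated, so "Subject 1 + Action 1 + Action 2"
    reads naturally; a new subject is named when it takes over.
    """
    if not beats:
        raise ValueError("need at least one (subject, action) beat")
    clauses = []
    previous_subject = None
    for subject, action in beats:
        if subject == previous_subject:
            clauses.append(action)
        else:
            clauses.append(f"{subject} {action}")
        previous_subject = subject
    # "then" makes the intended order explicit to the model and to reviewers.
    sentence = ", then ".join(clauses)
    return sentence[0].upper() + sentence[1:] + "."


print(sequence_actions([
    ("a barista", "pours milk into the cup"),
    ("a barista", "slides the cup across the counter"),
    ("a customer", "picks it up"),
]))
```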
A 2-minute QA checklist before you generate (to reduce rerolls)
Eachlabs warns that too few details can cause the AI to guess wrong, while too many can confuse it. (https://www.eachlabs.ai/blog/image-to-video-prompt-guide-best-practices-for-realistic-results)
Use this quick QA pass to keep prompts tight:
Micro QA checklist
- Remove contradictions (e.g., “midday sun” + “night scene”).
- Keep one primary subject (or clearly label Subject 1/2).
- Align camera movement with the action (tracking for walking, push-in for reveal).
- Lock mood/style terms to 1–2 clear anchors.
- Trim extras that don’t change the shot (avoid overstuffing).
Also remember: even with the same prompt, results can vary because each generation is a new interpretation. (https://www.eachlabs.ai/blog/image-to-video-prompt-guide-best-practices-for-realistic-results)
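Parts of this checklist can be automated as a pre-flight lint. The sketch below assumes a skeleton stored as a dict of field name to text; `qa_check` and the `CONTRADICTIONS` pairs are hypothetical examples you would replace with your team's own vocabulary. It flags obvious contradictions, fields longer than two sentences, and repeated “cinematic”; it cannot judge whether the camera move actually fits the action.

```python
import re
from typing import Dict, List

# Hypothetical contradiction pairs; extend with terms your team actually uses.
CONTRADICTIONS = [
    ("midday sun", "night"),
    ("bright daylight", "night"),
    ("handheld", "locked-off"),
]


def _has_term(text: str, term: str) -> bool:
    # Whole-word match so "night" does not fire on "nightstand".
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def qa_check(skeleton: Dict[str, str], max_sentences: int = 2) -> List[str]:
    """Return warnings for a filled skeleton (field name -> text)."""
    warnings = []
    full_text = " ".join(skeleton.values()).lower()
    # 1) Contradictory terms anywhere in the prompt.
    for a, b in CONTRADICTIONS:
        if _has_term(full_text, a) and _has_term(full_text, b):
            warnings.append(f"possible contradiction: '{a}' vs '{b}'")
    # 2) Overstuffing: more than `max_sentences` sentences in one field.
    for name, text in skeleton.items():
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        if len(sentences) > max_sentences:
            warnings.append(f"{name}: {len(sentences)} sentences (max {max_sentences})")
    # 3) Unanchored style word repeated instead of specified.
    count = full_text.count("cinematic")
    if count > 1:
        warnings.append(f"'cinematic' appears {count} times; replace with concrete anchors")
    return warnings
```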
Two worked examples (same concept, rewritten with the skeleton)
Below are two “before → after” rewrites to show how the skeleton turns vibes into a shot plan.
Example 1: short-form UGC-style product moment
Vague prompt (hard to iterate):
“Make a TikTok ad about a water bottle that looks premium and refreshing, cinematic, nice lighting.”
Skeleton version (easier to tweak):
[SUBJECT]: A matte black stainless steel water bottle with a minimal white logo, cold to the touch
[ACTION]: A hand unscrews the lid; condensation beads; the bottle tilts as water pours into a clear glass with ice
[SCENE]: Clean kitchen counter, light stone surface, a lemon wedge and folded towel in the foreground, soft blurred background
[CAMERA]: Close-up product framing, slight top-down angle, slow push-in during the pour
[LIGHTING]: Soft morning window light from the side, gentle shadows, subtle highlights on metal
[STYLE]: Realistic, crisp commercial look, calm and refreshing mood, neutral color grade
How to iterate: if the bottle looks dull, adjust Lighting first (“add stronger specular highlights”); if it still looks flat, then adjust Camera (“rotate to catch reflections”). Leave Subject and Scene alone.
Example 2: cinematic b-roll establishing shot
Vague prompt (pretty, but undefined):
“Cinematic b-roll of a person walking in the city at night, moody and cool.”
Skeleton version (shot-by-shot clarity):
[SUBJECT]: A lone pedestrian in a dark coat, face partially obscured, walking with purpose
[ACTION]: Walks past storefront reflections; briefly glances to the side; exhales visible breath
[SCENE]: Rain-damp city sidewalk at night, neon signs reflected in puddles, light traffic bokeh in the distance
[CAMERA]: Medium shot from behind, slow tracking follow, slight handheld feel for natural motion
[LIGHTING]: Neon practicals and streetlights, backlight rim on the coat, high contrast with deep shadows
[STYLE]: Realistic cinematic tone, cool color palette, moody atmosphere
This version makes your intent testable: if the shot feels too frantic, slow the camera move; if it feels flat, adjust backlight and contrast.
Three ready-to-use skeletons (copy, then swap specifics)
1) UGC ad (simple, direct)
[SUBJECT]: (product) held by (person descriptor)
[ACTION]: demonstrates one key benefit in a single motion
[SCENE]: everyday setting that matches the audience use-case
[CAMERA]: handheld close-up, quick reframing to the benefit moment
[LIGHTING]: soft natural indoor light
[STYLE]: realistic, casual, friendly tone
2) Product beauty shot (clean and controlled)
[SUBJECT]: (product) with material/color details
[ACTION]: slow rotation or reveal; one hero moment
[SCENE]: minimal studio surface with 1–2 complementary props
[CAMERA]: macro/close-up, slow slider push-in, steady composition
[LIGHTING]: spotlight or softbox-style highlights; controlled shadows
[STYLE]: premium commercial look, clean grade
3) Talking-head b-roll cutaway (supports narration)
[SUBJECT]: hands + laptop/phone/notebook (or the environment as the subject)
[ACTION]: one loopable action (typing, scrolling, writing)
[SCENE]: desk setup with a few readable elements
[CAMERA]: over-the-shoulder or top-down; gentle movement
[LIGHTING]: warm desk lamp + subtle ambient fill
[STYLE]: realistic, documentary-lite, unobtrusive
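If you keep these skeletons in a shared script, storing them with named slots makes the “copy, then swap specifics” step repeatable. This is a minimal sketch: the `SKELETONS` keys and the `{product}`-style slots are hypothetical conventions standing in for the “(product)” placeholders above, and only the first two skeletons are included.

```python
from typing import Dict

# Two of the skeletons above, with "(product)"-style slots written as
# {placeholders}. Keys and slot names are illustrative, not a Veo3Gen format.
SKELETONS: Dict[str, Dict[str, str]] = {
    "ugc_ad": {
        "subject": "{product} held by {person}",
        "action": "demonstrates one key benefit in a single motion",
        "scene": "everyday setting that matches the audience use-case",
        "camera": "handheld close-up, quick reframing to the benefit moment",
        "lighting": "soft natural indoor light",
        "style": "realistic, casual, friendly tone",
    },
    "product_beauty": {
        "subject": "{product} with {material_details}",
        "action": "slow rotation or reveal; one hero moment",
        "scene": "minimal studio surface with 1-2 complementary props",
        "camera": "macro/close-up, slow slider push-in, steady composition",
        "lighting": "spotlight or softbox-style highlights; controlled shadows",
        "style": "premium commercial look, clean grade",
    },
}


def fill_skeleton(name: str, **specifics: str) -> Dict[str, str]:
    """Copy a stored skeleton and swap in the specifics for its slots."""
    template = SKELETONS[name]
    # str.format raises KeyError if a slot is left unfilled.
    return {field: text.format(**specifics) for field, text in template.items()}


ad = fill_skeleton("ugc_ad", product="a matte black water bottle", person="a runner in her 20s")
print(ad["subject"])
```

Fields without slots pass through unchanged, so only the parts you deliberately swap differ between briefs.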
Common mistakes when using a template (and quick fixes)
Mistake 1: adding every descriptor you can think of
If you stack too many constraints, you may confuse the model; too few can force it to guess. (https://www.eachlabs.ai/blog/image-to-video-prompt-guide-best-practices-for-realistic-results)
Fix: keep each field to one sentence (two max), and only include details that change the viewer’s understanding of the shot.
Mistake 2: mismatch between action and camera
A “fast sprint” paired with a “locked-off static wide shot” can be fine—but if your intent is intensity, the camera should support it.
Fix: rewrite Camera after you finalize Action, not before.
Mistake 3: style words with no anchors
“Cinematic” alone is vague. Guidance for professional-level prompts includes specifying camera angles, lighting, motion patterns, composition, and stylistic choices. (https://ltx.studio/blog/ai-video-prompt-guide)
Fix: pick 1–2 concrete style anchors (realistic vs. anime; warm vs. cool; high contrast vs. soft).
Mistake 4: expecting identical reruns
Even with the same prompt, each generation can differ. (https://www.eachlabs.ai/blog/image-to-video-prompt-guide-best-practices-for-realistic-results)
Fix: iterate in small steps and change one field at a time so you learn what actually moved the output.
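To make “one field at a time” enforceable, you can keep each iteration as a copy and diff it against the previous one. The sketch assumes skeletons are plain dicts keyed by field name; `next_iteration` and `changed_fields` are hypothetical helpers, and the values echo Example 1's Lighting tweak.

```python
from typing import Dict, List

FIELD_ORDER = ["subject", "action", "scene", "camera", "lighting", "style"]


def changed_fields(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List skeleton fields whose text differs between two iterations."""
    return [f for f in FIELD_ORDER if before.get(f, "").strip() != after.get(f, "").strip()]


def next_iteration(current: Dict[str, str], field: str, new_text: str) -> Dict[str, str]:
    """Return a copy of `current` with exactly one field rewritten."""
    if field not in FIELD_ORDER:
        raise KeyError(f"unknown field: {field}")
    updated = dict(current)  # the previous iteration stays untouched
    updated[field] = new_text
    return updated


v1 = {
    "subject": "A matte black stainless steel water bottle with a minimal white logo",
    "action": "A hand unscrews the lid; water pours into a clear glass with ice",
    "scene": "Clean kitchen counter, light stone surface",
    "camera": "Close-up product framing, slow push-in during the pour",
    "lighting": "Soft morning window light from the side, gentle shadows",
    "style": "Realistic, crisp commercial look, neutral color grade",
}
v2 = next_iteration(v1, "lighting", v1["lighting"] + ", stronger specular highlights on metal")
print(changed_fields(v1, v2))
```

If `changed_fields` ever returns more than one name, you changed several things at once and won't know which one moved the output.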
FAQ
What’s the best AI video prompt template for beginners?
A simple, consistent structure like Subject → Action → Scene → Camera → Lighting → Style keeps prompts readable and makes iteration easier. FlexClip presents a similar formula for text-to-video prompts. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Do I need to include camera and lighting every time?
Not always. LTX Studio lists camera angles and lighting conditions among the specifics that professional-level prompts can include. (https://ltx.studio/blog/ai-video-prompt-guide) If you’re getting “almost right” results, camera and lighting are often the easiest fields to refine.
How is image-to-video prompting different?
Image-to-video is usually motion-focused. FlexClip shares an image-to-video structure that emphasizes background movement and camera movement. (https://help.flexclip.com/en/articles/10326783-how-to-write-effective-text-prompts-to-generate-ai-videos)
Why does the same prompt produce different videos?
Because each generation is a new interpretation; you shouldn’t assume identical results across runs. (https://www.eachlabs.ai/blog/image-to-video-prompt-guide-best-practices-for-realistic-results)
Turn this skeleton into a repeatable workflow
If you want to generate videos programmatically and standardize prompts across a small team, explore the Veo3Gen endpoints on the API page. When you’re ready to estimate usage and scale up, see pricing to pick a plan that fits your iteration rhythm.