Evaluation Criteria
Nine dimensions of generative video quality — assessed with the frame-level precision of a broadcast editor who has spent 20 years spotting exactly when motion fails.
Does motion flow smoothly frame-to-frame? Detecting stuttering, warping, and temporal inconsistencies that break the viewer's sense of continuity.
Does the output match the prompt? Evaluating subject, action, style, mood, and compositional accuracy at every level of semantic granularity.
Do objects and people move naturally? Assessing physical plausibility, weight, momentum, and the subtle cues that signal real-world dynamics.
Are light sources, shadows, and exposure consistent throughout the clip? Flagging lighting drift, impossible shadows, and exposure discontinuities.
Shot composition, framing, camera movement, and cinematic grammar — evaluated by a working editor who understands why every shot choice matters.
Detecting flickering, morphing, blurs, compression issues, and generation failures — the full taxonomy of ways AI video models visibly break down.
Identity, clothing, and feature consistency across frames. Spotting identity drift, morphing faces, and costume changes that undermine narrative coherence.
Does the clip have narrative structure, pacing, and editorial intent? Evaluating whether the output reads as purposeful cinema or random generation.
Side-by-side A/B evaluation and multi-model ranking across all of the above dimensions — generating structured preference data for RLHF training pipelines.
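As a concrete illustration of how these dimensions can translate into structured annotation data, here is a minimal sketch of a weighted per-clip scoring record. The dimension keys, weights, and 1-5 scale are illustrative assumptions for this sketch, not a fixed production schema.

```python
# Minimal sketch of a weighted per-clip evaluation record.
# Dimension names, weights, and the 1-5 scale are assumptions for illustration.

DIMENSION_WEIGHTS = {
    "temporal_coherence": 0.15,
    "prompt_adherence": 0.15,
    "motion_realism": 0.12,
    "lighting_consistency": 0.10,
    "cinematography": 0.10,
    "visual_artifacts": 0.13,
    "character_consistency": 0.10,
    "narrative_structure": 0.05,
    "comparative_preference": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (1-5) into one weighted clip score."""
    total_weight = sum(DIMENSION_WEIGHTS[d] for d in scores)
    return sum(DIMENSION_WEIGHTS[d] * s for d, s in scores.items()) / total_weight

# Example: a clip annotated on four of the nine dimensions.
clip_scores = {
    "temporal_coherence": 4.0,
    "prompt_adherence": 3.5,
    "motion_realism": 2.5,
    "visual_artifacts": 3.0,
}
print(round(overall_score(clip_scores), 2))  # weighted mean over the scored dimensions
```

Normalizing by the weights actually scored keeps partially annotated clips comparable to fully scored ones.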
Methodology
Structured evaluation workflows built for scale — from single-clip annotation to large-batch dataset QA.
Structured evaluation frameworks with defined criteria and weighted scores for consistency across large datasets. Every judgment is traceable, reproducible, and calibration-ready.
Human preference feedback through pairwise comparisons, preference selection, and model output ranking — generating clean, consistent signal for reinforcement learning from human feedback pipelines.
Systematic identification and categorization of generation failures for training signal improvement — turning model breakdowns into structured, actionable annotation data.
Evaluating semantic adherence between input prompts and generated video at multiple granularity levels — from broad intent to frame-level compositional accuracy.
Establishing reference benchmarks and calibrating evaluation standards for training data QA — ensuring annotator consistency at the start of every engagement.
End-to-end dataset review, batch QA, and quality control for AI training pipelines — from individual clip annotation to full-batch consistency audits.
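To make the pairwise-preference and failure-categorization workflows above concrete, here is a minimal sketch of the kind of record they can produce. Field names, the failure taxonomy, and the example values are assumptions for this sketch, not a prescribed schema.

```python
# Illustrative record for pairwise preference feedback with failure tagging.
# Field names, taxonomy entries, and example values are hypothetical.
from dataclasses import dataclass, field

FAILURE_MODES = (
    "flicker", "morphing", "identity_drift", "lighting_drift",
    "physics_violation", "compression_artifact", "prompt_mismatch",
)

@dataclass
class PreferenceRecord:
    prompt: str
    clip_a: str                 # path or ID of the first generated clip
    clip_b: str                 # path or ID of the second generated clip
    preferred: str              # "a", "b", or "tie"
    rationale: str              # short annotator note explaining the judgment
    failure_tags: list[str] = field(default_factory=list)

record = PreferenceRecord(
    prompt="A cyclist crests a hill at sunset, tracking shot",
    clip_a="model_x/0142.mp4",
    clip_b="model_y/0142.mp4",
    preferred="a",
    rationale="Clip B shows identity drift on the rider and a lighting jump midway.",
    failure_tags=["identity_drift", "lighting_drift"],
)
```

Records in roughly this shape can feed RLHF preference datasets directly, while the failure tags double as the categorized error signal used for training-data improvement.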
Credentials
Editorial intuition built over decades of broadcast production — applied to the hardest problem in generative AI: knowing exactly when video looks wrong.
20+ years cutting elite sports content for Olympic Channel, ESPN, Sky Sports, and BT Sport. I know what good motion looks like at the frame level — and exactly when AI gets it wrong. That precision is rare, and it's what makes evaluation data reliable.
Two Sports Emmy Awards, plus serving as an Emmy judge since 2017. The same attention to timing, composition, and visual continuity that wins awards is exactly what I apply to evaluation — the bar isn't "does it look okay," it's "does it look right."
Active user of Kling, Runway, Veo, WAN 2.1, Sora, and other generative video models. I understand how these systems fail and why — which means I can identify the failure modes that matter for training, not just flag anything that looks odd.
Certified AI Video Specialist and Senior AI Evaluation Lead with experience across annotation platforms, RLHF workflows, and model feedback systems — bridging the gap between editorial instinct and structured data pipelines.
Relevant For
Organizations building, training, or evaluating generative video at the frontier.
I'm available for evaluation projects, annotation contracts, and human feedback engagements — short-term and ongoing.
marc.warfield@dynamicvibe.net