
AI Video Model Evaluation

Emmy-Level Visual Judgment for Frontier AI

I bring 20+ years of broadcast-level editorial judgment to AI video model evaluation — ranking outputs, calibrating ground truth, and providing human feedback that makes generative models better.

Evaluation experience with:
Kling · Runway · Sora · Veo · WAN 2.1

What I Evaluate

Nine dimensions of generative video quality — assessed with the frame-level precision of a broadcast editor who has spent 20 years knowing exactly when motion fails.

🎞️

Temporal Coherence

Does motion flow smoothly frame-to-frame? Detecting stuttering, warping, and temporal inconsistencies that break the viewer's sense of continuity.

🎯

Prompt Adherence

Does the output match the prompt? Evaluating subject, action, style, mood, and compositional accuracy at every level of semantic granularity.

🏃

Motion Realism

Do objects, people, and environments move naturally? Assessing physics plausibility, weight, momentum, and the subtle cues that signal real-world dynamics.

💡

Lighting Consistency

Are light sources, shadows, and exposure consistent throughout the clip? Flagging lighting drift, impossible shadows, and exposure discontinuities.

🎬

Camera Language

Shot composition, framing, camera movement, and cinematic grammar — evaluated by a working editor who understands why every shot choice matters.

⚠️

Visual Artifacts

Detecting flickering, morphing, blurs, compression issues, and generation failures — the full taxonomy of ways AI video models visibly break down.

👤

Character Continuity

Identity, clothing, and feature consistency across frames. Spotting identity drift, morphing faces, and costume changes that undermine narrative coherence.

✂️

Editorial Logic

Does the clip have narrative structure, pacing, and editorial intent? Evaluating whether the output reads as purposeful cinema or random generation.

⚖️

Comparative Model Ranking

Side-by-side A/B evaluation and multi-model ranking across all of the above dimensions — generating structured preference data for RLHF training pipelines.


How I Work

Structured evaluation workflows built for scale — from single-clip annotation to large-batch dataset QA.

📋

Rubric-Based Scoring

Structured evaluation frameworks with defined criteria and weighted scores for consistency across large datasets. Every judgment is traceable, reproducible, and calibration-ready.
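As a rough illustration of what weighted rubric scoring can look like in code, here is a minimal Python sketch. The dimension keys, weights, and the 1-to-5 rating scale are assumptions made for the example, not a client's actual rubric.

```python
# Minimal sketch of a weighted rubric score. Dimension names, weights,
# and the 1-5 rating scale are illustrative assumptions.
RUBRIC_WEIGHTS = {
    "temporal_coherence": 0.20,
    "prompt_adherence": 0.20,
    "motion_realism": 0.15,
    "lighting_consistency": 0.10,
    "camera_language": 0.10,
    "visual_artifacts": 0.10,
    "character_continuity": 0.10,
    "editorial_logic": 0.05,
}

def weighted_score(ratings: dict[str, float]) -> float:
    """Combine per-dimension ratings (1-5 scale) into one weighted score."""
    assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(RUBRIC_WEIGHTS[dim] * ratings[dim] for dim in RUBRIC_WEIGHTS)

# Example: a clip rated 4.0 everywhere except a visible flicker problem.
clip_ratings = {dim: 4.0 for dim in RUBRIC_WEIGHTS}
clip_ratings["visual_artifacts"] = 2.0
print(round(weighted_score(clip_ratings), 2))  # 3.8
```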

🔄

RLHF-Style Ranking

Human preference feedback through pairwise comparisons, preference selection, and model output ranking — generating clean, consistent signal for reinforcement learning from human feedback pipelines.
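A minimal sketch of what a single pairwise preference record might look like as structured data. The field names and values are illustrative assumptions, not any specific platform's schema.

```python
# Minimal sketch of a pairwise preference record for an RLHF-style pipeline.
# Field names and values are illustrative assumptions.
from dataclasses import dataclass, asdict
import json

@dataclass
class PreferencePair:
    prompt: str
    clip_a: str             # path or ID of the first generated clip
    clip_b: str             # path or ID of the second generated clip
    preferred: str          # "a", "b", or "tie"
    reason_tags: list[str]  # quality/failure tags explaining the judgment

record = PreferencePair(
    prompt="Handheld tracking shot of a sprinter leaving the blocks at dusk",
    clip_a="model_x/clip_0412.mp4",
    clip_b="model_y/clip_0412.mp4",
    preferred="a",
    reason_tags=["better_motion_realism", "clip_b_identity_drift"],
)
print(json.dumps(asdict(record), indent=2))
```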

🔍

Failure Mode Tagging

Systematic identification and categorization of generation failures for training signal improvement — turning model breakdowns into structured, actionable annotation data.
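As a sketch, a failure-mode taxonomy can be expressed as a small enumeration so every flagged breakdown lands in a machine-readable category. The labels and annotation fields below are illustrative assumptions that mirror the dimensions listed above.

```python
# Minimal sketch of a failure-mode taxonomy for tagging generation errors.
# Labels and annotation fields are illustrative assumptions.
from enum import Enum

class FailureMode(str, Enum):
    TEMPORAL_STUTTER = "temporal_stutter"
    WARPING = "warping"
    IDENTITY_DRIFT = "identity_drift"
    LIGHTING_DRIFT = "lighting_drift"
    IMPOSSIBLE_SHADOW = "impossible_shadow"
    FLICKER = "flicker"
    MORPHING = "morphing"
    PROMPT_MISMATCH = "prompt_mismatch"
    PHYSICS_VIOLATION = "physics_violation"

# A tagged clip becomes structured annotation data, not a free-text note.
annotation = {
    "clip_id": "batch_07/clip_0193.mp4",
    "failure_modes": [FailureMode.FLICKER.value, FailureMode.IDENTITY_DRIFT.value],
    "severity": 3,       # e.g. 1 (cosmetic) to 5 (unusable)
    "timestamp_s": 2.4,  # where in the clip the failure first appears
}
print(annotation)
```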

🔗

Prompt-Output Alignment

Evaluating semantic adherence between input prompts and generated video at multiple granularity levels — from broad intent to frame-level compositional accuracy.

🎯

Ground Truth Calibration

Establishing reference benchmarks and calibrating evaluation standards for training data QA — ensuring annotator consistency at the start of every engagement.
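One simple way to picture a calibration check: every annotator scores a shared reference set, and their scores are compared against the calibrated benchmark before batch work begins. The clip IDs, scores, and tolerance below are illustrative assumptions.

```python
# Minimal sketch of a calibration check against a reference ("ground truth")
# set. Clip IDs, scores, and the 0.5-point tolerance are illustrative.
reference_scores = {"clip_001": 4.5, "clip_002": 2.0, "clip_003": 3.5}
annotator_scores = {"clip_001": 4.0, "clip_002": 2.5, "clip_003": 3.5}

deviations = [
    abs(annotator_scores[c] - reference_scores[c]) for c in reference_scores
]
mean_abs_dev = sum(deviations) / len(deviations)

print(f"mean absolute deviation: {mean_abs_dev:.2f}")
if mean_abs_dev <= 0.5:
    print("annotator is within calibration tolerance")
else:
    print("annotator needs a calibration review before batch work")
```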

✅

Quality Assurance Workflows

End-to-end dataset review, batch QA, and quality control for AI training pipelines — from individual clip annotation to full-batch consistency audits.


Why My Background Matters

Editorial intuition built over decades of broadcast production — applied to one of the hardest problems in generative AI: knowing exactly when video looks wrong.

20+ Years Editing
50K+ Assets Evaluated
2× Emmy Winner
4 Olympic Games
📷

Broadcast-Trained Eye

20+ years cutting elite sports content for Olympic Channel, ESPN, Sky Sports, and BT Sport. I know what good motion looks like at the frame level — and exactly when AI gets it wrong. That precision is rare, and it's what makes evaluation data reliable.

🏆

2× Sports Emmy Winner · Judge Since 2017

Two Sports Emmy Awards, plus serving as an Emmy judge since 2017. The same attention to timing, composition, and visual continuity that wins awards is exactly what I apply to evaluation — the bar isn't "does it look okay," it's "does it look right."

🤖

AI Tools Expertise

Active user of Kling, Runway, Veo, WAN 2.1, Sora, and other generative video models. I understand how these systems fail and why — which means I can identify the failure modes that matter for training, not just flag anything that looks odd.

🎓

Certified AI Specialist

Certified AI Video Specialist and Senior AI Evaluation Lead with experience across annotation platforms, RLHF workflows, and model feedback systems — bridging the gap between editorial instinct and structured data pipelines.


Target Domains

Organizations building, training, or evaluating generative video at the frontier.

Generative Video Models

Runway · Sora / OpenAI · Kling · Veo / Google DeepMind · WAN 2.1

Foundation Model Labs

xAI · OpenAI · Anthropic · Google DeepMind · Meta AI

Data & Annotation Platforms

Scale AI · Mercor · Outlier · Appen · Surge AI

AI Production & Research

AI Production Tools · Research Institutions · Academic Labs


Ready to Improve Your Video Model?

I'm available for evaluation projects, annotation contracts, and human feedback engagements — short-term and ongoing.

marc.warfield@dynamicvibe.net