Case Study: A Defensible Implementation of GenAI for Bounded Observational Tasks in Video Analysis

Architects and engineers building complex systems are navigating a period of intense hype and justifiable skepticism. They are being inundated with the mandate to "put AI on it," often by stakeholders who see Generative AI as a magical black box that can solve any problem. The result, more often than not, is a system that is non-deterministic, unprovable, and fundamentally untrustworthy. We see LLMs being asked to calculate physics, generate metrics from thin air, and make quantitative assessments they are architecturally incapable of performing accurately. These implementations are indefensible.

This trend creates a dangerous skepticism, leading us to believe that GenAI has no place in systems that demand precision and integrity. This is a mistake. The failure is not in the tool, but in the application. The future of robust AI systems lies not in replacing deterministic code with generative models, but in surgically integrating them to solve problems that are, paradoxically, immensely complex for traditional code to handle.

Our implementation of "handedness determination" is a case study in this approach. While it appears to be a simple query to our powerful multimodal model, architecturally it represents a mature and highly defensible implementation strategy.

The "Trivial" Problem That Isn't: Deterministically Coding Handedness

Consider the engineering challenge: write a deterministic function that takes a video of an athlete and returns "left_handed" or "right_handed".

Your first thought might be to use pose estimation data. A simple approach for a pitcher would be to find the hand moving fastest at the point of release. But what about a hitter? You'd need to analyze the grip. A right-handed batter places their right hand above their left. A right-handed golfer places their right hand below their left. Already, we have sport-specific rules.

This complexity explodes when the system must be agnostic to a growing library of sports and actions. The logic for a baseball_pitching_action is fundamentally different from a softball_pitching_action (overhand vs. windmill). The heuristics for a golf_full_swing_action are useless for a golf_putting_stroke. The deterministic approach would require a hard-coded, multi-level dictionary of rules, mapping each unique action_key to a specific set of anatomical heuristics and joint-velocity conditions. Every new sport or action added to our platform would necessitate a new engineering cycle to update this brittle logic, violating the principle of an agnostic analysis engine.
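To make that brittleness concrete, here is a minimal sketch of what the dictionary-of-heuristics approach starts to look like. The action keys come from the examples above; the heuristic functions, keys not named in this post, and thresholds are hypothetical, not our production code.

```python
# Hypothetical sketch of the deterministic approach: every action_key needs its
# own hand-coded heuristic, and every new sport means another entry and another
# engineering cycle. Function names and rules are illustrative only.

def pitching_release_hand(pose_frames):
    # Heuristic: the hand moving fastest at the point of release is the throwing hand.
    ...

def batting_grip_order(pose_frames):
    # Heuristic: a right-handed batter grips with the right hand above the left.
    ...

def golf_grip_order(pose_frames):
    # Heuristic: a right-handed golfer grips with the right hand below the left.
    ...

HANDEDNESS_HEURISTICS = {
    "baseball_pitching_action": pitching_release_hand,
    "softball_pitching_action": pitching_release_hand,  # windmill delivery already strains the release assumption
    "baseball_hitting_action": batting_grip_order,      # illustrative key
    "golf_full_swing_action": golf_grip_order,
    "golf_putting_stroke": None,                        # grip heuristics are useless here; needs yet another rule
    # ...every new sport or action added to the platform means another entry
}

def determine_handedness_deterministically(action_key, pose_frames):
    heuristic = HANDEDNESS_HEURISTICS.get(action_key)
    if heuristic is None:
        raise NotImplementedError(f"No handedness heuristic for {action_key}")
    return heuristic(pose_frames)
```

And this sketch doesn't yet account for camera angle, which multiplies every entry again.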

Now, add the complexity of camera angles.

  • Face-On View: Easy. The grip and throwing arm are clearly visible.
  • Down-the-Line / Centerfield View: The body is sideways or facing away. The hands may be occluded during the grip. The throwing arm is the one farther from the camera, but is it always? What if the athlete has an unusual follow-through?
  • Edge Cases: What about switch-hitters? What about a trick-shot video?

The deterministic solution quickly devolves into a sprawling, brittle state machine: a complex web of if/else statements trying to account for every possible sport, action, camera angle, and player variation. The engineering overhead to build, test, and maintain this is enormous, and its accuracy would still be questionable.

The GenAI Solution: A Surgical Strike

Our solution bypasses this complexity entirely. In the orchestrator, after an action has been validated but before it's sent for heavy processing, we make a single, fast, and hyper-focused call to our multimodal model.

The Function: HANDEDNESS_DETERMINATION
The Context: The full video file and a JSON snippet telling the model the precise time_range of the action to analyze (e.g., "0:17 - 0:23").
The Question: "For the action in this time window, is the athlete left_handed or right_handed?"

The model, with its native understanding of video, doesn't need a complex state machine. It sees the grip, the throwing arm, the context of the sport, and the direction of motion all at once. It performs a high-level observational task, exactly the kind of task it excels at, and returns a single piece of metadata.
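In code, the orchestrator step is little more than the following sketch. The client object, its generate method, and the exact prompt wording are placeholders for whichever multimodal API sits behind it; only the shape of the call matters here.

```python
# Sketch of the HANDEDNESS_DETERMINATION call in the orchestrator.
# `model_client` and `generate` are placeholders, not a specific vendor SDK.
import json

def determine_handedness(model_client, video_uri: str, time_range: str) -> str:
    context = json.dumps({
        "function": "HANDEDNESS_DETERMINATION",
        "time_range": time_range,  # e.g. "0:17 - 0:23", the validated action window
    })
    question = (
        "For the action in this time window, is the athlete left_handed or right_handed? "
        "Answer with exactly one of: left_handed, right_handed."
    )
    response = model_client.generate(video=video_uri, prompt=context + "\n" + question)
    return response.strip().lower()  # e.g. "left_handed"; validated downstream
```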

Why This is Architecturally Sound and Defensible

This isn't another "let the AI figure it out" implementation. It is a robust engineering solution precisely because it is bounded, it is verifiable, and it respects the architectural separation of concerns.

1. It is Bounded and Specific.
The AI's responsibility is microscopic. It is not asked to calculate peak_arm_angular_velocity. It is asked to perform a simple classification task. The output is not a complex narrative or a set of numbers; it is a single, verifiable string: "left_handed". This narrow scope dramatically reduces the "surface area" for hallucination or error.
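One way to hold the model to that contract is sketched below, under the assumption that the raw model reply arrives as a plain string: anything outside the two allowed values is rejected before it can reach the rest of the pipeline.

```python
# The output contract is two strings and nothing else. Anything outside that
# vocabulary is an error, never something the pipeline tries to interpret.
from enum import Enum

class Handedness(Enum):
    LEFT = "left_handed"
    RIGHT = "right_handed"

def parse_handedness(raw_model_output: str) -> Handedness:
    token = raw_model_output.strip().lower()
    for member in Handedness:
        if member.value == token:
            return member
    raise ValueError(f"Model output outside the allowed vocabulary: {raw_model_output!r}")
```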

2. It is Measurable and Verifiable.
This is the most critical point for us as engineers. We are not blindly trusting the model. We have built-in mechanisms to audit its accuracy:

  • Cross-Validation: Our later Analyzability pass also determines handedness. We can easily log any discrepancies between the two AI calls, providing an immediate, automated accuracy metric for our specialized handedness model.
  • Downstream Sanity Checks: If the handedness model fails and tells the physics engine that a left-handed pitcher is right-handed, the engine will produce nonsensical data (e.g., the "glove arm" moving at 90 mph). This would cause the physics gate to reject the action, flagging a clear failure in the pipeline that can be traced back to the initial determination. We can measure our accuracy by the absence of such failures. Both mechanisms are sketched in the code below.
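A minimal sketch of both audit mechanisms follows. The field names, logging setup, and glove-arm speed threshold are assumptions for illustration; the point is that every disagreement and every physics-gate rejection becomes a measurable data point.

```python
# Sketch of the two audit mechanisms: cross-validation between AI passes, and a
# downstream physics-gate sanity check. Names and thresholds are illustrative.
import logging

logger = logging.getLogger("handedness_audit")

def cross_validate(action_id: str, handedness_call: str, analyzability_pass: str) -> None:
    # The dedicated handedness call and the later Analyzability pass should agree;
    # every discrepancy is logged as an automated accuracy metric.
    if handedness_call != analyzability_pass:
        logger.warning("Handedness mismatch on %s: %s vs %s",
                       action_id, handedness_call, analyzability_pass)

def glove_arm_is_sane(glove_arm_peak_speed_mph: float, limit_mph: float = 40.0) -> bool:
    # If the "glove arm" appears to move at throwing-arm speeds, the anatomical
    # model was almost certainly loaded with the wrong handedness; the physics
    # gate rejects the action and the failure is traced back upstream.
    return glove_arm_peak_speed_mph <= limit_mph
```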

3. It Respects the "Calculator vs. Observer" Paradigm.
This is the core architectural principle. Our system is a hybrid model:

  • The Calculator (Deterministic Code): The Da Vinci physics engine in the pose-estimator is a pure calculator. It takes in coordinate data and an anatomical model (handedness) and applies immutable laws of physics. It is 100% deterministic and verifiable.
  • The Observer (Generative AI): The LLM's role is to provide the configuration for the calculator. It performs the observational, context-aware task of identifying the correct anatomical model to use.

We are not using AI to calculate physics; we are using it to tell our physics engine which skeleton to load. This is a profound difference. It delegates the "fuzzy" observational task to the tool best suited for it, while reserving the "precise" mathematical task for deterministic code.
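The hand-off between the two roles can be sketched as follows. The class and function names are illustrative, not the actual Da Vinci engine interface; the point is that the model's single observational output becomes nothing more than configuration for deterministic code.

```python
# Sketch of the observer-to-calculator hand-off: GenAI supplies one piece of
# configuration, the deterministic physics engine does all of the math.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnatomicalModel:
    throwing_side: str  # "left" or "right"
    glove_side: str

def load_skeleton(handedness: str) -> AnatomicalModel:
    if handedness == "left_handed":
        return AnatomicalModel(throwing_side="left", glove_side="right")
    return AnatomicalModel(throwing_side="right", glove_side="left")

def run_physics_pass(coordinates, handedness: str) -> None:
    skeleton = load_skeleton(handedness)  # the observer's output, used purely as configuration
    # ...deterministic kinematics over `coordinates` using `skeleton` run here...
```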

Conclusion: The Right Tool for the Right Job

The industry's obsession with using GenAI as a replacement for entire engineering stacks is misguided. Its true value in complex, data-driven systems lies in its surgical application as a specialized component.

By using a powerful multimodal model for a simple, bounded, and verifiable observational task, we eliminated a mountain of complex and brittle deterministic code. We reduced engineering overhead while simultaneously increasing the accuracy of our system's foundational assumptions. This is the hallmark of mature AI system design: not a blind faith in a black box, but a deep understanding of the tool's strengths and weaknesses, applied with surgical precision. The most effective use of our AI is as an expert observer, providing the critical context that allows our deterministic code to perform its calculations with integrity.
