
Infrastructure · 2024 — present
Autonomous Dataset Agent for HDR Training Data
The HDR generation work is built on a training set of perspective views extracted from 360° Latlong HDR panoramas. The mechanical part of that extraction — rotating a panorama, sampling a perspective view, writing it to disk — is a few lines of code. The hard part is deciding which extracted views are worth including in the training set, and that is a problem no script can solve. It is a vision-and-judgment problem at the scale of hundreds of thousands of candidate frames. This is the system that does it: an autonomous agent that handles the entire extraction-and-curation loop without supervision, makes the judgment call on each view using a local vision-language model, and runs overnight on the same workstation that trains the diffusion model.
01 — The problem
A useful HDR training set requires diversity. A single 360° panorama, properly sampled, can yield thirty or more usable perspective views — different yaw, pitch, and field-of-view combinations capture genuinely different content, lighting, and composition from the same source frame. With roughly nine thousand source EXR files in the dataset, the upper bound on candidate views is in the hundreds of thousands.
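That mechanical half really is small. A minimal sketch of the Latlong-to-perspective sampling, numpy only, with conventions (horizontal FOV, y-down camera, nearest-neighbour lookup, no roll) that are assumptions rather than the production code:

```python
import numpy as np

def latlong_to_perspective(pano, yaw, pitch, fov_deg, out_w=1024, out_h=1024):
    """Sample a pinhole-camera view from an equirectangular (Latlong) panorama.

    pano: float32 array (H, W, 3) of linear HDR values.
    yaw/pitch in radians; fov_deg is the horizontal field of view.
    Nearest-neighbour lookup, no roll: a sketch, not the production sampler.
    """
    h, w = pano.shape[:2]
    f = (out_w / 2) / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels

    # Pixel grid -> camera-space ray directions (x right, y down, z forward).
    xs = np.arange(out_w) - out_w / 2 + 0.5
    ys = np.arange(out_h) - out_h / 2 + 0.5
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the rays: pitch around x, then yaw around y.
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (ry @ rx).T

    # Ray direction -> spherical coordinates -> Latlong pixel lookup.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])       # [-pi, pi], 0 = forward
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))  # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    v = ((lat / np.pi + 0.5) * h).clip(0, h - 1).astype(int)
    return pano[v, u]
```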
Not all of them are usable. A perspective sampled near the poles of the Latlong projection collapses into the projection's characteristic stretching distortion. A view cropped through a watermark, a signpost, a license plate, or any of the small content-rights problems that show up in real datasets has to be excluded. A view whose composition slices the subject through the middle of the frame is technically sharp but training-toxic. None of these failures is detectable by a sharpness or exposure heuristic; they require a judgment about what's in the image.
The honest manual approach is to put a human reviewer in front of every candidate and have them accept or reject. A senior compositor can do this competently for about an hour. After three hours, the rejection criteria start drifting. After eight, the reviewer is no longer making the same call on the same image. The decision becomes inconsistent in ways that are difficult to detect during review and impossible to correct after the fact, because the failures show up as quiet biases in the eventual model rather than as a flag on any individual frame.
The choice was either to hire a small team and accept the consistency problem, or to build a system that doesn't get tired.
A human reviewer is excellent for the first hour and unreliable by the eighth. The agent's ten-thousandth evaluation is identical in quality to its first. That is not a statement about intelligence. It is a statement about fatigue.
02 — Why this isn't a heuristic problem
The first attempt was a pipeline of classical metrics — Laplacian variance for sharpness, histogram analysis for exposure, edge density for compositional balance, perceptual hashing for duplicate detection. Together these eliminate maybe forty percent of bad views. They miss almost everything that actually matters.
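A condensed sketch of that first-pass filter, with illustrative thresholds standing in for the tuned production values:

```python
import cv2
import imagehash
from PIL import Image

def heuristic_reject(preview):
    """First-pass filters on a tone-mapped uint8 BGR preview of a candidate.

    Returns a rejection reason, or None if the view survives the heuristics.
    """
    gray = cv2.cvtColor(preview, cv2.COLOR_BGR2GRAY)

    # Sharpness: variance of the Laplacian. Low variance = soft or blurred.
    if cv2.Laplacian(gray, cv2.CV_64F).var() < 50.0:
        return "soft"

    # Exposure: too much of the histogram crushed to black or clipped to white.
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    if (hist[:8].sum() + hist[-8:].sum()) / hist.sum() > 0.40:
        return "exposure"

    return None

def view_hash(preview):
    """Perceptual hash used later for near-duplicate detection."""
    return imagehash.phash(Image.fromarray(cv2.cvtColor(preview, cv2.COLOR_BGR2RGB)))
```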
The remaining failures require judgments about content. Is this image dominated by a single recognizable brand logo? Does the framing crop a person's face in half? Is the horizon line tilted in a way that suggests an unintended capture rather than an intentional Dutch angle? Are the polar regions of the source panorama visible enough to introduce projection artifacts? These questions resist heuristics. A reviewer answers them by looking. Any system that answers them at scale has to look too.
This is the point at which a vision-language model becomes the natural tool. Not because the task requires anything close to general intelligence — it doesn't — but because it requires the specific kind of pattern-matching that humans do effortlessly and that classical computer vision does poorly.
03 — The vision-and-tools problem
The agent has two requirements. It has to look at an image and form an opinion about it. It also has to operate the panorama-extraction pipeline programmatically — rotate the source frame, extract a perspective sample, hash the result, write the file, advance to the next iteration — which means it has to reliably call functions.
These two capabilities, in 2024, were almost mutually exclusive at the model size class that runs locally. Quantized vision-language models that handled images well tended to lose function-calling reliability somewhere in the quantization process. Quantized models with strong tool use tended to lose vision. Several rounds of testing — AWQ versions of one family, GGUF versions of another, fp8 of a third — produced either a model that could see but couldn't operate the tools, or one that could operate the tools but ignored the image input.
The combination that worked was Mistral Small 3.2 in 8-bit quantization, served via vLLM. Vision is solid. Function calling is reliable across long agent traces. Throughput on a single RTX 5090, with proper concurrency handling, settles around 1,700 tokens per second, which is enough to process the full candidate set for a typical training run overnight rather than across a working week. The whole stack runs locally — same workstation that trains the diffusion model, same workstation that runs ComfyUI. No API calls, no rate limits, no per-token cost on a quarter-million-evaluation pass. The fp8 and AWQ failures were not catastrophic — they would correctly call tools most of the time, and correctly evaluate images most of the time. But "most of the time" across a quarter-million evaluations means several thousand silently wrong decisions, which is the failure mode I was specifically trying to engineer out.
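The evaluation call itself is ordinary OpenAI-compatible traffic against the local vLLM server. A sketch; the port and served model name are assumptions:

```python
import base64
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the port and served model name
# below are assumptions, matching whatever `vllm serve` was launched with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def evaluate_view(png_bytes: bytes, prompt: str) -> str:
    """One vision evaluation: image plus structured prompt in, verdict out."""
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model="mistral-small-3.2",  # served model name: an assumption
        temperature=0.0,            # same criteria for every evaluation
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```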
04 — The agent loop
The agent itself is a LangGraph state machine wrapping the model with a small set of custom tools. The tool surface is deliberately narrow — fewer choices means fewer failure modes (the sketch after the list shows how these are wired up):
- Rotate panorama. Apply a yaw and pitch transform to the working Latlong frame.
- Extract perspective. Sample a perspective view from the rotated panorama at a given field of view, returning the result as an in-memory image.
- Evaluate view. The vision call. Pass the extracted view to the model with a structured prompt asking for a usable-for-training judgment along with a brief reason. The model returns a verdict and a short rationale.
- Check duplicate. Compare a perceptual hash of the candidate against the set of already-exported views to prevent near-duplicates from accumulating.
- Export. Write the candidate to disk as a 16-bit linear EXR with metadata recording the source panorama, the rotation parameters, and the model's rationale for inclusion.
- Reject and continue. Advance the state without exporting and log the rejection reason.
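Wired up, the surface looks roughly like this. The helper functions (apply_rotation, sample_view, phash_of, write_exr_with_sidecar) are hypothetical stand-ins for the pipeline code, and the vision evaluation from the earlier snippet plugs in the same way:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# Working state lives outside the model; the tools are the only way to touch
# it. apply_rotation, sample_view, phash_of and write_exr_with_sidecar are
# hypothetical stand-ins for the actual pipeline functions.
STATE = {"pano": None, "view": None, "accepted_hashes": []}

@tool
def rotate_panorama(yaw_deg: float, pitch_deg: float) -> str:
    """Apply a yaw and pitch transform to the working Latlong frame."""
    STATE["pano"] = apply_rotation(STATE["pano"], yaw_deg, pitch_deg)
    return f"rotated: yaw={yaw_deg}, pitch={pitch_deg}"

@tool
def extract_perspective(fov_deg: float) -> str:
    """Sample a perspective view from the rotated panorama."""
    STATE["view"] = sample_view(STATE["pano"], fov_deg)
    return "candidate view extracted"

@tool
def check_duplicate() -> str:
    """Compare the candidate's perceptual hash against exported views."""
    h = phash_of(STATE["view"])
    if any(h - prev <= 6 for prev in STATE["accepted_hashes"]):  # Hamming distance
        return "near-duplicate of an already-exported view"
    return "novel"

@tool
def export_view(rationale: str) -> str:
    """Write the candidate as 16-bit linear EXR with provenance metadata."""
    STATE["accepted_hashes"].append(phash_of(STATE["view"]))
    write_exr_with_sidecar(STATE["view"], rationale)
    return "exported"

llm = ChatOpenAI(base_url="http://localhost:8000/v1", api_key="local",
                 model="mistral-small-3.2", temperature=0.0)
agent = create_react_agent(
    llm, tools=[rotate_panorama, extract_perspective, check_duplicate, export_view])
```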
For each source panorama, the agent iterates through a coarse rotation grid, evaluates each candidate, runs the duplicate check on accepted views, and proceeds. The iteration is autonomous — the agent decides, per source frame, when it has extracted enough usable diversity to move on. There is no human-in-the-loop after the launch.
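The per-panorama driver is correspondingly small, because the iteration logic rides in the instructions rather than in Python. A sketch, continuing the one above; the grid spacing and stop rule are illustrative assumptions:

```python
from pathlib import Path

# The agent works through the grid tool call by tool call and decides itself
# when a source frame is exhausted. Grid values here are assumptions.
SYSTEM = """You curate training views from one 360-degree Latlong panorama.
Work through yaw in 45-degree steps, pitch in {-30, 0, 30}, fov in {60, 90}.
For each candidate: rotate, extract, evaluate, check for duplicates, then
export or reject with a named reason. Stop when further rotations stop
yielding usable new content."""

for pano_path in sorted(Path("source_panos").glob("*.exr")):
    load_panorama(pano_path)  # hypothetical stand-in: resets STATE["pano"]
    agent.invoke({"messages": [("system", SYSTEM),
                               ("user", f"Begin curation of {pano_path.name}.")]})
```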
The structured-prompt design for the evaluation call took several iterations to land. Early prompts asked the model for a yes/no decision and got compliant but uninformative responses — the model would correctly accept good views and correctly reject obvious failures, but the borderline cases got coin-flipped without consistency. Adding a rationale requirement, and a constraint that the rationale name a specific failure mode for any rejection, dramatically improved consistency on the borderline cases. The model is, in effect, forced to commit to a reason before committing to a decision.
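One way to get that behavior is a structured-output schema in which the rationale field precedes the verdict, so the tokens for the reason are generated first. A sketch with pydantic, assuming the server supports structured output (vLLM does, via guided decoding):

```python
from typing import Literal
from pydantic import BaseModel, Field

class ViewVerdict(BaseModel):
    # Field order is the point: generating the rationale first forces the
    # model to commit to a reason before it commits to a decision.
    rationale: str = Field(description=(
        "One sentence. For a rejection, name the specific failure mode "
        "(pole distortion, cropped face, watermark, near-duplicate, ...)."))
    verdict: Literal["accept", "reject"]

# With the LangChain client from the earlier sketch:
structured_eval = llm.with_structured_output(ViewVerdict)
```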
05 — What changed
The first end-to-end run on the production dataset processed the equivalent of three working weeks of manual review in a single overnight session, on the workstation that was otherwise idle. The accepted-view set, on subsequent spot-checking, was tighter and more consistent than any human-reviewed sample I had previously assembled — not because the model is making smarter decisions, but because it is making the same decisions throughout the run.
The numbers, against the manual baseline:
| Metric | Manual review | Agent |
|---|---|---|
| Throughput (candidate views/hr) | ~250 (sustainable) | ~25,000 |
| Decision drift across run | Substantial | None |
| Per-decision cost | Reviewer time | ~0 (local inference) |
| Annotation consistency | Best-effort | Deterministic at temperature 0 |
| Duplicate-set growth | Manual hash check | Automatic |
The throughput number tells one story. The "decision drift" row tells the more important one. The system's value is not that it is fast — though it is fast — but that the ten-thousandth decision is made under exactly the same criteria as the first.
06 — Reflection
The unexpected lesson from this project was not technical. It was that a substantial fraction of the work labeled "tedious manual work" in VFX pipelines is not tedious because it is mechanical. It is tedious because it is judgment under fatigue — work that a human can do well in short bursts but that loses quality across the duration required to do it at scale. Curating a training set, tagging shots for the comp department, sifting reference plates from a shoot for usable elements — these are not boring tasks. They are tasks where the boring part is the consistency requirement, not the work itself.
The agent does not replace a senior compositor's judgment about what makes a good frame. The agent replaces the part of the senior compositor's day that was always going to be done worse by the eighth hour than the first. There is a meaningful distinction between those two things, and conflating them is what makes most "AI replaces creative roles" framing wrong in both directions.
What is still missing is the next step up — an agent that not only curates an existing source set but reasons about what kinds of additional source material the training set is short on, and queues a separate scraping or shooting plan. That work has started in a small way, but it is closer to a research direction than a shipped tool. For now, the system is sufficient for what the HDR generation project needs: a curated, diverse, consistently-judged training set that the diffusion model can actually use.