
Methodology · 2024 — present
Custom LoRA Training Pipelines
Most public discussion of "LoRA training" treats it as a configuration problem. Pick a rank. Pick target modules. Run the trainer. The actual interesting work — the reason a generic LoRA pipeline produces a generic LoRA and a custom one produces a model that does something useful — sits almost entirely in the loss function and the data pipeline. This page is a set of notes on the patterns that have held up across several projects: training Flux to produce HDR linear EXRs, training Wan 2.1 to reconstruct clipped highlights from cinema footage, and training Qwen-Image as an SDR-to-HDR mapper, plus the infrastructure work that surrounds all three.
01 — The thesis
The architecture of a LoRA is largely settled. Rank, alpha, target modules — there is a defensible default for each, and most of the interesting capacity gain comes from picking the right base model rather than tuning the adapter. The dataset is harder, but for a given task it is a finite curation problem with diminishing returns past a certain point. What does not converge to a default — and what most public LoRA training treats as solved when it is not — is the objective. The loss function is where the work lives.
A diffusion model trained with vanilla MSE on noise prediction will learn to imitate the distribution it is shown. That is fine if the distribution is what you want. It is rarely what I want. Training a Flux LoRA to generate scene-referred linear HDR is not a distribution-imitation problem; it is a problem of teaching the model the specific mathematical structure of a logarithmic encoding curve and the perceptual non-uniformity of error in highlight regions. Training Wan 2.1 to reconstruct HDR from clipped LDR is not "give the model more examples"; it is finding the loss terms that make the model care more about the recovered values above scene-reference white than about the rest of the frame. The architecture is the same in both cases. The base model is different. The dataset is different. But the largest design surface in either project is the loss function.
This page documents the patterns that have produced working models across that surface.
02 — Color science as the primary axis
The recurring observation across nearly every LoRA I have shipped is that a problem framed as "the model is not generating HDR / HDR-adjacent output correctly" is, on inspection, a problem of what space the loss is computed in. Computing MSE in linear space biases learning toward midtones because that is where most of the pixel mass lives. Computing it in display-encoded sRGB introduces a gamma curve that is wrong for HDR content. Computing it in a quantized 8-bit space discards exactly the precision the model is being trained to recover.
The pattern that has worked is to define the loss in a logarithmic working space that approximately mirrors human perceptual sensitivity to luminance, with the choice of curve matched to the dynamic range the model needs to learn. ARRI LogC4 is the obvious default for cinema-adjacent work — seventeen stops of dynamic range, well-specified, widely supported by professional tooling, and matched to the camera ecosystem the trained models will eventually integrate with. For training tasks where LogC4's range is more than the dataset can fill, narrower custom curves with the same mathematical shape but tighter latitude are sometimes the right choice. The encoding is always reversible, monotonic, and continuous, with a small linear segment near black for numerical stability.
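The structural requirements named above (reversible, monotonic, continuous, with a linear toe near black) can be made concrete in a few lines. The constants below are the published ALEXA LogC3 (EI 800) values, used here only because the shape is representative; the production curves (LogC4 and the narrower custom variants) follow the same piecewise structure with different latitude.

```python
import numpy as np

# Piecewise log encoding with a linear segment near black.
# Constants are the ALEXA LogC3 (EI 800) values; they are illustrative of the
# shape, not the exact curves used in the projects described here.
CUT, A, B = 0.010591, 5.555556, 0.052272   # linear/log breakpoint, log-segment scale/offset
C, D = 0.247190, 0.385537                  # encoded-output scale/offset
E, F = 5.367655, 0.092809                  # linear toe, continuous with the log segment at CUT

def encode(x):
    """Scene-linear -> log-encoded. Monotonic and continuous."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x > CUT, C * np.log10(A * x + B) + D, E * x + F)

def decode(y):
    """Exact inverse of encode (the curve is reversible by construction)."""
    y = np.asarray(y, dtype=np.float64)
    return np.where(y > E * CUT + F, (10 ** ((y - D) / C) - B) / A, (y - F) / E)
```

In the midtones, a one-stop change in scene luminance moves the encoded value by an approximately fixed delta (about C·log₁₀2), which is the property the stop-consistency loss described below relies on.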
Inside the loss itself, I have repeatedly found that decomposing the objective into named components — each enforcing a specific property of the target encoding — produces meaningfully better results than a single L1 or LPIPS term. Three components recur:
- Curve fidelity. A standard pixel-wise loss in the encoded space rather than linear space, so error contribution is roughly perceptually uniform rather than dominated by midtones.
- Stop consistency. A regularizer that penalizes deviation from the encoding's defining property — that a one-stop change in scene luminance should produce a fixed delta in the encoded value across the midtone range. This term goes to zero on a perfect log-encoded output and grows as the model's output deviates from the curve. It is, in effect, a structural prior on the answer.
- Anchor. A small term that penalizes deviation at one or two specific reference values — middle grey, scene-referred white, the upper black cutoff. This eliminates the most common drift failures, where the global tonal response is approximately right but the absolute placement is wrong.
Together these are what I think of as the "log family" of losses. Used standalone they are insufficient; layered with the standard flow-matching or noise-prediction objective they give the model the structural guidance the training distribution alone cannot.
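One plausible numpy sketch of how the three components compose, assuming the model output is already in encoded space and the ground truth is scene-linear; the weights, midtone range, and pair-sampling scheme are illustrative, not the production implementation:

```python
import numpy as np

def log_family_loss(pred_enc, target_lin, encode, w_curve=1.0, w_stop=0.1,
                    w_anchor=0.05, n_pairs=256, seed=0):
    """Composite 'log family' loss: curve fidelity + stop consistency + anchor.
    pred_enc: model output in encoded space; target_lin: scene-linear ground
    truth; encode: the reference log curve."""
    target_enc = encode(target_lin)

    # Curve fidelity: pixel-wise L1 in the encoded (roughly perceptually
    # uniform) space, so midtones do not dominate the error.
    curve = np.abs(pred_enc - target_enc).mean()

    # Stop consistency: for random midtone pixel pairs, the predicted encoded
    # difference should equal the per-stop delta times the true stop distance.
    d_stop = float(encode(0.36) - encode(0.18))   # encoded delta per stop
    idx = np.flatnonzero((target_lin > 0.05) & (target_lin < 1.0))
    stop = 0.0
    if idx.size:
        rng = np.random.default_rng(seed)
        i, j = rng.choice(idx, n_pairs), rng.choice(idx, n_pairs)
        stops = np.log2(target_lin.flat[i] / target_lin.flat[j])
        stop = np.abs((pred_enc.flat[i] - pred_enc.flat[j]) - d_stop * stops).mean()

    # Anchor: pin absolute placement at middle grey (18% scene reflectance).
    grey = float(encode(0.18))
    near_grey = np.abs(target_lin - 0.18) < 0.01
    anchor = np.abs(pred_enc[near_grey] - grey).mean() if near_grey.any() else 0.0

    return w_curve * curve + w_stop * stop + w_anchor * anchor
```

The stop term is zero for any output that sits exactly on the reference curve, which is what makes it a structural prior rather than a reconstruction loss.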
The architecture of a LoRA is mostly a settled question. The loss function is where almost all of the interesting work lives. The model is taught what kind of answer is good; everything else is implementation.
03 — Composite, tiered losses
The second pattern is that good loss functions are layered and conditional. A typical training run for the HDR work uses something like five distinct loss terms, each contributing on different timesteps, at different resolutions, or on different subsets of the data.
The composition that has worked best has the following shape. The base term is always flow-matching MSE on the predicted velocity, weighted by a per-timestep schedule that compensates for signal-to-noise ratio variance — debiased or min-SNR weighting depending on the timestep sampler. On top of that, a representation-alignment term — REPA, from Yu et al. 2024 — projects intermediate transformer hidden states into the embedding space of a frozen self-supervised vision model (DINOv2 in practice) and aligns them with the conditioning frame's features. This single term has been more important to convergence speed than any other single change I have tried; published results from the original paper showed convergence speedups of an order of magnitude, and my own runs broadly replicate that.
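Both base components reduce to short expressions. The sketch below shows the min-SNR weighting in its epsilon-prediction form (Hang et al. 2023; v-prediction uses a related form) and a REPA-style alignment term as negative mean cosine similarity; the projection head and the DINOv2 feature extraction are assumed to happen upstream:

```python
import numpy as np

def min_snr_weight(snr, gamma=5.0):
    """Min-SNR-gamma weighting: clamp the per-timestep SNR so low-noise steps
    do not dominate the gradient. Epsilon-prediction form: min(SNR, g) / SNR."""
    snr = np.asarray(snr, dtype=np.float64)
    return np.minimum(snr, gamma) / snr

def repa_alignment(hidden_proj, teacher_feat, eps=1e-8):
    """REPA-style term: negative mean cosine similarity between projected
    transformer hidden states and frozen self-supervised features (DINOv2
    patch embeddings in the pipeline described here)."""
    h = hidden_proj / (np.linalg.norm(hidden_proj, axis=-1, keepdims=True) + eps)
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=-1, keepdims=True) + eps)
    return -float(np.mean(np.sum(h * t, axis=-1)))
```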
A pixel-space term is computed only on a subset of timesteps — typically those with low noise where decoding is meaningful — by VAE-decoding the predicted x0 and comparing to the ground truth in encoded pixel space. This term gets an additional highlight-weighting multiplier so that errors in the brightest regions, which the global loss otherwise underweights, contribute a fair share of the gradient. The cost of the extra VAE decode is amortized by computing this term every Nth step rather than every step. Two perceptual terms — LPIPS and a DINOv2-on-decoded-output similarity — are available but typically held off until the run plateaus on the cheaper terms; engaging them earlier wastes compute on regions of the loss landscape the cheaper terms can already navigate.
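The gating and highlight weighting can be sketched as follows; the gate thresholds, the every-Nth cadence, and the boost factor are illustrative, and the VAE decode is assumed to have happened before the call:

```python
import numpy as np

def pixel_term(step, sigma, pred_enc, target_enc, target_lin,
               every_n=4, sigma_max=0.3, knee=1.0, boost=4.0):
    """Decoded-pixel-space L1, computed only every Nth step and only on
    low-noise timesteps (where decoding the predicted x0 is meaningful),
    with extra weight wherever the scene-linear target exceeds
    scene-reference white."""
    if step % every_n != 0 or sigma > sigma_max:
        return 0.0  # skip: this is what amortizes the extra VAE decode
    w = np.where(target_lin > knee, boost, 1.0)
    return float(np.mean(w * np.abs(pred_enc - target_enc)))
```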
Each term is a knob: a configured weight that defaults to a tuned value for the typical run and can be set to zero to disable the term cleanly. The whole composite is implemented as a single pluggable module that can be swapped between training scripts without changing the training loop. Several of these configurations are publicly visible in the wan-iclora-hdr and wan-vace-hdr-convert repositories.
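A minimal sketch of the knob-per-term shape; the field names and default weights are illustrative, and the point is only that a zero weight disables a term without it ever being computed:

```python
from dataclasses import dataclass, fields

@dataclass
class LossWeights:
    """One knob per term. Zero disables the term cleanly."""
    flow_match: float = 1.0
    repa: float = 0.5
    pixel_log: float = 0.25
    lpips: float = 0.0      # held off until the cheaper terms plateau
    dino_sim: float = 0.0   # held off until the cheaper terms plateau

def composite_loss(terms, weights):
    """Weighted sum over `terms`, a dict of name -> zero-argument callable.
    Disabled terms are never evaluated, so turning a knob to zero also
    removes its compute cost."""
    total = 0.0
    for f in fields(weights):
        w = getattr(weights, f.name)
        if w != 0.0:
            total += w * terms[f.name]()
    return total
```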
04 — Inventing losses when the standard set isn't enough
For some problems the right loss does not exist in the literature and has to be built from the problem definition. Two examples worth documenting.
Multi-character coherence. Training a single LoRA to render multiple distinct characters consistently in the same image is a well-known failure mode of standard LoRA training — characters bleed into each other, share features, or one of them quietly disappears in scenes that should contain both. The fix is a small family of attention-based losses applied alongside the diffusion objective. A concept separation term, implemented as a contrastive or InfoNCE loss on the per-character text-conditioning embeddings, pushes the embeddings apart in feature space. An attention overlap term penalizes spatial co-occurrence of attention mass between different character tokens, since overlapping attention is a leading indicator of character bleeding. An attention regularization term encourages each character's attention to be both sparse (not spread across the whole frame) and concentrated (entropy-minimizing within its support region). And a concept presence term penalizes samples in which one character's total attention mass drops below a threshold — the failure mode where a character vanishes entirely. These are based on patterns from the recent multi-concept literature (CLoRA, Mix-of-Show, AttenCraft) but adapted into a single composable extension that drops into existing training scripts.
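The overlap and presence terms are the simplest to sketch. The version below operates on per-character cross-attention maps; the normalization scheme and the presence floor are illustrative, not the production code:

```python
import numpy as np

def attention_overlap(attn_a, attn_b, eps=1e-8):
    """Spatial co-occurrence of attention mass between two characters' token
    groups: normalize each map to a distribution, then measure their overlap.
    0 means fully separated support, 1 means identical support."""
    pa = attn_a / (attn_a.sum() + eps)
    pb = attn_b / (attn_b.sum() + eps)
    return float(np.minimum(pa, pb).sum())

def concept_presence(attn, floor=0.05):
    """Penalty that turns on when a character's mean attention mass drops
    below a floor, the failure mode where the character vanishes entirely."""
    return max(0.0, floor - float(attn.mean()))
```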
HDR-specific perceptual error. Standard LPIPS is trained on display-referred 8-bit imagery and underweights error in scene-referred regions above 1.0 — exactly the regions an HDR LoRA is trying to learn. The fix here is a perceptual term computed in PU21 space (the gfxdisp/pu21 banding-aware encoding for HDR), which provides an HDR-aware alternative to LPIPS that responds correctly to the kind of error that shows up in highlight reconstruction. The training loop computes this term selectively on low-noise timesteps where decoded output is meaningful, similar to the LPIPS gating above.
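The structure of the term is: map absolute luminance through a perceptually-uniform encoding, then run the feature metric in that domain. The sketch below uses a simple log-shaped stand-in for the real PU21 fit (the published gfxdisp/pu21 coefficients should be used in practice) and accepts the metric as a parameter (LPIPS in the real pipeline):

```python
import numpy as np

def pu_like_encode(y, y_min=1e-3, y_max=1e4):
    """Stand-in for a PU21-style perceptually-uniform encoding: log-shaped in
    luminance, mapped to roughly [0, 1]. Mimics the qualitative shape only;
    the production term uses the published gfxdisp/pu21 fit."""
    y = np.clip(np.asarray(y, dtype=np.float64), y_min, y_max)
    return (np.log10(y) - np.log10(y_min)) / (np.log10(y_max) - np.log10(y_min))

def hdr_perceptual(pred_lin, target_lin, metric, encode=pu_like_encode):
    """Perceptual distance computed in the PU-encoded domain rather than on
    display-referred 8-bit values, so error above scene-referred 1.0 is not
    underweighted the way standard LPIPS underweights it."""
    return float(metric(encode(pred_lin), encode(target_lin)))
```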
The general principle behind both: when the standard losses don't penalize the failure mode that matters, a small targeted loss term is almost always better than more data or a longer schedule. Inventing the loss is the work.
05 — The discipline around the training itself
Loss design is the largest open surface, but training infrastructure absorbs almost as much engineering. A few practices I now treat as non-negotiable.
Validation gates before any long run. Every project gets a sequence of small smoke tests that have to pass before any real training is launched: a VAE round-trip test (does encoding and decoding through the model's VAE preserve the dynamic range the loss is computed against), a dataset-visualization sanity check (do the SDR/HDR pairs look right when rendered side-by-side), a five-step training smoke test (does forward + backward + checkpointing actually work on a single batch), and a gradient-presence assertion (are gradients flowing only to the LoRA parameters, not leaking into the frozen base model). These take about an hour to set up and have saved at least a full week of GPU time per project across the work documented here. The Qwen-Image-2512 HDR LoRA work documents the four-gate sequence explicitly in its README.
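The gradient-presence gate is the cheapest to show. A framework-agnostic sketch, assuming (name, param) pairs with a `.grad` attribute as produced by e.g. torch's `Module.named_parameters()`, and assuming LoRA parameters are identifiable by name (a naming convention, not a given):

```python
def assert_grads_only_on_lora(named_params):
    """Gate: after one backward pass on a single batch, every LoRA parameter
    must carry a gradient and no frozen base-model parameter may. The 'lora'
    substring convention is an assumption about the naming scheme."""
    for name, p in named_params:
        is_lora = "lora" in name.lower()
        if is_lora and getattr(p, "grad", None) is None:
            raise AssertionError(f"LoRA param {name} received no gradient")
        if not is_lora and getattr(p, "grad", None) is not None:
            raise AssertionError(f"gradient leaked into frozen param {name}")
```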
Explicit memory planning. For every training run, I write down the VRAM budget by component before launching: transformer in fp8, LoRA in bf16, VAE, optimizer state, activations and gradients at the chosen resolution, perceptual model overhead, plus a 10–15% safety margin. The plan then tells me what to drop first if I OOM, in priority order. This converts what is otherwise a frustrating cycle of crash → reduce-something-arbitrary → retry into a deliberate and reversible decision tree.
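The plan itself is just arithmetic plus an ordered drop list; a sketch with entirely illustrative component names and sizes:

```python
def vram_plan(components_gib, budget_gib, margin=0.125):
    """Pre-launch VRAM budget: sum per-component estimates (GiB) plus a safety
    margin, then derive the OOM drop order. `components_gib` is a list of
    (name, gib) pairs ordered lowest-priority first, i.e. in drop order.
    All figures are illustrative."""
    need = sum(g for _, g in components_gib) * (1 + margin)
    drops = []
    for name, gib in components_gib:
        if need <= budget_gib:
            break
        drops.append(name)
        need -= gib * (1 + margin)
    return need, drops
```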
Cached, deterministic dataset prep. EXR decode at training time is several times slower than reading a precomputed fp16 numpy array, so for runs longer than a few hours the input pipeline, not the GPU, becomes the bottleneck. The fix is to do the decode and any deterministic preprocessing once, write the result to disk as fp16 numpy, and read from there during training. The Qwen-Image LoRA work, which trains on roughly 1,200 HDR EXRs, sees a 5–10x throughput increase from this alone.
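The caching pattern in miniature. The slow decode is injected as a callable (an OpenEXR or imageio wrapper in practice) so the caching logic itself is testable without EXR files on disk:

```python
import numpy as np
from pathlib import Path

def cached_frame(exr_path, cache_dir, decode):
    """Decode an EXR once, persist as fp16 .npy, and hit the cache on every
    subsequent epoch. `decode` is the slow EXR reader; any deterministic
    preprocessing belongs inside it so the cache stays valid."""
    cache = Path(cache_dir) / (Path(exr_path).stem + ".npy")
    if cache.exists():
        return np.load(cache)
    frame = decode(exr_path).astype(np.float16)
    cache.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache, frame)
    return frame
```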
Best-checkpoint tracking by validation loss. Running long training and saving every Nth step is standard. The detail that took me too long to add: also maintaining a separate best/ directory that contains a copy of whichever checkpoint had the lowest validation loss seen so far, with a small best.json pointer file recording the step and metric value. Inference always points at best/ rather than latest/. This eliminates the most common cause of "the LoRA was working two days ago and now it isn't" — running past the optimum and saving regressions on top of the best result.
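The tracker is a few lines of file handling. A sketch with an illustrative directory layout (the best/ copy and best.json pointer are from the text; everything else is assumption):

```python
import json, shutil
from pathlib import Path

def track_best(ckpt_dir, step, val_loss, run_dir):
    """Keep best/ pointing at the lowest-validation-loss checkpoint seen so
    far, with best.json recording the step and metric. Returns True if this
    checkpoint became the new best. Inference reads best/, never latest/."""
    run = Path(run_dir)
    meta_path = run / "best.json"
    best = json.loads(meta_path.read_text()) if meta_path.exists() else None
    if best is not None and val_loss >= best["val_loss"]:
        return False  # regression: leave best/ untouched
    best_dir = run / "best"
    if best_dir.exists():
        shutil.rmtree(best_dir)
    shutil.copytree(ckpt_dir, best_dir)
    meta_path.write_text(json.dumps({"step": step, "val_loss": val_loss}))
    return True
```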
These are not novel ideas individually. They are, collectively, the difference between training runs that finish and produce a useful model and training runs that finish and produce a model that almost works.
06 — Where this has been applied
The methodology has been used across several projects, with varying levels of public release. Briefly:
The HDR generation work on FLUX.2 is the most-developed application — both the LogC4 trojan-horse approach and the eventual native-linear VAE-and-DiT fine-tuning use the patterns above. The bit-depth expansion network uses the composite-loss methodology in a non-LoRA, from-scratch context, and is the most aggressive application of the "invent the loss for the problem" principle.
For HDR reconstruction — taking already-clipped LDR cinema footage and recovering the highlight detail — I have run two parallel pipelines on Wan 2.1: a VACE-conditioned mask-and-detail-merge approach with a public training repo and inference GUI (wan-vace-hdr-convert, vace_2.1_hdr_convert_tool), and a parallel IC-LoRA experiment (wan-iclora-hdr) testing whether whole-frame conditioning can match the more conservative VFX-safe pipeline. Running both in parallel was deliberate: the same problem, two architecturally different training setups, evaluated against the same held-out set.
The Qwen-Image-2512 IC-LoRA work applies the same family of techniques to a different base model, with explicit validation gates, fp8 quantization, and the cached fp16-numpy dataset pattern, on consumer hardware (RTX 5090, 32 GB).
The multi-character separation work has been used internally for character-LoRA training where two or more characters need to coexist coherently. It is implemented as a clean extension to ai-toolkit so it composes with the standard training surface rather than replacing it.
07 — Reflection
The pattern visible across all of this work is that the difference between a generic LoRA and one that solves a real problem is almost never the model architecture and rarely the dataset. It is the loss. Once that lens is internalized, the work changes. A new problem becomes "what does my objective actually need to penalize that the standard objective misses?" rather than "what is the right rank?" The architecture choices fall out of the problem; the data falls out of the architecture; the loss is where domain knowledge enters the system.
This is the part of the work that does not generalize cleanly and is therefore the part worth doing. ai-toolkit and DiffSynth and the other generic LoRA frameworks handle the parts that do generalize; the value I add is in the layers above them — the curated colorimetric working space, the composite loss decomposition, the validation gates, the per-problem invented terms. None of these is novel in isolation. The methodology is what is novel, and only because I happen to come at the problem from the compositing side, where "what does the answer need to look like" is a question with a real answer rather than a benchmark.
There is more to do here. The most interesting open direction is probably automated loss-term gating — a system that watches the training metrics and engages the more expensive perceptual terms only when the cheaper ones plateau, rather than the current static schedule. The framework is mostly in place; the policy that would drive it is the missing piece.
Next case study
HDR Image Generation from Diffusion Models
The most-developed application of the methodology described here.