Five diffusion papers worth reading: June 12–15, 2026 (weekend + Monday batch)

This extended 72-hour batch (Friday PM through Monday) yields five papers spanning the full deployment stack. HeyGen's Avatar V (arXiv 2606.13872) beats Veo 3.1 at Face Similarity (0.840 vs 0.714) using Sparse Reference Attention conditioning on full reference video sequences. Snap Research's CineOrchestra (2606.13768) unifies all four cinematic control axes in one video DiT via two parameter-free coordinated RoPEs. Adobe's HiLo-Token (2606.13898) delivers 3.13× DiT speedup with zero quality regression, already live in Photoshop. UC Berkeley's DiPOD (2606.13795) fixes the "double drift" failure in diffusion RL, lifting Sudoku accuracy from 25% to 97%. Cambridge's recursive DMs theory (2606.13796) gives the first closed-form characterization of training collapse and an annealed truncation schedule that eliminates it.

ArXiv Diffusion Models Digest
2026. 6. 15. · 22:23

Research brief

This is the extended weekend + Monday batch covering Friday PM (June 12) through Monday (June 15). The window is roughly 72 hours because arXiv does not announce new submissions on Saturday or Sunday. Five papers made the cut from 417 new entries scanned (131 cs.CV + 286 cs.LG). The range is wider than in a typical daily batch: one production-scale deployment paper from HeyGen with win rates against Veo 3.1 and Kling O3 Pro, a new unified cinematic control framework from Snap Research, a live-in-production token compression result from Adobe, a UC Berkeley RL stability fix that turns 25% Sudoku accuracy into 97%, and a Cambridge proof about what happens when you train a diffusion model on its own outputs.

Speed-read table

| Paper | arXiv | Institution | Core method | Key number | Code / demo |
| --- | --- | --- | --- | --- | --- |
| Avatar V | 2606.13872 | HeyGen Research | Flow-matching DiT with Sparse Reference Attention on full video token sequences | Face Sim 0.840 vs Veo 3.1 0.714; win rate 72.5% vs Veo 3.1 | Project page |
| CineOrchestra | 2606.13768 | Snap Research / UC Merced | Entity-centric conditioning + two parameter-free coordinated RoPEs | Shot Transition Recall 0.486 (next best: 0.431) | GitHub |
| HiLo-Token | 2606.13898 | Adobe | Adaptive high/low-frequency token split via Sobel edge detection; dilated mask retention | DiT speedup 3.13× (small masks), end-to-end 1.77× on A100; 0 quality regression | None (patent filed) |
| DiPOD | 2606.13795 | UC Berkeley / Impossible Inc. / NVIDIA | Alternating self-distillation + policy-gradient to keep ELBO tight | Sudoku 97.56% (baseline SPG: 25.12%) | GitHub |
| Recursive DMs theory | 2606.13796 | University of Cambridge | Closed-form collapse distribution; Hermite spectral decomposition; annealed truncation schedule | Geometric convergence rate κ = √(1-α) e^{-t₀/2} | None |

1. Avatar V: HeyGen's production avatar system beats Veo 3.1 at Face Similarity

arXiv: 2606.13872 | HeyGen Research (Benjamin Liang lead, Zhenhui Ye corresponding; 23 authors) | cs.CV
Peer-review status: Preprint. Project page public. No open weights or inference code.
The clearest way to state what Avatar V does differently: most avatar systems encode identity as a static embedding — a compressed vector representing what a person looks like. Avatar V conditions instead on the full token sequence of a reference video, with no bottleneck compression, capturing not just facial geometry but talking rhythm, micro-expressions, and gestural tendencies. 1
The mechanism enabling this is Sparse Reference Attention — an asymmetric design where generation tokens attend to all reference tokens (full receptive field over the reference), but reference tokens only self-attend. This keeps attention complexity linear in reference length, rather than quadratic, which is what makes arbitrarily long reference videos practical. 1
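The asymmetry described above can be written down as an attention mask. The following is a minimal NumPy sketch, not HeyGen's implementation: the function name `sparse_reference_mask` and the token layout (generation tokens first, reference tokens after) are assumptions made for illustration.

```python
import numpy as np

def sparse_reference_mask(n_gen: int, n_ref: int) -> np.ndarray:
    """Boolean attention mask; mask[q, k] is True when query q may attend to key k.

    Assumed layout: indices [0, n_gen) are generation tokens,
    indices [n_gen, n_gen + n_ref) are reference tokens.
    """
    n = n_gen + n_ref
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_gen, :] = True        # generation queries see generation + reference keys
    mask[n_gen:, n_gen:] = True   # reference queries see reference keys only
    return mask

mask = sparse_reference_mask(n_gen=4, n_ref=6)
allowed_pairs = int(mask.sum())   # 4 * 10 generation rows + 6 * 6 reference rows
```

One consequence of this pattern is that reference rows never read generation columns, so reference activations do not depend on the noisy latent. The reference block is dense here for clarity; the linear-in-reference-length cost quoted above implies further structure inside that block which this digest does not describe.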
Training is five stages: Text-to-Video pretraining → Audio-to-Video pretraining → Personality SFT → Two-phase distillation (CFG distillation then DMD, yielding more than 10× inference acceleration, down to 24 denoising steps) → RLHF with GRPO and DPO. The data engine processed 50M raw videos into 100M+ training clips using a cross-clip identity connectivity graph to preserve identity coherence across clips from the same person. 1
On a 70-case cross-scene benchmark, Avatar V reaches Face Similarity 0.840 versus Veo 3.1's 0.714 and achieves LSE-C 8.97 (lip sync, higher is better) and LSE-D 6.75 (lower is better) — the paper claims both metrics surpass even the ground-truth recordings. Pairwise win rates from human evaluation: 68.9% versus Seedance 2.0, 69.6% versus Kling O3 Pro, 72.5% versus Veo 3.1, 85.7% versus OmniHuman 1.5. 2
Figure: Avatar V DiT architecture. Six inputs (noisy video latent, reference image, audio features, text, video reference tokens, and motion tokens) are patchified and concatenated, then processed through L transformer blocks with Sparse Reference Self-Attention, text/audio cross-attention, and a Motion Injection Module. Sparse Reference Self-Attention is the key mechanism for linear-complexity conditioning on full reference video sequences. 2
Avatar V is deployed across 5,000+ GPUs. The infrastructure stack includes a custom compiler that uses LLM-based "agentic kernel synthesis" with an evolution island strategy, achieving a 3× latency reduction over the unoptimized baseline and 33% over torch.compile Inductor. Inference generates 1080p video in 6.4-second chunks. 3
Why read it: The Sparse Reference Attention design is the transferable piece — any generation task that needs linear-complexity conditioning on long reference sequences is a candidate for this pattern. The deployment details (NVSHMEM sequence parallelism, streaming VAE decode, kernel synthesis) are also unusually specific for a preprint and are worth reading for anyone building production-scale DiT inference.

2. CineOrchestra: one model, four cinematic control axes simultaneously

arXiv: 2606.13768 | Sharath Girish, Tsai-Shien Chen et al. (Snap Research + UC Merced; 7 authors) | cs.CV
Peer-review status: Preprint. GitHub repository confirmed.
Prior cinematic control frameworks tackle one axis at a time: subject personalization (who is in the video), event timing (when things happen), camera movement (how the frame moves), and shot transitions (how scenes cut). CineOrchestra's central claim is that all four axes share a common structure, making a unified architecture possible rather than an ensemble of four bespoke systems. 4
The shared structure: every cinematic element — a character, a camera pan, a scene transition — is an entity acting over a specific temporal interval. With that framing, the conditioning problem collapses to a single positional encoding problem: how to route each entity's conditioning signal to its target spatiotemporal region without the signals interfering. 5
Two parameter-free solutions:
  1. Interval-sampled temporal RoPE with β(L) duration rescaling. Standard RoPE assigns positions uniformly across a sequence, which creates attention inconsistency when events have wildly different durations (a 0.1-second hard cut versus a 10-second slow pan). The interval-sampled variant rescales positional frequencies so that attention similarity peaks remain duration-invariant. 5
  2. 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions. Each entity gets conditioning routed to its specific spatiotemporal region rather than broadcast globally. 5
No new learnable parameters are added over the base video DiT. Both RoPEs are modifications to positional encoding only. 4
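To make the duration-invariance idea concrete, here is a toy NumPy sketch. It is not the paper's β(L) rescaling, whose formula this digest does not give. It uses a simplified stand-in: each event's interval is linearly mapped onto a fixed positional span (the hypothetical `span` parameter), so the rotary phase between an event's first and last position is the same for a short cut and a long pan. The function names are also illustrative.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Standard RoPE: rotate consecutive feature pairs of x (n, dim) by position-dependent angles."""
    dim = x.shape[1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    ang = np.outer(positions, inv_freq)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def endpoint_similarity(start, end, rescale, span=16.0, dim=8):
    """Dot product between identical vectors placed at an event's first and last position."""
    x = np.ones((2, dim))
    pos = np.array([start, end], dtype=float)
    if rescale:
        # Assumed rescaling: keep the start anchor, stretch or shrink the interval to `span`.
        pos = start + (pos - start) * (span / (end - start))
    r = apply_rope(x, pos)
    return float(r[0] @ r[1])

hard_cut = (10, 13)    # 3-frame event
slow_pan = (10, 250)   # 240-frame event
raw = [endpoint_similarity(s, e, rescale=False) for s, e in (hard_cut, slow_pan)]
scaled = [endpoint_similarity(s, e, rescale=True) for s, e in (hard_cut, slow_pan)]
```

With raw frame positions the two events produce very different endpoint similarities. With the rescaled positions they match, because RoPE similarity depends only on the position difference, and that difference is now `span` for both.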
On CineBench (512 real movie/TV clips, 3.2k entities, 6.9k events), CineOrchestra achieves Shot Transition Recall 0.486 — the best of all seven methods tested, ahead of Phantom at 0.431. Subject identity (M-DINO) scores 0.502. In a 512-prompt user study across 8 dimensions, CineOrchestra is preferred on every entity-, text-, and structure-related dimension against all six per-axis specialist baselines. 5
A notable generalization result: the model was trained on 10-second clips but generates coherent 40-second sequences at inference — 4× longer than any training example — without fine-tuning. 4
Why read it: The entity-centric framing is the conceptual contribution worth extracting. The argument that heterogeneous control signals (persons, cameras, transitions) all instantiate the same entity-over-interval primitive is a unification move applicable to other multi-axis control problems. The duration-invariant RoPE design has direct relevance for any transformer conditioning on events with highly variable temporal spans.

3. HiLo-Token: Adobe ships 3.13× DiT speedup inside Photoshop with zero quality regression

arXiv: 2606.13898 | Haoran You, Yotam Nitzan et al. (Adobe ART AI Lab + Adobe Research; 10 authors) | cs.CV
Peer-review status: Preprint. Patent filed. No open-source code.
The paper opens with a deployment fact that sets the stakes: within 28 days of Photoshop v27.0 shipping in October 2025, 1.1 million out of 3.3 million users engaged with Generative Fill, generating 36.2 million total interactions and consuming 82.8 million generative credits. 6 Even after distilling Adobe's MultiEdit model from 50 to 8 timesteps, the DiT alone still consumes 73% of total inference latency — the bottleneck is the transformer, not the VAE or the surrounding pipeline. 7
HiLo-Token's approach: split the input image into two token populations based on spatial frequency, keep different proportions of each, and let the mask dictate the local token budget.
Figure: HiLo-Token framework. Left panel: dual patchify branches (16× downsampled low-frequency and Sobel-selected high-frequency) producing a merged compressed token sequence. Right panel: the token selection pipeline with mask dilation, Sobel spatial frequency map, and regionalization. Tokens are selected adaptively by spatial frequency: all tokens inside the dilated user mask are kept, Sobel edge detection picks high-frequency tokens outside the mask, and 16× downsampled tokens represent low-frequency global structure. 7
Inside the user mask: all tokens retained at full resolution. Outside the mask: high-frequency regions (Sobel edge magnitude above threshold) kept at original scale; low-frequency regions replaced with 16× downsampled tokens. The token budget shrinks automatically as masks get smaller, and the method requires no attention-based importance scoring — which matters for inpainting specifically, because the content being generated doesn't yet exist at early diffusion steps, making cross-region attention signals unreliable. 6
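The selection rule above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not Adobe's code. The patch size, edge threshold, dilation radius, and the reading of "16×" as one coarse token per 4×4 block of patches are all assumptions, as are the function names.

```python
import numpy as np

def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2D grayscale image (edge-padded)."""
    p = np.pad(img, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def block_reduce(a, size, fn):
    """Reduce non-overlapping size x size blocks of a 2D array with fn (np.max or np.min)."""
    h, w = a.shape
    return fn(a.reshape(h // size, size, w // size, size), axis=(1, 3))

def dilate(grid, radius):
    """Binary dilation of a boolean grid with a square (2 * radius + 1) window."""
    h, w = grid.shape
    p = np.pad(grid, radius, mode="constant")
    out = grid.copy()
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= p[dy:dy + h, dx:dx + w]
    return out

def hilo_select(img, mask, patch=2, coarse=4, edge_thresh=1.0, dilation=1):
    """Return (fine_keep, n_tokens): full-resolution patches kept, and the total token count."""
    in_mask = dilate(block_reduce(mask, patch, np.max), dilation)       # dilated user mask
    edges = block_reduce(sobel_magnitude(img), patch, np.max) > edge_thresh
    fine_keep = in_mask | edges                                         # kept at original scale
    # One coarse token for every block that still has low-frequency area to represent.
    needs_coarse = ~block_reduce(fine_keep, coarse, np.min)
    return fine_keep, int(fine_keep.sum() + needs_coarse.sum())

img = np.zeros((32, 32))
img[:, 20:] = 1.0                      # one vertical edge outside the mask
mask = np.zeros((32, 32), dtype=bool)
mask[4:8, 4:8] = True                  # small user mask
fine_keep, n_tokens = hilo_select(img, mask)
full_tokens = (32 // 2) ** 2           # uncompressed token count at the same patch size
```

Nothing in this sketch reads attention maps. The budget follows from the mask and the image content alone, which is the property the paper argues matters for inpainting.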
Results on A100-80GB: DiT speedup 3.13× (small masks, avg 6.38% mask ratio), 2.59× (medium, avg 15.92%), 1.67× (large, avg 35.36%). End-to-end speedup is 1.77×, 1.66×, 1.33× respectively. User study shows tie rates of 48% (Remove), 70% (Generative Fill), 81% (Generative Expand) between HiLo-Token and the uncompressed baseline — and win counts are comparable. No quality regression. 7
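A quick Amdahl's-law check relates the DiT-only and end-to-end numbers. It is a back-of-envelope sketch that assumes only the DiT is accelerated and that token selection adds no overhead; neither assumption is confirmed by the paper, and the function names are illustrative.

```python
def end_to_end_speedup(dit_fraction, dit_speedup):
    """Amdahl's law: overall speedup when only the DiT share of latency is accelerated."""
    return 1.0 / ((1.0 - dit_fraction) + dit_fraction / dit_speedup)

def implied_dit_fraction(e2e_speedup, dit_speedup):
    """Invert Amdahl's law: the DiT latency share consistent with both reported speedups."""
    return (1.0 - 1.0 / e2e_speedup) / (1.0 - 1.0 / dit_speedup)

# (DiT speedup, end-to-end speedup) per mask size, as reported above.
reported = {"small": (3.13, 1.77), "medium": (2.59, 1.66), "large": (1.67, 1.33)}
implied = {name: implied_dit_fraction(e2e, dit) for name, (dit, e2e) in reported.items()}

# What a 73% DiT share would predict for the small-mask case.
predicted_small = end_to_end_speedup(0.73, 3.13)
```

Under these assumptions the implied DiT share comes out at roughly 0.62 to 0.65 for all three mask sizes, while a 73% share would predict about 1.99× end-to-end for small masks. The gap could reflect token-selection overhead or a different measurement setup; the digest does not say which.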
Infrastructure impact: 33% reduction in AWS p5.48xlarge (8×H100) nodes required to serve the Remove feature, at $55.04 per node-hour. 6
Why read it: A solved production problem at scale, not a lab prototype. The frequency-based token selection idea — and specifically the argument against attention-based importance prediction for inpainting — applies to any DiT serving task where the edit region is available at inference time. The method description is complete enough to replicate despite no open code.

4. DiPOD: fixing the "double drift" failure in diffusion policy RL

arXiv: 2606.13795 | Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab (UC Berkeley + Impossible Inc. + NVIDIA) | cs.LG
Peer-review status: Preprint. Code open-source.
Current variational diffusion RL methods (FPO, SPG) use the ELBO as a proxy for log-likelihood during policy gradient updates. This works near initialization because the ELBO is tight there. But RL updates are not structure-preserving: as policy gradient steps move the model away from its starting point, the ELBO loosens — it drifts from log-likelihood. That is the first drift. The loosened ELBO then causes proxy policy gradient updates to diverge from the true policy gradient direction. That is the second drift. Together: double drift. 8
Figure: DiPOD landscape diagram. A 3D expected-reward surface with three labeled parameter-space points (x, y, z) and three probability distribution panels. The trajectory of prior algorithms (red) drifts off the reward peak. DiPOD (blue) alternates policy gradient steps with self-distillation, which tightens the evidence bound without changing the policy distribution, pulling the model back toward tight-ELBO regions and keeping the surrogate gradient reliable throughout training. Top right: GSM8K reward curves for FPO vs DiPOD. 8
DiPOD's fix: after each policy gradient step, run a self-distillation step that maximizes the ELBO on rollouts drawn from the current policy, pulling the model back into the tight-ELBO regime. The key property of self-distillation is that it tightens the evidence bound without changing the policy distribution — so it recalibrates the surrogate without undoing the policy improvement. 8
Implementation: add a β = 0.05 ELBO regularization term to the standard diffusion RL update (Equation 7). Drop-in augmentation for existing algorithms. 9
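The shape of that augmented update can be sketched as follows. Equation 7 is not reproduced in this digest, so the additive form and the sign convention (a loss to minimize, with the ELBO on current-policy rollouts being maximized) are assumptions, and `dipod_objective` is a hypothetical name.

```python
def dipod_objective(pg_loss, elbo_on_rollouts, beta=0.05):
    """Scalar loss to minimize: policy-gradient surrogate loss minus beta times the mean ELBO.

    pg_loss          -- surrogate loss from the base algorithm (e.g. SPG or FPO)
    elbo_on_rollouts -- per-sample ELBO values on rollouts drawn from the current policy
    beta             -- weight of the ELBO regularizer (0.05 as quoted above)
    """
    mean_elbo = sum(elbo_on_rollouts) / len(elbo_on_rollouts)
    return pg_loss - beta * mean_elbo

loss = dipod_objective(pg_loss=1.0, elbo_on_rollouts=[-2.0, -4.0])
```

Because the extra term only touches the loss, the base algorithm's rollout and advantage code would stay unchanged, which matches the drop-in framing.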
Results on LLaDA-8B-Instruct (discrete diffusion language model):
| Task | SPG baseline | SPG + DiPOD | Gain |
| --- | --- | --- | --- |
| Sudoku | 25.12% | 97.56% | +72.44 pp |
| Countdown | 51.95% | 80.08% | +28.13 pp |
| GSM8K | 84.23% | 84.91% | +0.68 pp |
| MATH500 | 37.80% | 40.00% | +2.20 pp |
The Sudoku and Countdown gains are large; the math gains are small. The authors' interpretation: the fixed context window of the diffusion language model is the bottleneck for harder math reasoning — DiPOD fixes the optimization dynamics but cannot remove the architectural limit. 8
DiPOD also validates in continuous control: on a Unitree G1 humanoid robot motion-tracking task, one self-distillation initialization pass followed by FPO++ training improves reward and tracking duration over the baseline. 8
Why read it: The diagnostic is the useful part — "ELBO loosening causes proxy gradients to drift" is a precise failure mode, not a vague warning. If you are running any variational RL procedure on a diffusion model (language, vision, or control), this paper gives you a way to check whether the failure mode is active (track the variational gap D^L during training) and a simple fix if it is. The code is public.

5. Recursive diffusion training: Cambridge proves truncation alone drives collapse

arXiv: 2606.13796 | Naïl B. Khelifa, Richard E. Turner, Ramji Venkataramanan (University of Cambridge) | stat.ML / cs.LG
Peer-review status: Preprint. No code released.
The setting: train a diffusion model, generate synthetic data with it, train the next generation on the mixture of real and synthetic data, repeat. This is already happening in practice — any model trained on internet-scale data collected after 2022 is very likely training on outputs of earlier models. The question this paper answers formally is: what does the output distribution converge to? 10
The answer requires confronting a subtle issue. Even with a perfect score estimator and exact sampling, the reverse diffusion process must stop at some small time t₀ > 0 (truncation) for numerical stability — it cannot actually run to t = 0. That truncation introduces a small Gaussian smoothing of the data distribution at each generation. Across generations, these smoothings compound. 10
The paper's main result (Theorem 3.1): the recursion converges geometrically to a unique limiting distribution, p∞⋆, which has a closed-form expression as an infinite mixture of progressively Gaussian-smoothed versions of the original data:
p∞⋆ = α Σ_{k=0}^{∞} (1-α)^k U_{(k+1)t₀}(p_data)
where α is the fraction of real data mixed in per generation and U_t is the heat-semigroup operator (Gaussian blur at scale t). Convergence rate: W₂(p^N, p∞⋆) ≤ κ^N · W₂(p_data, p∞⋆), with κ = √(1-α) e^{-t₀/2}. 10
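A second-moment sanity check makes the closed form tangible. It is a toy under stated assumptions, not the paper's experiment: U_t is taken to add variance t to a distribution (plain Gaussian blur; the paper's normalization may differ), and each generation is modeled as mixing real data with weight α and then blurring by t₀. Under those assumptions the variance of p∞⋆ is σ² + t₀/α, where σ² is the data variance.

```python
def closed_form_variance(sigma2, alpha, t0, terms=10_000):
    """Variance of alpha * sum_k (1 - alpha)^k U_{(k+1) t0}(p_data), as a truncated series."""
    return sum(alpha * (1 - alpha) ** k * (sigma2 + (k + 1) * t0) for k in range(terms))

def recursive_variance(sigma2, alpha, t0, generations):
    """Variance after each generation: mix real data (weight alpha) with the previous output, then blur."""
    v = sigma2 + t0                  # generation 0: trained on real data, sampled with truncation
    history = [v]
    for _ in range(generations):
        v = alpha * sigma2 + (1 - alpha) * v + t0
        history.append(v)
    return history

sigma2, alpha, t0 = 1.0, 0.5, 0.01
limit = closed_form_variance(sigma2, alpha, t0)
history = recursive_variance(sigma2, alpha, t0, generations=60)
gap_ratio = (history[2] - limit) / (history[1] - limit)   # per-generation shrink of the gap
```

In this toy the gap to the limit shrinks by a factor of (1-α) per generation. That is a statement about variances only and should not be read as a check of the W₂ bound.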
The spectral picture (Proposition 3.3): via Hermite spectral decomposition, recursive training acts as a low-pass filter. High-order Hermite modes (encoding fine-grained non-Gaussian structure — multimodality, tail behavior, sharp textures) are attenuated far more severely than low-order modes. Higher generation count and lower α accelerate this attenuation.
Figure: Four side-by-side heatmaps of 2D Hermite coefficients for p_data (left) and the collapse distribution p∞⋆ at α = 0.1, 0.5, and 0.9. The p_data panel shows high-order checkerboard patterns across both axes. High-order modes (n₁, n₂ > 5) are progressively erased as α decreases: the α = 0.9 panel retains moderate structure but still shows visible high-frequency loss compared to p_data, and by α = 0.1 the panel is nearly blank, with almost all fine-grained structure gone. 11
A positive result (Theorem 4.1): any annealed truncation schedule where t₀(N) → 0 as N → ∞ asymptotically eliminates collapse in the error-free regime. The paper validates this on CIFAR-10 across 8 recursive generations: with fixed t₀, FID converges at the rate predicted by κ, while a β-annealed schedule t₀/(1+i)^β (i being the generation index) eliminates the collapse, with larger β converging faster. 10
The collapse occurs even with unlimited real data injection as long as α < 1 and t₀ > 0. Higher α slows it but does not prevent it. 10
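The contrast between a fixed and an annealed truncation time shows up even in a one-line variance recursion. The sketch below assumes each generation's blur adds its truncation time t to the variance and that real data enters with weight α; this is a toy model rather than the paper's CIFAR-10 setup, and `excess_variance` is an illustrative name.

```python
def excess_variance(alpha, t0, generations, beta=None):
    """Excess variance over p_data after each generation.

    beta=None keeps the truncation time fixed at t0;
    otherwise generation i uses the annealed value t0 / (1 + i) ** beta.
    """
    excess, history = 0.0, []
    for i in range(generations):
        t = t0 if beta is None else t0 / (1 + i) ** beta
        excess = (1 - alpha) * excess + t    # mixing shrinks the old excess, the blur adds t
        history.append(excess)
    return history

fixed = excess_variance(alpha=0.5, t0=0.01, generations=200)
annealed_1 = excess_variance(alpha=0.5, t0=0.01, generations=200, beta=1.0)
annealed_2 = excess_variance(alpha=0.5, t0=0.01, generations=200, beta=2.0)
```

With fixed t₀ the excess settles at t₀/α and never leaves. With either annealed schedule it decays toward zero, faster for the larger β, which mirrors the qualitative claim of Theorem 4.1 in this simplified setting.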
Why read it: This is the most rigorous theoretical treatment of recursive training collapse to date. The closed-form limit distribution makes it testable — you can measure the spectral signature of your model and compare it against p∞⋆ to diagnose how far collapse has progressed. The annealed truncation schedule result is an actionable mitigation, not just a characterization. Richard E. Turner (Cambridge) has a long record in diffusion and Gaussian process theory, and this paper has the expected standard of rigor from that group.

Summary table

| Paper | arXiv | Institution | Code | Venue |
| --- | --- | --- | --- | --- |
| Avatar V | 2606.13872 | HeyGen Research | Project page | Preprint |
| CineOrchestra | 2606.13768 | Snap Research / UC Merced | GitHub | Preprint |
| HiLo-Token | 2606.13898 | Adobe | None (patent filed) | Preprint |
| DiPOD | 2606.13795 | UC Berkeley / Impossible Inc. / NVIDIA | GitHub | Preprint |
| Recursive DMs | 2606.13796 | University of Cambridge | None | Preprint |
Cover image: AI-generated abstract visualization of diffusion process trajectories.
