Ollama

Ollama: DFlash Speculative Decoding Rollback

Jesse Gross reverted the recently merged DFlash speculative decoding feature because its integration was too invasive, then reintroduced the useful components as separate, cleaner commits. The rollback removed over 1,600 lines of code, and three follow-up commits kept the core improvements.

Duration: 1 minute 42 seconds

https://podlog.io/listen/ollama-3aed006f/episode/ollama-dflash-speculative-decoding-rollback-7a993b32

Transcript

Good morning, this is your Ollama development update for May 23rd, 2026.

Jesse Gross merged a significant revert of the DFlash speculative decoding feature that was introduced in pull request 16134. The revert removes over 1,600 lines of code across 13 files in the MLX runner system. Gross cited the integration as "too invasive," noting that DFlash-specific logic had spread throughout the pipeline, base model interfaces, and cache layer, with model-specific code now embedded in the recurrent cache.

Following the revert, Gross immediately began reintroducing the valuable components as separate, cleaner commits. Three follow-up commits preserve the useful functionality: gated-delta recurrent state now operates in float32 precision for better numerical stability, draft model architecture detection now reads from config.json rather than being hardcoded, and YaRN RoPE helper functions have been moved to a shared location for broader reuse across models.
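To make the config.json point concrete, here is a minimal sketch of reading an architecture name from a model directory instead of hardcoding it. It is an illustration only, not the code from the commit: the function name detect_architecture is hypothetical, and the "architectures" and "model_type" keys are assumed from the common Hugging Face config.json layout.

```python
import json
import tempfile
from pathlib import Path


def detect_architecture(model_dir: str) -> str:
    """Return the architecture name declared in <model_dir>/config.json.

    Assumption: the file follows the Hugging Face convention of an
    "architectures" list, with a "model_type" string as a fallback.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    architectures = config.get("architectures") or []
    if architectures:
        return architectures[0]
    if "model_type" in config:
        return config["model_type"]
    raise ValueError("config.json declares no architecture")


# Demo with a throwaway config; "ExampleDraftModel" is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"architectures": ["ExampleDraftModel"]})
    )
    print(detect_architecture(tmp))  # prints ExampleDraftModel
```

Reading the name from the file means a new draft model needs only a correct config.json, with no change to the loader itself.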

The revert demonstrates careful technical stewardship: recognizing when a feature, while functional, couples system components too tightly. Extracting the beneficial parts and reimplementing them separately should leave the code better organized and easier to maintain.

What's next: Watch for additional commits that may reintroduce speculative decoding with a more modular design, and potential performance testing of the float32 gated-delta improvements.
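For a sense of why the precision of a recurrent state matters, here is a small sketch under stated assumptions. It is a simplified scalar recurrence, not the gated-delta rule or the MLX runner code, and the gate, update, and step values are arbitrary. The state is rounded to half or single precision after every step and compared with a double-precision reference.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def run_recurrence(fmt: str, gate: float, update: float, steps: int) -> float:
    """Iterate state = gate * state + update, storing the state in `fmt`."""
    state = 0.0
    for _ in range(steps):
        state = round_to(fmt, gate * state + update)
    return state


def run_reference(gate: float, update: float, steps: int) -> float:
    """Same recurrence with the state kept in double precision."""
    state = 0.0
    for _ in range(steps):
        state = gate * state + update
    return state


# Arbitrary illustrative values: a gate close to 1 and a small update.
GATE, UPDATE, STEPS = 0.9995, 0.01, 20000

ref = run_reference(GATE, UPDATE, STEPS)
err16 = abs(run_recurrence("e", GATE, UPDATE, STEPS) - ref)
err32 = abs(run_recurrence("f", GATE, UPDATE, STEPS) - ref)
print(f"float16 state error: {err16:.4f}")
print(f"float32 state error: {err32:.6f}")
```

In this toy setup the half-precision state stops growing once each step's net change falls below half the spacing between representable values, so it ends far from the reference, while the single-precision state stays close.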

That's your Ollama update. Back tomorrow with more development news.