Ollama: Speed Revolution - MTP Decoding and Smart Caching

Three major performance-focused PRs landed today, headlined by Patrick Devine's massive MTP speculative decoding implementation for Gemma4 models and Parth Sareen's server-side caching system for model metadata. Daniel Hiltgen also continued the ongoing codebase cleanup with imagegen layout improvements.

2026-05-06T10:00:58Z

Duration: PT4M15S

Episode overview

This episode is a short developer briefing from Ollama.

It explains recent repository work in plain language.

Show: Ollama

Transcript excerpt

The transcript below is an excerpt, so some passages end mid-sentence. Listen to the episode or use the RSS feed for the full update.

Hey there, fellow developers! Welcome back to another episode of the Ollama podcast. I'm your host, and wow, do we have an exciting day to talk about! If you're working with AI models and care about performance - and let's be honest, who doesn't - today's changes are going to make you smile.

We've got three fantastic pull requests that just merged, and the theme is crystal clear: speed, efficiency, and making your development experience smoother. So grab your favorite beverage and let's dive right in!

Our headline story today comes from Patrick Devine with what I can only describe as a performance game-changer. Patrick just merged PR 15980, which brings MTP - that's multi-token prediction - speculative decoding to Gemma4 models. Now, if you're not familiar with speculative decoding, think of it like this: instead…

But here's what makes Patrick's implementation so thorough - it's not just the core algorithm. He's added support for importing safetensors-based draft models with ollama create, introduced a new DRAFT command in Modelfiles, and even included a quantize-draft flag. Plus, there's proper cache support and changes to…
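For readers who want to see the general idea behind speculative decoding, here is a minimal sketch in Python. It is not Ollama's implementation and does not model MTP or Gemma4 specifically. It assumes greedy decoding, and every name in it (`speculative_decode`, `toy_target`, `toy_draft`, the parameter `k`) is hypothetical and chosen only for illustration. A cheap draft function guesses a few tokens ahead, the expensive target function checks them, and the longest matching prefix is kept along with one corrected token.

```python
from typing import Callable, Dict, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next greedy token


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], Dict[str, int]]:
    """Greedy speculative decoding sketch (illustrative, not Ollama's code)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    stats = {"target_passes": 0, "proposed": 0, "accepted": 0}

    while len(out) < limit:
        # 1. The draft proposes up to k tokens, one after another.
        n = min(k, limit - len(out))
        proposal: List[Token] = []
        for _ in range(n):
            proposal.append(draft(out + proposal))
        stats["proposed"] += n

        # 2. The target verifies the proposal. A real engine would score all
        #    positions in one batched forward pass; this toy version counts
        #    the whole check as a single pass.
        stats["target_passes"] += 1
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the matching prefix, then the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                stats["accepted"] += i
                break
        else:
            # Every drafted token matched the target.
            out.extend(proposal)
            stats["accepted"] += n

    return out[len(prompt):], stats


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the draft: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, stats = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=8)
    print(tokens, stats)
```

Under greedy decoding the result is identical to what the target would produce alone. Any speedup comes from the target confirming several tokens per pass whenever the draft guesses well, so the gain depends on how often the draft agrees with the target.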

Speaking of performance improvements, Parth Sareen…

Here's…

Nearby episodes from Ollama

MLX Threading and Claude Image Fixes 2026-05-12T10:00:54Z
Model Transfer Optimization and Test Reliability 2026-05-09T10:00:51Z
Claude Desktop Integration Removed 2026-05-08T10:00:32Z
Launch Command Enhancements 2026-05-07T10:00:52Z
Go 1.26 Runtime Update 2026-05-04T00:00:00Z
Weekly Recap - MLX Threading & Model Recommendations 2026-05-04T00:00:00Z
MLX Threading Fixes and Claude App Integration 2026-05-03T00:00:00Z
Model Recommendations and Windows Gateway Fix 2026-05-01T00:00:00Z