Ollama: MLX Performance Breakthrough and Smarter Caching
The Ollama team shipped a large MLX update that brings a 6.4x prefill speedup through new CUDA kernels, plus smarter caching logic for transformer models. Daniel Hiltgen led the MLX update, while Jesse Gross improved cache performance with better partial matching.
Duration: PT4M8S
Transcript
Hey there, code crafters! Welcome back to another episode of the Ollama podcast. I'm your host, and wow, do we have some exciting performance stories to dive into today. Grab your favorite beverage because we're talking about some seriously impressive speed improvements that are going to make your day.
So picture this - you're running a model and suddenly prompt processing is over six times faster. Not 6 percent, not 60 percent, but 6.4 times faster! That's exactly what happened with the massive MLX update that Daniel Hiltgen just merged. This isn't a small tweak - it's a complete refresh of the MLX integration, with 497 lines added and changes across 18 files.
Here's the story behind this update. Daniel pulled in the latest MLX changes from March 16th, but the real magic happened when they added the CUDA Fast Gated Delta kernel. When they tested it on a Qwen 3.5 model with an RTX 5090, the prefill speed jumped from 529 tokens per second to over 3,300 tokens per second. That's the kind of performance boost that makes you do a double-take at your terminal!
But Daniel didn't stop at the performance improvements. They also cleaned up some technical debt that had been lurking in the codebase. You know how it goes - sometimes when you're moving fast, little vendoring bugs creep in. This update caught and fixed those issues, plus renamed some version files to make everything clearer for future developers. It's that attention to detail that makes a codebase maintainable in the long run.
Now, while Daniel was supercharging the MLX performance, Jesse Gross was working on something equally clever but in a completely different area - caching. Jesse tackled a really nuanced problem with how Ollama handles partial cache matches in transformer models.
Here's what was happening before: when the system encountered a partial match in the cache, it would basically throw up its hands and truncate everything back to the parent snapshot. This meant that pure transformer caches weren't taking advantage of their ability to rewind to arbitrary boundaries. Jesse's fix restores that capability, which translates to better cache hit rates and more efficient memory usage.
I love this kind of optimization because it shows deep understanding of the underlying architecture. Transformer models have this beautiful property where they can rewind to any point, but the previous caching logic wasn't leveraging that. Now it does, and that means better performance for everyone using transformer-based models.
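To make the difference concrete, here is a minimal Go sketch of the idea, not Ollama's actual cache code. Every name in it (`commonPrefixLen`, `reusableTokens`, the `snapshots` slice, the `canRewindAnywhere` flag) is hypothetical and chosen only for illustration. It assumes a cache that stores a token sequence plus a few snapshot positions, and compares how many cached tokens survive a partial match when the cache can only fall back to a snapshot versus when it can rewind to the exact match point.

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens the two sequences share.
func commonPrefixLen(cached, prompt []int) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

// reusableTokens reports how many cached tokens can be kept for a new prompt.
// Hypothetical model: when canRewindAnywhere is true (a pure transformer
// cache), the cache keeps everything up to the match point. Otherwise it
// falls back to the latest snapshot at or before the match point.
// snapshots must be sorted in ascending order.
func reusableTokens(cached, prompt, snapshots []int, canRewindAnywhere bool) int {
	match := commonPrefixLen(cached, prompt)
	if canRewindAnywhere {
		return match
	}
	keep := 0
	for _, s := range snapshots {
		if s <= match {
			keep = s
		}
	}
	return keep
}

func main() {
	cached := []int{1, 2, 3, 4, 5, 6, 7, 8}
	prompt := []int{1, 2, 3, 4, 5, 6, 9, 9} // diverges after 6 tokens
	snapshots := []int{0, 4, 8}             // illustrative snapshot positions

	fmt.Println("snapshot fallback keeps:", reusableTokens(cached, prompt, snapshots, false))
	fmt.Println("arbitrary rewind keeps: ", reusableTokens(cached, prompt, snapshots, true))
}
```

In this toy setup the snapshot fallback keeps 4 tokens while the arbitrary rewind keeps all 6 matching tokens, so two fewer tokens need to be recomputed.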
Jesse also added a couple of smaller but important improvements: stricter error handling, with a panic on double unpin operations, and better debugging, with cache dump trees now showing the time since last use. These might seem like small details, but they're the kind of developer experience improvements that save hours of debugging time.
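As a rough illustration of why panicking on a double unpin helps, here is a small Go sketch. It is an assumption about the general pattern, not the code from Jesse's change: the `entry` type, its `pins` counter, and the `didPanic` helper are all hypothetical. The point is that an unbalanced unpin fails loudly at the call site instead of silently corrupting a reference count.

```go
package main

import "fmt"

// entry is a hypothetical cache entry that tracks how many users have pinned it.
type entry struct {
	pins int
}

// pin marks the entry as in use by one more caller.
func (e *entry) pin() { e.pins++ }

// unpin releases one pin. Releasing an entry that is not pinned is a
// programming error, so it panics rather than letting the count go negative.
func (e *entry) unpin() {
	if e.pins == 0 {
		panic("cache: unpin called on an entry that is not pinned")
	}
	e.pins--
}

// didPanic runs f and reports whether it panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	e := &entry{}
	e.pin()
	e.unpin() // balanced: fine
	fmt.Println("second unpin panicked:", didPanic(e.unpin))
}
```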
Oh, and we can't forget Parth Sareen's contribution to the documentation - updating the Claude Code integration guide with Telegram support. Sometimes the most valuable contributions are the ones that help people actually use all these performance improvements we're building.
Today's Focus: If you're running MLX-based models, especially on CUDA hardware, this is the perfect time to pull the latest changes and see those performance gains firsthand. Take a moment to benchmark your common workloads before and after the update - I'd love to hear what kind of speedups you're seeing in your specific use cases.
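If you want a starting point for that before-and-after comparison, here is a hedged Go sketch. It assumes you have saved the JSON response from Ollama's generate endpoint for each run and that it carries the `prompt_eval_count` and `prompt_eval_duration` fields (the duration in nanoseconds). The `generateStats` type and `prefillTokensPerSecond` function are names made up for this example, and the two JSON payloads are illustrative values, not measurements.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateStats holds the two response fields needed for prefill throughput.
type generateStats struct {
	PromptEvalCount    int   `json:"prompt_eval_count"`
	PromptEvalDuration int64 `json:"prompt_eval_duration"` // nanoseconds
}

// prefillTokensPerSecond computes prompt tokens processed per second
// from a saved response body.
func prefillTokensPerSecond(body []byte) (float64, error) {
	var s generateStats
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, err
	}
	if s.PromptEvalDuration <= 0 {
		return 0, fmt.Errorf("prompt_eval_duration missing or zero")
	}
	seconds := float64(s.PromptEvalDuration) / 1e9
	return float64(s.PromptEvalCount) / seconds, nil
}

func main() {
	// Illustrative payloads only; substitute the responses from your own runs.
	before := []byte(`{"prompt_eval_count": 1000, "prompt_eval_duration": 2000000000}`)
	after := []byte(`{"prompt_eval_count": 1000, "prompt_eval_duration": 500000000}`)

	b, err := prefillTokensPerSecond(before)
	if err != nil {
		panic(err)
	}
	a, err := prefillTokensPerSecond(after)
	if err != nil {
		panic(err)
	}
	fmt.Printf("before: %.0f tok/s, after: %.0f tok/s, speedup: %.1fx\n", b, a, a/b)
}
```

Use the same prompt and model for both runs, and repeat each a few times, since a cache hit on the second run would otherwise inflate the apparent speedup.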
For those of you working on caching systems, Jesse's approach here is worth studying. The key insight is understanding the specific capabilities of your underlying architecture and making sure your optimization layer doesn't accidentally constrain those capabilities.
That's a wrap for today's episode! Remember, every performance improvement starts with someone noticing that things could be better and then doing the work to make it happen. Keep building, keep optimizing, and I'll catch you tomorrow with more stories from the world of Ollama development. Until then, happy coding!