Ollama

Ollama: MLX Runner Gets Rock Solid

Jesse Gross delivered a comprehensive overhaul of the MLX runner with two major pull requests and supporting commits focused on memory management and reliability. The changes include proper memory reporting through `ollama ps`, context limit enforcement similar to cloud services, and critical panic fixes that make the MLX runner much more stable for production use.

Duration: PT4M7S

https://podlog.io/listen/ollama-3aed006f/episode/ollama-mlx-runner-gets-rock-solid-2bf5ddcb

Transcript

Hey there, fellow developers! Welcome back to another episode of the Ollama podcast. I'm your host, and wow, do we have some exciting updates to dive into today. Grab your favorite beverage because we're talking about some seriously solid improvements that are going to make your MLX experience so much better.

So February 28th was quite the day in the Ollama repository, and I have to give a huge shoutout to Jesse Gross who has been absolutely crushing it with MLX runner improvements. We've got two merged pull requests and a handful of supporting commits that tell a really compelling story about taking software from "works most of the time" to "rock solid reliable."

Let's start with the big story here. The first major pull request is titled "MLX runner memory fixes" and, friends, this is exactly the kind of work that makes me excited about software engineering. You know how frustrating it can be when you're running models and you have no idea what's actually happening with your memory? Well, those days are over.

Jesse tackled three core problems that were making the MLX runner feel a bit unpredictable. First up, memory reporting. Before this change, when you ran `ollama ps` to check what was going on, you'd get static estimates that only included model weights. But here's the thing: that's not the whole story! Your actual memory usage also includes the KV cache and the compute graph, which can be significant. Now you get real, live memory usage reporting. It's like finally getting an accurate fuel gauge in your car instead of just guessing.
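To make the gap between the old and new numbers concrete, here is a minimal sketch of the accounting described above. It is not the runner's actual code: the `MemoryUsage` class, its field names, and the byte counts are all hypothetical, picked only to show why a weights-only estimate under-reports.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class MemoryUsage:
    """Hypothetical breakdown of a loaded model's memory, in bytes."""
    weights: int        # model parameters
    kv_cache: int       # grows with the number of cached tokens
    compute_graph: int  # scratch buffers used while evaluating

    def static_estimate(self) -> int:
        # Old-style figure from the episode: weights only.
        return self.weights

    def live_total(self) -> int:
        # New-style figure from the episode: everything actually in use.
        return self.weights + self.kv_cache + self.compute_graph


# Illustrative numbers only, not measurements from any real model.
usage = MemoryUsage(weights=4 * GIB, kv_cache=1 * GIB, compute_graph=GIB // 2)
underreported = usage.live_total() - usage.static_estimate()
print(f"static estimate: {usage.static_estimate() / GIB:.1f} GiB")
print(f"live total:      {usage.live_total() / GIB:.1f} GiB")
print(f"under-reported:  {underreported / GIB:.1f} GiB")
```

With these made-up figures, the weights-only estimate is 1.5 GiB short of what is really in use, which is exactly the kind of surprise live reporting is meant to remove.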

The second big improvement is context limit enforcement. This one's really smart because instead of just letting the cache grow forever until your system runs out of memory, it now behaves like most cloud services you're probably familiar with. Long prompts that exceed the model's trained context length will give you a clear error instead of mysterious failures. And if your generation exceeds the context during processing, it stops gracefully. It's that kind of predictable behavior that makes the difference between a tool you can rely on and one that keeps you guessing.
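A rough sketch of that behavior might look like the following. Everything named here is an assumption made for illustration: `ContextLengthError`, `generate`, the `next_token` callback, and the `"stop"` / `"length"` finish labels are not taken from the Ollama codebase.

```python
class ContextLengthError(ValueError):
    """Hypothetical error for a prompt that does not fit in the context."""


def generate(prompt_tokens, context_length, next_token):
    """Generate tokens without ever letting the sequence exceed the context.

    next_token is a hypothetical callback: it receives the tokens so far
    and returns the next token, or None when the model is finished.
    Returns (generated_tokens, finish_reason).
    """
    # Rule 1: an over-long prompt is rejected up front with a clear error.
    if len(prompt_tokens) > context_length:
        raise ContextLengthError(
            f"prompt has {len(prompt_tokens)} tokens, "
            f"context length is {context_length}"
        )

    tokens = list(prompt_tokens)
    generated = []
    # Rule 2: generation stops cleanly once the context is full.
    while len(tokens) < context_length:
        tok = next_token(tokens)
        if tok is None:
            return generated, "stop"
        tokens.append(tok)
        generated.append(tok)
    return generated, "length"
```

For example, `generate([1, 2], 5, lambda toks: 7)` returns `([7, 7, 7], "length")`: three tokens fit after the two-token prompt, and then generation ends instead of growing the cache further.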

But here's my favorite fix, and this is where the second pull request comes in: they solved a nasty panic that happened when a prompt was a complete KV cache hit. Picture this: you send the exact same prompt twice, the system thinks "oh great, I have all of this cached already!" and then... crashes with an index-out-of-range error. Not exactly the performance boost you were hoping for, right? Jesse tracked this down to the pipeline trying to access an empty slice when the entire prompt was cached. The fix is elegant: always keep at least one token to re-evaluate so the pipeline can properly seed token generation.
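The idea behind that fix can be sketched in a few lines. This is only an illustration of the "keep one token back" rule, assuming a simple prefix match against cached tokens; `split_prompt` and its arguments are hypothetical names, not the runner's real functions.

```python
def split_prompt(prompt, cached):
    """Split a prompt into a reused-prefix length and tokens to evaluate.

    prompt and cached are lists of token ids. Returns
    (reused_count, tokens_to_evaluate).
    """
    # Count how many leading tokens match what is already cached.
    matched = 0
    for p, c in zip(prompt, cached):
        if p != c:
            break
        matched += 1

    # Full cache hit: back off by one token so the list of tokens to
    # evaluate is never empty and generation always has something to
    # seed it.
    if prompt and matched == len(prompt):
        matched -= 1

    return matched, prompt[matched:]
```

Sending the same three-token prompt twice gives `split_prompt([1, 2, 3], [1, 2, 3]) == (2, [3])`: two tokens are served from the cache and the last one is re-evaluated, so the code that follows never indexes into an empty list.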

What I love about this whole series of changes is how they also improved error handling and reporting. Instead of errors just disappearing into log files where you might never see them, they're now properly sent back to the client. Plus, they fixed the timing metrics so you get accurate information about prompt processing speeds.

Looking at all these commits together, this feels like one of those moments where software really matures. We're talking about 196 lines added, 153 removed, across 16 files. That's not just throwing code at a problem; that's thoughtful refactoring and improvement.

Today's focus is really about stability and observability. If you're using the MLX runner, these changes mean you can trust it more, understand what it's doing better, and get clearer feedback when things don't go as expected. And if you're contributing to any project, this is a masterclass in how to approach reliability improvements: tackle memory management, fix edge cases, and improve error reporting, all as part of one cohesive effort.

That's a wrap for today's episode! The MLX runner just got a whole lot more robust, and I think you're going to love these improvements. Keep building amazing things, and I'll catch you in the next episode. Happy coding!