Ollama

Ollama: Simplifying the Sampling Story

Patrick Devine merged a significant refactor that streamlines how Ollama's MLX runner handles text generation sampling. The change replaces a complex chain of sampling interfaces with a single, stateful sampler that's much easier to work with and maintain.

Duration: 4 minutes 5 seconds

https://podlog.io/listen/ollama-3aed006f/episode/ollama-simplifying-the-sampling-story-9a7282a1

Transcript

Hey everyone, and welcome back to another episode of the Ollama podcast! I'm your host, and wow, do we have an interesting story about code simplification today. You know those moments when someone looks at a complex system and says "there's got to be a better way"? Well, that's exactly what happened yesterday, and the result is pretty elegant.

Let's dive right into our main story. Patrick Devine just merged a fantastic refactor that tackles something called the sampling system in our MLX runner. Now, if you're not familiar with sampling in AI models, think of it like this - when an AI generates text, it doesn't just pick the most obvious next word every time. Instead, it uses various techniques to add creativity and randomness, like considering the top few probable words, or avoiding repetitive phrases. It's what makes AI text feel more natural and less robotic.

The old system used what's called a "chain of interfaces" - imagine having separate little workers, each handling one aspect of sampling. One worker handled TopP sampling, another handled TopK, another managed penalties for repetition, and so on. While this worked, it created a lot of complexity. You had to coordinate all these separate pieces, pass data between them, and keep track of state across multiple objects.

Patrick's solution is beautifully simple. Instead of all these separate interfaces, he collapsed everything into a single, stateful Sampler struct. Think of it like replacing a relay team with one really capable runner who can handle the whole race. This new sampler holds both the sampling options and the history of what's been generated so far, all in one place.

The technical changes span seven files and add over 250 lines while removing about 50 - which tells us this isn't just moving code around, but actually implementing a more robust solution. What I love about this approach is that it switches from interface-based sampling to function-based transforms. Instead of having objects that talk to each other, you now have functions that transform the data as it flows through. It's cleaner, easier to test, and much easier to reason about.
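To make "function-based transforms" concrete, here is a minimal sketch in Go. It is not taken from the Ollama source: the `Transform` type, the `Temperature` and `Bias` helpers, and the use of plain `[]float32` slices are all assumptions made for illustration, and the real MLX runner presumably works on MLX arrays rather than Go slices. The point is only the shape of the design: each step is a function from logits to logits, and a pipeline is those functions applied in order.

```go
package main

import "fmt"

// Transform is a hypothetical function type: it takes logits and returns
// adjusted logits. It is an illustration, not Ollama's actual type.
type Transform func(logits []float32) []float32

// Temperature returns a transform that divides every logit by t.
func Temperature(t float32) Transform {
	return func(logits []float32) []float32 {
		out := make([]float32, len(logits))
		for i, v := range logits {
			out[i] = v / t
		}
		return out
	}
}

// Bias returns a transform that adds delta to a single token's logit.
// Out-of-range token ids are ignored.
func Bias(token int, delta float32) Transform {
	return func(logits []float32) []float32 {
		out := append([]float32(nil), logits...)
		if token >= 0 && token < len(out) {
			out[token] += delta
		}
		return out
	}
}

// Apply runs the transforms in order, feeding each result into the next.
func Apply(logits []float32, transforms ...Transform) []float32 {
	for _, t := range transforms {
		logits = t(logits)
	}
	return logits
}

func main() {
	logits := []float32{2, 4, 6}
	out := Apply(logits, Temperature(2), Bias(0, 1))
	fmt.Println(out) // prints [2 2 3]
}
```

Because each transform is an ordinary function, it can be tested on its own with a small input slice and no surrounding objects.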

The new system implements several key sampling techniques: top_p sampling, which keeps only the most probable tokens up to a cumulative probability threshold; repeat_penalty, which discourages tokens that have already appeared; and frequency_penalty, which penalizes a token more the more often it has already been used. Plus, there's mention of min_p sampling, a newer technique that's gaining popularity in the AI community and that drops tokens whose probability is too small relative to the most likely token.
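For readers who want to see what these techniques do to the numbers, the sketch below implements common textbook versions of them in Go. Everything here is an assumption for illustration: the function names are hypothetical, the repeat penalty follows the widely used convention of dividing positive logits and multiplying negative ones, and the frequency penalty subtracts the penalty once per prior occurrence. The actual formulas in Ollama's MLX runner may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// TopP returns the ids of the smallest set of most probable tokens whose
// cumulative probability reaches p, most probable first.
func TopP(probs []float64, p float64) []int {
	ids := make([]int, len(probs))
	for i := range ids {
		ids[i] = i
	}
	sort.SliceStable(ids, func(a, b int) bool { return probs[ids[a]] > probs[ids[b]] })
	var cum float64
	for n, id := range ids {
		cum += probs[id]
		if cum >= p {
			return ids[:n+1]
		}
	}
	return ids
}

// MinP returns the ids of tokens whose probability is at least minP times
// the probability of the most likely token.
func MinP(probs []float64, minP float64) []int {
	var peak float64
	for _, v := range probs {
		if v > peak {
			peak = v
		}
	}
	var keep []int
	for i, v := range probs {
		if v >= minP*peak {
			keep = append(keep, i)
		}
	}
	return keep
}

// Penalize lowers the logits of tokens found in history. The repeat penalty
// is applied once per distinct token; the frequency penalty scales with how
// many times the token has appeared.
func Penalize(logits []float64, history []int, repeat, frequency float64) []float64 {
	counts := map[int]int{}
	for _, t := range history {
		if t >= 0 && t < len(logits) {
			counts[t]++
		}
	}
	out := append([]float64(nil), logits...)
	for t, c := range counts {
		if out[t] > 0 {
			out[t] /= repeat
		} else {
			out[t] *= repeat
		}
		out[t] -= frequency * float64(c)
	}
	return out
}

func main() {
	probs := []float64{0.5, 0.3, 0.15, 0.05}
	fmt.Println(TopP(probs, 0.75)) // prints [0 1]
	fmt.Println(MinP(probs, 0.2))  // prints [0 1 2]
	fmt.Println(Penalize([]float64{2, -1, 3}, []int{0, 0, 1}, 2, 0.5)) // prints [0 -2.5 3]
}
```

In the example, token 0 appeared twice in the history, so its logit is halved by the repeat penalty and then reduced by 0.5 twice, while token 2 never appeared and is left alone.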

One thing that really stands out is that Patrick added comprehensive tests - there's a whole new test file with 62 lines of testing code. This kind of attention to testing during a refactor shows real craftsmanship. When you're changing how a core system works, having solid tests gives everyone confidence that the new approach actually works as intended.

The pipeline wiring got updated too, which means this change touches the entire flow of how requests move through the system. The sampler now gets seeded with prompt tokens at the beginning and accumulates generated tokens as it goes. It's like giving the sampler a memory of the entire conversation, so that history-dependent settings like the repetition penalties can take into account both the prompt and everything generated so far.
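The seeding and accumulation described here can be sketched as a small stateful type. This is a simplified illustration under stated assumptions, not the struct from the merged change: the `Sampler` fields, `NewSampler`, and `Sample` are hypothetical names, the only option modeled is a repeat penalty, and token selection is greedy rather than random so the behavior is easy to follow.

```go
package main

import "fmt"

// Sampler is a hypothetical stateful sampler: it holds one option
// (a repeat penalty) together with the token history.
type Sampler struct {
	repeatPenalty float32
	history       []int
}

// NewSampler seeds the history with a copy of the prompt tokens.
func NewSampler(repeatPenalty float32, prompt []int) *Sampler {
	return &Sampler{
		repeatPenalty: repeatPenalty,
		history:       append([]int(nil), prompt...),
	}
}

// Sample penalizes tokens already in the history, picks the highest
// remaining logit (greedy), records the choice, and returns it.
// It returns -1 and records nothing if logits is empty.
func (s *Sampler) Sample(logits []float32) int {
	if len(logits) == 0 {
		return -1
	}
	seen := make(map[int]bool, len(s.history))
	for _, t := range s.history {
		seen[t] = true
	}
	best, bestScore := -1, float32(0)
	for t, v := range logits {
		if seen[t] {
			if v > 0 {
				v /= s.repeatPenalty
			} else {
				v *= s.repeatPenalty
			}
		}
		if best == -1 || v > bestScore {
			best, bestScore = t, v
		}
	}
	s.history = append(s.history, best)
	return best
}

func main() {
	// Token 0 is in the prompt, so it starts out penalized.
	s := NewSampler(2.0, []int{0})
	logits := []float32{3, 2, 1}
	for i := 0; i < 3; i++ {
		fmt.Println(s.Sample(logits))
	}
	// prints 1, then 0, then 0 (one per line)
}
```

With the same logits on every step, the first pick is token 1 because token 0 is penalized by the prompt. Once token 1 is also in the history, both are penalized and token 0 wins again, which shows how the accumulated history changes the outcome from step to step.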

What I find encouraging about this change is that it's the kind of refactor that makes future development easier. When you have a simpler, more cohesive design, adding new sampling techniques or debugging issues becomes much more straightforward. It's an investment in the codebase's future.

For today's focus, if you're working on any system that feels overly complex or has too many moving pieces, take inspiration from this refactor. Sometimes the best solution isn't adding more interfaces or abstractions - sometimes it's consolidating and simplifying. Look for opportunities to reduce the number of objects that need to coordinate with each other.

That wraps up today's episode! A big thanks to Patrick for this thoughtful refactor that's going to make everyone's life easier. Keep coding, keep learning, and I'll catch you tomorrow with more stories from the Ollama codebase. Until then, happy developing!