Ollama

Ollama: Smarter Sampling and Crash Prevention

Jeffrey Morgan merged two key improvements today - a substantial enhancement to the sampling system with repeat-based sampling capabilities, and a crucial fix preventing crashes in the Qwen3Next model's DeltaNet when using split offloading. The team also collaborated with community contributor Yossi Ovadia on the crash fix.

Duration: 3 min 49 s

https://podlog.io/listen/ollama-3aed006f/episode/ollama-smarter-sampling-and-crash-prevention-fcabbfd5

Transcript

Hey there, amazing developers! Welcome back to another episode of the Ollama podcast. I'm your host, and wow, do we have some exciting updates to dig into today, March 2nd, 2026. Grab your favorite beverage because we're talking about some really thoughtful improvements that are going to make your AI experiences smoother and more powerful.

Let's jump right into the big story of the day - Jeffrey Morgan just merged a fantastic enhancement to Ollama's sampling system. This is one of those changes that might sound technical at first, but it's actually pretty exciting when you think about what it means for your models.

So what's repeat-based sampling? Think of it like giving your AI a better memory about what it just said. You know how sometimes when you're talking, you might catch yourself repeating a word or phrase, and you naturally course-correct? That's essentially what this new sampling system does for language models. It looks at the token history - basically the recent tokens, or word pieces, the model has generated - and uses that information to make the tokens it has already used less likely to come up again.
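To make that concrete, here is a minimal sketch of one common way a repeat penalty works, in the style popularized by llama.cpp. It is not taken from the merged change: the function name `apply_repeat_penalty` is hypothetical, and the parameter names `repeat_last_n` and `repeat_penalty` are borrowed from Ollama's long-standing Modelfile options purely for illustration. Treating `repeat_last_n <= 0` as "no penalty" is also an assumption of this sketch.

```python
def apply_repeat_penalty(logits, history, repeat_last_n=64, repeat_penalty=1.1):
    """Return a copy of `logits` with recently generated tokens made less likely.

    logits         -- list of raw scores, indexed by token id
    history        -- list of token ids generated so far (oldest first)
    repeat_last_n  -- how many of the most recent tokens to look at
    repeat_penalty -- values above 1.0 discourage repeats; 1.0 is a no-op
    """
    adjusted = list(logits)

    # Assumption for this sketch: a non-positive window disables the penalty.
    if repeat_last_n <= 0:
        return adjusted

    recent = history[-repeat_last_n:]
    for token_id in set(recent):
        # Push the score toward "less likely" regardless of its sign:
        # positive scores shrink, negative scores grow more negative.
        if adjusted[token_id] > 0:
            adjusted[token_id] /= repeat_penalty
        else:
            adjusted[token_id] *= repeat_penalty
    return adjusted


# Tokens 0 and 2 were generated recently, so only their scores change.
scores = apply_repeat_penalty([2.0, 1.0, -1.0, 0.5], history=[0, 2, 0],
                              repeat_penalty=2.0)
print(scores)  # [1.0, 1.0, -2.0, 0.5]
```

The penalty is applied once per distinct token in the window, so a token that appears three times is not punished three times over; other schemes, such as frequency-based penalties, do scale with the count.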

The implementation here is really solid too. Jeffrey added 193 lines of new code across 8 files, touching everything from the core API types to the documentation, and even adding comprehensive tests. I love seeing changes that come with proper test coverage - that's exactly the kind of thoughtful development that makes a codebase reliable and maintainable.

What's particularly cool is that this isn't just a behind-the-scenes improvement. The team updated the modelfile documentation too, which means you'll actually be able to configure and experiment with these new sampling parameters. That's the kind of user-focused thinking that makes Ollama such a joy to work with.

Now, let's talk about the second major change - a crash fix for the Qwen3Next model. This one has a great collaboration story behind it. Community contributor Yossi Ovadia identified and started work on fixing a crash that was happening in DeltaNet during split offloading. Jeffrey carried that work forward, but here's where it gets interesting - he didn't just fix the crash, he also optimized the performance.

The original issue was causing crashes when a model was split across devices, which is a pretty critical feature for folks running larger models on systems with mixed GPU and CPU resources. But Jeffrey noticed that the fix could also address a performance bottleneck where long concat-chains, meaning long runs of back-to-back concatenation operations, were slowing down longer prompts. So he killed two birds with one stone - fixed the crash and made long prompts faster to process. That's the kind of thoughtful problem-solving that I absolutely love to see.
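The episode doesn't go into how the concat-chains were removed, so the following is only a generic illustration of why chained concatenation tends to get slower as the input grows. It uses plain Python lists rather than tensors, and the function names `concat_chain` and `preallocated` are hypothetical. Each function returns the combined result plus a count of how many elements it had to copy along the way.

```python
def concat_chain(chunks):
    """Build the result by concatenating one chunk at a time."""
    out = []
    copied = 0
    for chunk in chunks:
        # `out + chunk` allocates a new list and copies everything so far,
        # so the work per step grows with the length of the result.
        out = out + chunk
        copied += len(out)
    return out, copied


def preallocated(chunks):
    """Allocate the full result once, then write each chunk into place."""
    total = sum(len(chunk) for chunk in chunks)
    out = [0.0] * total
    offset = 0
    copied = 0
    for chunk in chunks:
        out[offset:offset + len(chunk)] = chunk  # each element copied once
        offset += len(chunk)
        copied += len(chunk)
    return out, copied


chunks = [[1.0] * 4 for _ in range(8)]  # 8 chunks of 4 elements each
chained, chained_copies = concat_chain(chunks)
direct, direct_copies = preallocated(chunks)

assert chained == direct
print(chained_copies, direct_copies)  # 144 32
```

With 8 chunks the chained version copies 144 elements against 32 for the preallocated one, and the gap widens roughly quadratically as the number of chunks grows, which is the general shape of slowdown you would expect on longer prompts.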

This is also a beautiful example of open source collaboration at its best. Yossi identified the problem and started the solution, Jeffrey refined and optimized it, and now everyone benefits. It's these kinds of community contributions that make projects like Ollama so robust and reliable.

Both of these changes represent something I find really inspiring about the Ollama project - they're not just adding features for the sake of features. The sampling improvement makes models smarter and more coherent. The crash fix makes the system more reliable while also improving performance. These are changes that directly impact your day-to-day experience with AI.

So here's today's focus: if you're working with Ollama models, especially if you're doing longer conversations or working with larger models that need offloading, these updates are going to make your life better. Take some time to explore those new sampling parameters in your modelfiles. And if you're running Qwen3Next models with split offloading, you can breathe a little easier knowing those crashes are behind you.

That's a wrap for today's episode! Keep building amazing things, keep collaborating, and remember - every commit is a step forward. We'll catch you tomorrow with more updates from the world of Ollama. Happy coding!