Ollama

Ollama: Tokenizer Love and Better Model Support

Today we're diving into some fantastic tokenizer improvements that make Ollama even more versatile! Daniel Hiltgen delivered two key enhancements - adding SentencePiece-style BPE support for better model compatibility, and fixing a tokenizer configuration bug in the MLX pipeline. Plus, Parth Sareen updated the Pi integration docs to help more developers get started.

Duration: 3 minutes 52 seconds

https://podlog.io/listen/ollama-3aed006f/episode/ollama-tokenizer-love-and-better-model-support-f0c1f0f9

Transcript

Hey there, fellow developers! Welcome back to another episode of the Ollama podcast. I'm your host, and wow, do we have some exciting updates to chat about today. Grab your favorite beverage because we're diving into some really cool tokenizer improvements that are going to make your AI models work even better.

So picture this - you know how different AI models sometimes have their own quirky ways of handling text? Well, the Ollama team has been hard at work making sure we can support even more of these models seamlessly. And today's changes are a perfect example of that dedication to compatibility and correctness.

Let's start with the star of the show - Daniel Hiltgen just merged a fantastic enhancement that adds SentencePiece-style BPE support to our tokenizer. Now, if you're thinking "wait, what's that?" - don't worry, I've got you covered. Byte Pair Encoding, or BPE, is basically how we break down text into smaller pieces that AI models can understand. It starts from tiny units like single characters or bytes, then repeatedly merges the most frequent neighboring pair into a new, bigger piece. Think of it like teaching a computer to read by showing it which letter combinations turn up most often.
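To make that merging idea concrete, here is a minimal sketch of a single BPE training step in Python. It is a toy illustration, not Ollama's Go implementation: the corpus, its word frequencies, and the function names most_frequent_pair and merge_pair are all made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical): each word is a tuple of characters with a made-up count.
corpus = {
    tuple("low"): 5,
    tuple("lowest"): 2,
    tuple("newer"): 6,
    tuple("wider"): 3,
}

best = most_frequent_pair(corpus)      # the pair that occurs most often
corpus_after = merge_pair(corpus, best)  # corpus with that pair merged everywhere
```

A real tokenizer repeats this step many times during training and stores the ordered list of merges. At encoding time it replays those merges on new text.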

The cool thing about this update is that some models use a special Unicode character - U+2581, which looks like a low underscore (▁) - to represent spaces. It's like a secret code for spaces that certain models prefer. Daniel's implementation adds a new option called WithSentencePieceNormalizer that handles this conversion automatically. What I love about this approach is that it doesn't break existing code - if you're using NewBytePairEncoding, everything works exactly as before. But if you need that SentencePiece magic, you can use the new NewBytePairEncodingWithOptions constructor.

The attention to detail here is impressive too - it's not just about encoding the spaces differently, but also making sure the decoding process converts everything back correctly. And with 180 new lines of test code, there's good reason to trust that this feature behaves as intended.
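Here is a rough sketch of that encode-and-decode round trip in Python. Everything in it is an assumption made for illustration: ToyTokenizer, its sentencepiece_normalizer flag, and the greedy longest-match lookup are stand-ins, not the Go code behind WithSentencePieceNormalizer or NewBytePairEncodingWithOptions. The sketch shows only two things: the normalizer is opt-in, so the default behavior is unchanged, and decoding has to undo what encoding did.

```python
SPACE_MARKER = "\u2581"  # the "▁" character that stands in for a space


class ToyTokenizer:
    """Hypothetical tokenizer; greedy longest-match stands in for real BPE merges."""

    def __init__(self, vocab, sentencepiece_normalizer=False):
        # Longest pieces first so the greedy match prefers them.
        self.vocab = sorted(vocab, key=len, reverse=True)
        # Off by default, so existing callers see no change in behavior.
        self.sentencepiece_normalizer = sentencepiece_normalizer

    def encode(self, text):
        if self.sentencepiece_normalizer:
            text = text.replace(" ", SPACE_MARKER)
        pieces, i = [], 0
        while i < len(text):
            # Fall back to a single character when no vocabulary piece matches.
            piece = next((p for p in self.vocab if text.startswith(p, i)), text[i])
            pieces.append(piece)
            i += len(piece)
        return pieces

    def decode(self, pieces):
        text = "".join(pieces)
        if self.sentencepiece_normalizer:
            # Undo the normalization so the caller gets ordinary spaces back.
            text = text.replace(SPACE_MARKER, " ")
        return text


vocab = ["hello", SPACE_MARKER + "world"]  # made-up vocabulary

sp_tokenizer = ToyTokenizer(vocab, sentencepiece_normalizer=True)
sp_pieces = sp_tokenizer.encode("hello world")
sp_round_trip = sp_tokenizer.decode(sp_pieces)

plain_tokenizer = ToyTokenizer(vocab)  # default: no normalization
plain_pieces = plain_tokenizer.encode("hello world")
plain_round_trip = plain_tokenizer.decode(plain_pieces)
```

With the flag on, the space travels inside the second piece as the marker character. With it off, the vocabulary entry that begins with the marker never matches, which is the kind of mismatch the new option is meant to avoid.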

But wait, there's more! Daniel also fixed a subtle but important bug in the MLX pipeline. This one's a great example of why details matter in AI systems. The code was hardcoding a "true" value when deciding whether to add a special beginning-of-sequence token to prompts. But different models have different preferences here - some want that token, others don't.

The fix was beautifully simple - just one tiny change! Instead of always passing true, it now checks the tokenizer's actual configuration. Models like Gemma3 and Llama that want the BOS token still get it, while models like Qwen3 that don't need it are left alone. It's one of those changes that makes you go "oh, that's so much cleaner" when you see it.
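The shape of that fix can be sketched in a few lines of Python. This is not the MLX pipeline code. ToyTokenizerConfig, its add_bos field, the token IDs, and both function names are hypothetical and exist only to contrast the two behaviors.

```python
from dataclasses import dataclass


@dataclass
class ToyTokenizerConfig:
    """Hypothetical per-model tokenizer settings."""
    bos_id: int
    add_bos: bool  # whether this model expects a beginning-of-sequence token


def build_prompt_hardcoded(token_ids, config):
    """Before: behaves as if add_bos were always true, whatever the model wants."""
    return [config.bos_id, *token_ids]


def build_prompt_from_config(token_ids, config):
    """After: prepend the BOS token only when the tokenizer config asks for it."""
    if config.add_bos:
        return [config.bos_id, *token_ids]
    return list(token_ids)


wants_bos = ToyTokenizerConfig(bos_id=1, add_bos=True)   # made-up values
skips_bos = ToyTokenizerConfig(bos_id=1, add_bos=False)  # made-up values
prompt = [42, 43, 44]

old_for_skipper = build_prompt_hardcoded(prompt, skips_bos)
new_for_skipper = build_prompt_from_config(prompt, skips_bos)
new_for_wanter = build_prompt_from_config(prompt, wants_bos)
```

For a model that wants the token, both versions produce the same prompt. Only models that opt out see a difference, which is why a bug like this can go unnoticed for a while.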

And speaking of making things cleaner and more accessible, Parth Sareen updated the Pi integration documentation. Good documentation is like a welcome mat for new contributors and users, so kudos to Parth for keeping that side of things fresh and helpful.

What I find really encouraging about today's updates is how they reflect the collaborative nature of open source development. These aren't flashy new features that grab headlines, but they're the kind of thoughtful improvements that make a real difference when you're actually building something. Daniel's tokenizer work means more models will work out of the box, and that MLX fix prevents those subtle bugs that can be so frustrating to track down.

Today's Focus: If you're working with custom models or experimenting with different tokenization approaches, definitely check out that new SentencePiece support. And if you've been having any weird issues with BOS tokens in MLX pipelines, this update might just solve your problems. Don't forget to update your Pi integration setup if that's part of your workflow.

That's a wrap for today's episode! Keep building amazing things, and remember - sometimes the best improvements are the ones that just make everything work a little bit better. Catch you next time, and happy coding!