Ollama

Ollama: Smart Caching and Better User Experience

Today brings exciting performance improvements with smart caching snapshots for long prompts, plus thoughtful user experience enhancements. The team focused on making Ollama more reliable for heavy workloads while polishing the developer experience with better VS Code integration and helpful context length warnings.

Duration: PT4M7S

https://podlog.io/listen/ollama-3aed006f/episode/ollama-smart-caching-and-better-user-experience-ee09054d

Transcript

Hey there, fellow developers! Welcome back to another episode of the Ollama podcast. I'm so excited to chat with you today about what's been happening in our favorite local AI toolkit. Grab your coffee because we've got some really cool updates to dive into!

So yesterday and today have been absolutely buzzing with activity - we're talking seven merged pull requests and a bunch of additional commits that are really moving the needle on performance and user experience. The story today is all about making Ollama smarter and more user-friendly, and I think you're going to love what the team has been cooking up.

Let's start with the star of the show - Jesse Gross has been working on some seriously impressive caching improvements. The big one is this new periodic snapshot feature for the MLX runner. Here's the thing - if you've ever worked with really long prompts, you know the pain of having to reprocess everything from scratch when something goes wrong. Well, Jesse's solution is elegant: the system now takes snapshots every 8,000 tokens during prefill, plus one near the end of your prompt. This means if you need to retry generation or continue thinking, you're not starting from zero. It's like having save points in a video game, but for your AI processing! The implementation touches the cache system, pipeline, and includes comprehensive tests - exactly the kind of thorough work that makes me excited about where this project is heading.
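
To make the save-point idea concrete, here's a minimal sketch of where those snapshots might land during prefill. This is not the MLX runner's code: the function name and the `tail_margin` parameter are made up for illustration, and the episode only says the last snapshot is taken "near the end" of the prompt, so the exact offset is an assumption.

```python
def snapshot_positions(prompt_len: int, interval: int = 8000, tail_margin: int = 1) -> list[int]:
    """Return the token offsets at which a prefill snapshot would be taken.

    interval    -- periodic spacing (8,000 tokens, per the episode)
    tail_margin -- hypothetical distance from the end for the final snapshot
    """
    if prompt_len <= 0:
        return []
    # Periodic snapshots strictly inside the prompt.
    positions = list(range(interval, prompt_len, interval))
    # One extra snapshot near the end, unless a periodic one already covers it.
    tail = max(prompt_len - tail_margin, 1)
    if not positions or positions[-1] < tail:
        positions.append(tail)
    return positions


print(snapshot_positions(20000))  # [8000, 16000, 19999]
```

With save points like these, a retry on a 20,000-token prompt could resume from the nearest snapshot instead of reprocessing the whole prompt.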

But Jesse wasn't done there! There are also some really smart improvements to the eviction and LRU tracking system. Instead of marking every snapshot along a path as recently used, it now only updates the ones actually used during processing. That gives the cache a more accurate picture when it decides what to keep and what to toss out. It's one of those changes that sounds simple but represents really deep thinking about performance optimization.
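
Here's a rough sketch of that LRU idea, assuming each snapshot carries a "last used" stamp and eviction picks the oldest one. The `Snapshot` class, `touch_used`, and `pick_eviction_victim` are hypothetical names for illustration, not Ollama's actual data structures.

```python
from dataclasses import dataclass
from itertools import count

_clock = count(1)  # logical clock: each call to next() yields a later "time"


@dataclass
class Snapshot:
    name: str
    last_used: int = 0  # 0 means "never used"


def touch_used(path: list[Snapshot], used: set[str]) -> None:
    """Refresh recency only for snapshots that were actually used.

    Other snapshots on the same path keep their old stamp, so they
    remain candidates for eviction.
    """
    now = next(_clock)
    for snap in path:
        if snap.name in used:
            snap.last_used = now


def pick_eviction_victim(snapshots: list[Snapshot]) -> Snapshot:
    """Return the least recently used snapshot."""
    return min(snapshots, key=lambda s: s.last_used)


path = [Snapshot("a"), Snapshot("b"), Snapshot("c")]
touch_used(path, used={"c"})            # only "c" was restored from
print(pick_eviction_victim(path).name)  # a
```

If every snapshot on the path were stamped instead, "a" and "b" would look just as fresh as "c", and the cache would lose the signal it needs to evict the right one.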

Now, Eva has been absolutely crushing it on the user experience front. There's this fantastic new warning system that kicks in when your server context length is below 64K tokens for local models. The system now exposes context length info through the API status endpoint and gives you a heads-up during model selection if you might run into limitations. As someone who's definitely been caught off guard by context limits before, this kind of proactive guidance is exactly what we need.
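
The check itself can be pictured with a small sketch like this one. The function name and message text are invented for illustration, and treating "64K" as 64 × 1024 tokens is an assumption; the episode doesn't say whether the threshold is 64,000 or 65,536.

```python
from typing import Optional

# Assumption: "64K" read as 64 * 1024 tokens.
MIN_RECOMMENDED_CONTEXT = 64 * 1024


def context_warning(context_length: int, is_local_model: bool) -> Optional[str]:
    """Return a warning for local models running below the threshold, else None."""
    if is_local_model and context_length < MIN_RECOMMENDED_CONTEXT:
        return (
            f"Server context length is {context_length} tokens, "
            f"below the recommended {MIN_RECOMMENDED_CONTEXT}."
        )
    return None


print(context_warning(4096, is_local_model=True))
```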

Eva also tackled a sneaky VS Code integration issue that I bet has bitten some of you. The launcher was checking for "code" on your PATH first, which could accidentally launch Cursor or another VS Code fork instead of actual VS Code. Now it checks platform-specific paths first and validates that you're actually getting VS Code. It's a small change but shows real attention to the details that matter in daily development workflows.
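
The lookup order described here can be sketched as follows. Everything is injected so the logic stays visible: `platform_candidates`, `exists`, and `is_real_vscode` are hypothetical stand-ins, since the episode doesn't say which paths the launcher checks or how it validates the binary. Falling back to `code` on the PATH is also an assumption.

```python
import shutil
from typing import Callable, Iterable, Optional


def resolve_vscode(
    platform_candidates: Iterable[str],
    exists: Callable[[str], bool],
    is_real_vscode: Callable[[str], bool],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Prefer known install locations, then fall back to `code` on PATH.

    Either way the binary must pass validation, so a fork that happens
    to install a `code` command is not launched by mistake.
    """
    # 1. Platform-specific install paths come first.
    for path in platform_candidates:
        if exists(path) and is_real_vscode(path):
            return path
    # 2. Assumed fallback: whatever `code` resolves to on PATH, if it validates.
    fallback = which("code")
    if fallback and is_real_vscode(fallback):
        return fallback
    return None
```

Passing the checks in as functions is only a convenience for this sketch; it lets the ordering be exercised with fake paths instead of a real installation.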

Speaking of VS Code, there's been some strategic thinking happening around integrations. Parth has temporarily hidden both the VS Code and Cline integrations from the main interface while keeping them accessible through direct commands. Sometimes the best user experience means being selective about what you surface, and I appreciate this thoughtful approach to feature rollout.

On the polish side, Parth also updated the chat title in the TUI, and Eva put significant work into updating the VS Code documentation with fresh screenshots and better guidance. These kinds of improvements might seem small, but they're the difference between software that works and software that feels great to use.

Daniel also squeezed in some important CI improvements, making sure MLX JIT headers are properly included on Linux builds - the kind of infrastructure work that keeps everything running smoothly behind the scenes.

Today's focus should definitely be on testing these caching improvements if you're working with long prompts or complex conversations. The snapshot system could be a real game-changer for your workflows, especially if you're doing iterative development or working with large contexts.

That's a wrap for today! The Ollama team continues to impress with this balance of performance innovation and user experience refinement. Keep building amazing things, and I'll catch you in the next episode with more exciting updates from the world of local AI development!