Memory Magic and Command Makeover
Today brought some serious memory optimization wizardry with MLA absorption for GLM models - though it took a few tries to get the CUDA builds just right! Plus, the team made the CLI more intuitive by renaming `ollama config` to `ollama launch`, and we got some nice fixes for image generation support.
Duration: PT3M58S
https://podlog.io/listen/ollama-3aed006f/episode/memory-magic-and-command-makeover-026838d7
Transcript
Hey there, fellow developers! Welcome back to another episode of the Ollama podcast. I'm your host, and wow - what a day it's been in the codebase! Grab your favorite beverage because we've got some really exciting stuff to dive into.
So the big story today is all about memory optimization, and let me tell you - it's been quite the journey! Jeffrey Morgan has been working on something called MLA absorption for GLM models, which is essentially a way to compress the KV cache and use way less memory. Think of it like organizing your closet - instead of having everything spread out taking up tons of space, you're folding everything neatly and suddenly you have room to breathe.
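To put rough numbers on the closet analogy, here is a small back-of-the-envelope sketch. It assumes the cache holds one shared latent vector per token plus a small positional part, as in the general MLA design. Every dimension below is a made-up example value, not the real GLM configuration, and the function names are hypothetical.

```python
# Illustrative per-token, per-layer KV cache sizes.
# All dimensions are hypothetical example values, not GLM's actual config.

def full_kv_elements(n_heads: int, k_head_dim: int, v_head_dim: int) -> int:
    """Elements cached when every head stores its own key and value vectors."""
    return n_heads * (k_head_dim + v_head_dim)


def latent_kv_elements(kv_latent_rank: int, rope_dim: int) -> int:
    """Elements cached when only a shared latent vector and a small
    positional key part are stored."""
    return kv_latent_rank + rope_dim


n_heads, k_head_dim, v_head_dim = 32, 192, 128  # hypothetical
kv_latent_rank, rope_dim = 512, 64              # hypothetical

full = full_kv_elements(n_heads, k_head_dim, v_head_dim)
latent = latent_kv_elements(kv_latent_rank, rope_dim)

print(f"full cache:   {full} elements per token per layer")
print(f"latent cache: {latent} elements per token per layer")
print(f"ratio:        {full / latent:.1f}x")
```

The actual savings depend on the model's real dimensions and on what else the runtime keeps in memory, so treat the ratio as an illustration of the idea only.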
Now here's where it gets interesting - this feature had quite the adventure getting merged. It went through what I like to call the "third time's the charm" dance. First it got merged, then it had to be reverted because of some CUDA build issues, and then Jeffrey came back with a fix and got it merged again. It's one of those perfect examples of how complex systems work in the real world - sometimes you need to take a step back to move forward.
The technical bits are pretty fascinating if you're into the weeds. They're splitting combined KV_B tensors into separate K_B and V_B tensors, enabling this Multi-head Latent Attention compression. The tricky part was getting all the CUDA configurations just right across different GPU architectures. There were issues with shared memory limits on older Maxwell architectures, and some array size calculations that were causing compilation failures. But Jeffrey worked through each issue methodically - adjusting thread counts, fixing memory configurations, and making sure everything plays nice across different CUDA versions.
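For anyone who wants to see the idea in miniature, here is a toy sketch of the split and of why it helps. It is not Ollama's implementation. It assumes a layout where, for each head, the first `k_dim` rows of the combined matrix produce the key part and the remaining `v_dim` rows produce the value part. The helper names and the tiny sizes are hypothetical.

```python
# Toy sketch: split a combined KV_B weight into per-head K_B and V_B, then
# check the "absorption" identity q . (K_B c) == (K_B^T q) . c.
# The row layout is an assumption made for illustration only.

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_kv_b(kv_b, n_heads, k_dim, v_dim):
    """Return (k_b, v_b), each a list with one matrix per head."""
    per_head = k_dim + v_dim
    if len(kv_b) != n_heads * per_head:
        raise ValueError("kv_b row count does not match n_heads * (k_dim + v_dim)")
    k_b, v_b = [], []
    for h in range(n_heads):
        rows = kv_b[h * per_head:(h + 1) * per_head]
        k_b.append(rows[:k_dim])
        v_b.append(rows[k_dim:])
    return k_b, v_b


# 2 heads, k_dim=2, v_dim=1, latent rank 3 (all hypothetical sizes)
n_heads, k_dim, v_dim, rank = 2, 2, 1, 3
kv_b = [[float(r * rank + c + 1) for c in range(rank)]
        for r in range(n_heads * (k_dim + v_dim))]
k_b, v_b = split_kv_b(kv_b, n_heads, k_dim, v_dim)

latent = [0.5, -1.0, 2.0]  # compressed vector cached for one token
q = [1.0, 3.0]             # head-0 query (non-positional part)

# Naive path: decompress the key from the latent, then take the dot product.
score_naive = dot(q, matvec(k_b[0], latent))

# Absorbed path: fold K_B into the query once, then score against the
# latent directly, so the full key never has to be materialized.
q_absorbed = matvec(transpose(k_b[0]), q)
score_absorbed = dot(q_absorbed, latent)

print(score_naive, score_absorbed)
```

Both paths give the same attention score, which is why having K_B as its own tensor lets the cache stay in compressed form.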
Moving on to user experience improvements - Parth Sareen made a really smart change to the CLI that I think everyone's going to appreciate. The old `ollama config` command has been renamed to `ollama launch`, and the behavior is much more intuitive now. Instead of just configuring things and stopping there, the default behavior is to actually launch your integration right away. Because let's be honest - when you're setting up an integration, you usually want to use it immediately, right? There's still a `--config` flag if you want the old behavior of just configuring without launching.
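The "do the useful thing by default, opt out with a flag" pattern can be sketched in a few lines. This is a toy parser and not the real Ollama CLI. Only the `launch` name and the `--config` flag come from the episode. The positional `integration` argument, the `plan` helper, and the `my-editor` name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Toy stand-in for the CLI; only `launch` and `--config` come from the episode.
    parser = argparse.ArgumentParser(prog="ollama")
    sub = parser.add_subparsers(dest="command", required=True)
    launch = sub.add_parser("launch", help="configure an integration and start it")
    launch.add_argument("integration", help="integration name (hypothetical argument)")
    launch.add_argument("--config", action="store_true",
                        help="only configure, do not launch")
    return parser


def plan(argv):
    """Return the ordered steps the command would take for these arguments."""
    args = build_parser().parse_args(argv)
    steps = ["configure"]
    if not args.config:
        # Default behavior: go straight on to launching after configuring.
        steps.append("launch")
    return steps


print(plan(["launch", "my-editor"]))
print(plan(["launch", "my-editor", "--config"]))
```

The design point is that the common path needs no extra flag, while the less common "configure only" path stays one flag away.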
We also got some nice fixes for image generation. Jeffrey tackled a panic issue in `ollama show` for image generation models and improved support for image editing. One particularly clever fix was flattening transparent PNG images onto a white background for better results. It's those kinds of thoughtful touches that make the whole experience smoother.
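Flattening onto white is ordinary alpha compositing. Below is a minimal sketch, assuming 8-bit RGBA pixels with straight (non-premultiplied) alpha. The `flatten_on_white` name is hypothetical, and the actual fix in Ollama may be implemented differently.

```python
def flatten_on_white(pixels):
    """Composite 8-bit RGBA pixels over an opaque white background.

    Assumes straight (non-premultiplied) alpha. Returns a list of RGB tuples.
    """
    out = []
    for r, g, b, a in pixels:
        inv = 255 - a  # how much of the white background shows through
        out.append((
            (r * a + 255 * inv + 127) // 255,  # +127 rounds to nearest
            (g * a + 255 * inv + 127) // 255,
            (b * a + 255 * inv + 127) // 255,
        ))
    return out


pixels = [
    (255, 0, 0, 255),  # opaque red
    (255, 0, 0, 0),    # fully transparent
    (0, 0, 0, 128),    # half-transparent black
]
print(flatten_on_white(pixels))
```

Without this step, fully transparent pixels keep whatever color values happen to be stored underneath, which a model that ignores alpha would see as real image content.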
On the community side, Stillhart updated the Ruby gem recommendation in the README. They pointed out that the current recommendation wasn't the most popular or well-maintained option in the Ruby ecosystem, so now we're pointing folks to ruby_llm instead. It's a small change, but these kinds of community improvements really add up over time.
The documentation got a refresh too, with updates covering the new launch command and integration workflows. It's great to see the docs staying in sync with the code changes.
For today's focus, if you're working with GLM models, definitely check out the new MLA absorption feature - it could significantly reduce your memory usage. And if you're building any CLI tooling or integrations, take a look at how the team renamed and restructured the launch command. There's a nice lesson there about making default behaviors match user expectations.
That's a wrap for today! Ten merged PRs, ten additional commits, and a whole lot of progress on both the performance and usability fronts. Remember, every line of code is a step forward, even when we sometimes need to step back first. Keep building amazing things, and I'll catch you next time!