Go

Go: SHA-1 Gets a Speed Boost on Loong64

Julian Zhu optimized SHA-1 hashing performance for the Loong64 architecture by switching from loading immediate values into registers to using a static memory table for the constant keys. The change delivers gains of up to about 6% faster hashing and about 6.5% higher throughput across various data sizes on Loongson-3A5000 processors.

Duration: PT4M2S

https://podlog.io/listen/go-e282e2e6/episode/go-sha-1-gets-a-speed-boost-on-loong64-c56df6b4

Transcript

Hey there, fellow Go enthusiasts! Welcome back to another episode of Go. I'm your host, and it's February 2nd, 2026. I hope you're having a fantastic start to your week, whether you're debugging that tricky function or planning out your next big feature.

Today we're diving into something really cool that happened in the Go codebase – a perfect example of how thoughtful optimization can make a real difference in performance. Sometimes the best improvements come from understanding the specific hardware you're working with, and today's commit is a beautiful illustration of that principle.

So what's the story? Julian Zhu made a targeted optimization to SHA-1 hashing specifically for the Loong64 architecture. Now, before your eyes glaze over thinking this is super niche – stick with me, because this is actually a fascinating look at how performance optimization works in the real world.

Here's what Julian did: instead of building the SHA-1 constant keys in registers from immediate values encoded in the instructions, the code now loads those same constants from a static table in memory. It sounds like a small change, right? But the results are pretty impressive. We're talking about performance improvements ranging from about 3% to over 6% faster hashing, depending on the data size you're working with.

The benchmarks tell a really compelling story here. For small 8-byte inputs, we see nearly 3% improvement. But as the data size grows to 320 bytes, 1K, and 8K, those gains climb to around 6%. And when we look at throughput, meaning how much data we can process per second, we're seeing improvements of up to 6.6%. That's substantial!

What I love about this commit is how it shows the importance of understanding your target hardware. On the Loongson-3A5000 processor used for these benchmarks, loading the constants from a memory table turns out to be more efficient than loading them as immediate values. Julian recognized this and made a targeted change that takes advantage of those hardware specifics.

This is exactly the kind of optimization that makes Go such a performant language across different architectures. The Go team and contributors like Julian don't just write code that works – they write code that works well on the specific hardware where it's going to run.

Now, you might be thinking, "This is great, but I'm not working on Loong64." And that's totally fair! But here's the thing – this commit teaches us something valuable about performance optimization in general. Sometimes the biggest gains come not from algorithmic changes, but from understanding how your code interacts with the underlying hardware.

Whether you're optimizing for x86, ARM, or any other architecture, the principle is the same. Think about memory access patterns, consider how your processor handles different types of operations, and don't be afraid to measure and benchmark your changes.

Speaking of benchmarks, can we just appreciate how thorough Julian was here? The commit includes detailed before-and-after measurements across multiple data sizes, showing both execution time and throughput improvements. This is benchmarking done right – comprehensive, clear, and convincing.

For today's focus, here's what I want you to take away: if you're working on performance-critical code, spend some time understanding the hardware characteristics of your deployment targets. Learn about your processor's strengths and design your algorithms accordingly. And always, always benchmark your changes with real workloads.

Even if you're not optimizing cryptographic functions, these principles apply to whatever you're building. Maybe it's how you structure your data for better cache locality, or how you organize your loops for better branch prediction. The specifics change, but the mindset remains the same.

Alright, that's a wrap for today! Keep coding, keep optimizing, and remember – sometimes the best improvements come from understanding not just what your code does, but how it runs. Until tomorrow, happy coding!