PyTorch

PyTorch: The Great Test Speed Revolution

Today we're diving into a massive performance breakthrough in PyTorch's testing infrastructure! Howard Huang led an incredible optimization effort that slashed test execution times by over 70%, saving developers about 43 minutes per test run. We'll also explore updates to ROCm support, XPU compilation improvements, and some clever caching optimizations that are making PyTorch development faster across the board.

Duration: 4 minutes 22 seconds

https://podlog.io/listen/pytorch-2496be96/episode/pytorch-the-great-test-speed-revolution-40dbfcd9

Transcript

Hey there, PyTorch developers! Welcome back to another episode of the PyTorch podcast. I'm your host, and wow, do I have an exciting story for you today from February 3rd, 2026.

You know that feeling when you're waiting for tests to run and you could literally make a cup of coffee, maybe even bake some cookies, and still have time left over? Well, Howard Huang just became everyone's hero by tackling this exact problem head-on, and the results are absolutely mind-blowing.

Let me tell you about the star of today's show - a game-changing optimization to PyTorch's FSDP testing infrastructure. Howard introduced something called `MultiProcContinuousTest`, and friends, this is the kind of behind-the-scenes work that makes every developer's life better.

Here's the story: traditionally, PyTorch's distributed tests were using `MultiProcessTestCase`, which sounds reasonable enough, right? But here's the catch - it was spawning fresh worker processes for every single test method. Imagine if every time you wanted to test a small function, your computer had to boot up entirely new processes. That's essentially what was happening, and it was eating up massive amounts of time.

Howard's solution was brilliant in its simplicity. Instead of spawning new processes for each test, `MultiProcContinuousTest` reuses worker processes across all test methods within a test class. It's like carpooling for your tests - much more efficient!
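A note for readers following along in the transcript: the difference between the two patterns can be sketched with the standard library's `unittest`. This is only an illustration under stated assumptions, not PyTorch's implementation. `FakeWorkerGroup`, `PerTestSpawn`, `SharedWorkers`, and `count_spawns` are hypothetical names, and the "workers" here are plain Python objects standing in for real processes.

```python
import unittest


class FakeWorkerGroup:
    """Hypothetical stand-in for a group of spawned worker processes."""

    spawn_count = 0  # number of groups created so far

    def __init__(self):
        FakeWorkerGroup.spawn_count += 1

    def run(self, x):
        return x * 2


class PerTestSpawn(unittest.TestCase):
    """Creates a fresh worker group before every test method."""

    def setUp(self):
        self.workers = FakeWorkerGroup()

    def test_one(self):
        self.assertEqual(self.workers.run(1), 2)

    def test_two(self):
        self.assertEqual(self.workers.run(2), 4)

    def test_three(self):
        self.assertEqual(self.workers.run(3), 6)


class SharedWorkers(unittest.TestCase):
    """Creates one worker group and reuses it for the whole class."""

    @classmethod
    def setUpClass(cls):
        cls.workers = FakeWorkerGroup()

    def test_one(self):
        self.assertEqual(self.workers.run(1), 2)

    def test_two(self):
        self.assertEqual(self.workers.run(2), 4)

    def test_three(self):
        self.assertEqual(self.workers.run(3), 6)


def count_spawns(case):
    """Run every test in `case` and return how many groups were created."""
    FakeWorkerGroup.spawn_count = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)  # TestSuite.run also handles setUpClass/tearDownClass
    assert result.wasSuccessful()
    return FakeWorkerGroup.spawn_count


if __name__ == "__main__":
    print("per-test setup:", count_spawns(PerTestSpawn))    # 3 groups
    print("per-class setup:", count_spawns(SharedWorkers))  # 1 group
```

With three test methods, the per-test version creates three groups and the shared version creates one. When each "group" is really a set of freshly spawned processes, that gap is where the time goes.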

The numbers are absolutely staggering. Across 27 test files, the total execution time dropped from over 61 minutes down to just 18 minutes. That's a 70.7% improvement, saving developers 43.3 minutes per test run! Some individual files saw even more dramatic improvements - `test_wrap.py` went from over 9 minutes down to just 38 seconds. That's a 93% improvement!
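For anyone who wants to check the arithmetic, here is a small sketch that works backward from the two precise figures quoted (43.3 minutes saved, 70.7% improvement) and recomputes the `test_wrap.py` percentage. The variable and function names are illustrative, and since the episode only says "over 9 minutes" for `test_wrap.py`, the 9-minute starting point is a lower bound rather than a measured value.

```python
def percent_saved(before, after):
    """Return the reduction from `before` to `after` as a percentage."""
    return (before - after) / before * 100


# Figures quoted in the episode.
saved_min = 43.3   # minutes saved across the 27 files
saved_pct = 70.7   # overall improvement, in percent

# Implied totals: saved = before * pct / 100, so before = saved / (pct / 100).
before_min = saved_min / (saved_pct / 100)
after_min = before_min - saved_min

# test_wrap.py: at least 9 minutes (540 s) before, 38 s after.
test_wrap_pct = percent_saved(9 * 60, 38)

if __name__ == "__main__":
    print(f"implied total before: {before_min:.1f} min")
    print(f"implied total after:  {after_min:.1f} min")
    print(f"test_wrap.py saving:  {test_wrap_pct:.1f}%")
```

The implied totals come out to roughly 61.2 and 17.9 minutes, which lines up with "over 61 minutes" and "just 18 minutes".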

But Howard didn't stop there. In a follow-up commit, he extended this optimization to DTensor tests as well, shaving off another 7 minutes from the test suite. When you're iterating on code and running tests frequently, these kinds of improvements compound into hours of saved time every day.

Now, while Howard was revolutionizing test performance, Yu Guo was hard at work updating ROCm support with fresh commits to the composable_kernel and aiter submodules. This kind of foundational work keeps PyTorch running smoothly on AMD hardware, and it's the kind of cross-platform commitment that makes PyTorch such a robust framework.

We also saw some great progress on XPU support, with xinan.lin adding standalone compile API support in the _Exporter for XPU devices. It's exciting to see PyTorch's hardware support continuing to expand and mature.

On the optimization front, Animesh Jain delivered not one but two performance improvements to Dynamo's compilation process: first by caching attribute source construction, and then by caching constant attributes of the `InliningInstructionTranslator`. These might sound like small technical details, but Animesh reported reducing their model's compile time from 10.6 seconds to 10.2 seconds, a reduction of roughly 4%. When you're dealing with large models and frequent recompilation, every fraction of a second counts.
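The general idea behind this kind of caching can be sketched with `functools.lru_cache` from the standard library. To be clear, this is not Dynamo's code: `AttrSourceSketch` and `make_attr_source` are hypothetical names, and the sketch assumes only that building the same (base, attribute) pair repeatedly is wasted work that a cache can avoid.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class AttrSourceSketch:
    """Hypothetical record describing where an attribute value came from."""

    base: str
    member: str

    def name(self):
        return f"{self.base}.{self.member}"


@lru_cache(maxsize=None)
def make_attr_source(base, member):
    """Build the source object once per (base, member) pair, then reuse it."""
    return AttrSourceSketch(base, member)


if __name__ == "__main__":
    first = make_attr_source("self", "weight")
    second = make_attr_source("self", "weight")  # served from the cache
    print(first is second)
    print(make_attr_source.cache_info())
```

The second call returns the very same object instead of constructing a new one. In a compiler that looks up the same attributes thousands of times, small savings like this can add up.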

There's also a fun update for einops users - Guilherme Leobas updated the integration so that starting with einops 0.8.2, these operations no longer need to be registered with `allow_in_graph`. This means Dynamo can trace through einops operations more efficiently, producing cleaner graphs with pure PyTorch operations.

Today's Focus: If you're working on PyTorch testing, especially distributed tests, consider how you might apply Howard's approach to your own test suites. Look for opportunities to reuse expensive setup operations across multiple tests. And if you're using einops in your projects, make sure you're on version 0.8.2 or later to take advantage of the improved Dynamo integration.
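If you want to confirm which einops version you have, a minimal standard-library check might look like the sketch below. The helper names (`parse_version`, `meets_minimum`, `einops_is_new_enough`) are made up for illustration, and the parser is deliberately simple: it compares only the leading numeric parts, so a pre-release such as `0.8.2rc1` is treated the same as `0.8.2`.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.8.2' into (0, 8, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.8.2"):
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def einops_is_new_enough():
    """Return True/False for the installed einops, or None if it is absent."""
    try:
        return meets_minimum(metadata.version("einops"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("einops >= 0.8.2:", einops_is_new_enough())
```

Comparing tuples of integers rather than raw strings matters here: as strings, "0.10.0" would sort before "0.8.2", which is the wrong answer.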

That's a wrap for today's episode! These performance improvements might happen behind the scenes, but they're the foundation that makes all of our PyTorch development more enjoyable and productive. Keep building amazing things, and I'll catch you in the next episode!