
For years, Nvidia’s H100 GPUs have been the undisputed kings of the AI jungle, the apex predators that every competitor wants to dethrone.
But here in Silicon Valley, where I’m writing this and where a good underdog story is always appreciated, AMD is roaring onto the scene with its Instinct MI300 Series and the upcoming MI350 Series, demonstrating not just competitive prowess but a significant, open-source-driven lead in key AI training and inference workloads.
In a world where cutting-edge hardware is harder to get than a unicorn on a skateboard, AMD’s strategy of being not just “good enough” but demonstrably superior, particularly in performance per dollar, is a true game-changer.
Let’s chat about AMD’s Advancing AI 2025 event this week, and we’ll close with my Product of the Week, the new Eight Sleep Pod 5, which could end the war over the thermostat between spouses forever.
Unleashing Raw Power: AMD’s MLPerf Dominance
AMD’s recent debut in the MLPerf Training v5.0 benchmarks wasn’t just an entry; it was a thunderclap. This critical AI training benchmark, which focuses on real-world workloads such as fine-tuning a variant of the Llama 2 70B model using LoRA (Low-Rank Adaptation), revealed that AMD’s Instinct MI325X platform outperformed six OEM submissions built on Nvidia’s H200 platform by up to 8%, according to the published results.
Let that sink in: AMD, the perennial challenger, just demonstrated a lead over Nvidia’s H200 in a head-to-head, real-world AI training scenario. The AMD Instinct MI300X platform also delivered competitive performance against the Nvidia H100 on the same workload, proving that both GPU platforms in the Instinct MI300 Series are potent solutions for diverse training needs.
That performance edge isn’t just theory. It’s independently verified proof that AMD isn’t merely competing; it’s winning.
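For a sense of what that workload actually looks like, here is a minimal sketch of LoRA fine-tuning in the spirit of the benchmark, using the Hugging Face transformers and peft libraries. The rank, target modules, and other settings below are illustrative assumptions, not the MLPerf reference configuration.

```python
# Minimal LoRA fine-tuning sketch in the spirit of the MLPerf workload.
# Illustrative only: the actual MLPerf reference implementation and
# hyperparameters differ. Requires access to the gated Llama 2 checkpoint.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model (70B parameters, so multi-GPU in practice).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA injects small, trainable low-rank adapter matrices into selected
# projection layers while the base weights stay frozen.
config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of 70B trains
```

The appeal of this setup, and why MLPerf uses it as a benchmark, is that only the small adapter matrices train, which makes fine-tuning a 70B-parameter model tractable on a single multi-GPU node.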
OEM Results Confirm AMD Reproducibility
What makes this even more compelling is the widespread validation from AMD’s partners. Six OEM ecosystem partners, including giants like Dell, Oracle, Gigabyte, and QCT, submitted their own MLPerf Training results using AMD Instinct MI300 Series GPUs.
That third-party participation isn’t just AMD showing off its shiny toys; it’s an industry-wide affirmation that AMD Instinct performance can be consistently reproduced across various OEM platforms. Supermicro even broke new ground by becoming the first company to submit liquid-cooled AMD Instinct results to MLPerf, achieving a time-to-train score of 21.75 minutes with an Instinct MI325X platform, highlighting the thermal efficiency and scaling potential of advanced cooling solutions.
MangoBoost further raised the bar with the first-ever multi-node training submission powered by AMD Instinct GPUs, showcasing significant scalability with 2-node (16-GPU MI300X) setups completing training in 16.32 minutes and 4-node (32-GPU MI300X) configurations in just 10.92 minutes. These results are a testament to the maturity, flexibility, and openness of the AMD Instinct ecosystem.
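To put those multi-node numbers in perspective, a quick back-of-the-envelope calculation (my arithmetic, not part of the submission) shows how well the configuration scales:

```python
# Rough scaling math from MangoBoost's published time-to-train results.
t_2node = 16.32  # minutes: 2 nodes, 16 MI300X GPUs
t_4node = 10.92  # minutes: 4 nodes, 32 MI300X GPUs

speedup = t_2node / t_4node   # ~1.49x from doubling the GPU count
efficiency = speedup / 2.0    # ~0.75 versus ideal linear scaling

print(f"Speedup from 16 to 32 GPUs: {speedup:.2f}x")
print(f"Scaling efficiency vs. linear: {efficiency:.0%}")
```

Roughly 75% of ideal linear scaling is a respectable result for a first-ever multi-node submission, given that inter-node communication overhead always claws back some of the theoretical gain.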
Open-Source Secret Sauce: ROCm’s Rapid Evolution
At the heart of AMD’s accelerating performance is the rapid evolution of AMD ROCm, its open-source software stack.
ROCm isn’t just a side project; it’s receiving relentless, developer-focused attention, with software updates rolling out every two weeks. ROCm 7, slated for preview on August 12, promises out-of-the-box compatibility, local execution on Windows and Linux, and significant performance uplifts: up to 3.5 times the performance and 3 times the training speed of prior ROCm versions on the same hardware.
This open-source advantage allows AMD to iterate and innovate at a pace that Nvidia’s proprietary CUDA ecosystem simply cannot match. While hand-tuned CUDA kernels often need to be rewritten to extract full performance from each new Nvidia GPU generation (a kernel optimized for the H100 won’t simply carry its performance over to Blackwell, for instance), ROCm’s open nature facilitates far quicker adaptation and broader compatibility.
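One concrete illustration of that portability story: PyTorch’s ROCm builds expose the familiar torch.cuda API on top of HIP, so typical application-level code runs unchanged on either vendor’s hardware. A minimal sketch:

```python
# Device-agnostic PyTorch: the same code path runs on Nvidia (CUDA)
# and AMD (ROCm) GPUs, because PyTorch's ROCm build maps the
# torch.cuda API onto HIP under the hood.
import torch

if torch.cuda.is_available():
    # torch.version.hip is set on ROCm builds and None on CUDA builds.
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"Backend: {backend} on {torch.cuda.get_device_name(0)}")

    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x  # identical matmul call on either vendor's silicon
else:
    print("No GPU backend available; falling back to CPU.")
```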
That rapid pace of software advancement, coupled with deeper ecosystem collaboration, is already paying dividends: AMD’s internal testing shows ROCm 7 running 30% faster than Nvidia’s CUDA stack in key inference benchmarks.
Beyond raw speed, ROCm 7 introduces advanced AI capabilities, including text-to-text, text-to-image, support for European language models, agent platforms, and multimodal processing — delivering up to three times the performance of ROCm 6 in inference and training workloads.