If I am trying to sway others, I would say that an org that has only known inefficiency is ill prepared for the inevitable competition and/or belt tightening, but really, it is the more personal pain of seeing a 5% GPU utilization number in production. I am offended by it. — John Carmack’s resignation letter from Meta
The truth is that GPUs, CPUs, RAM, and every other compute resource are probably running at less than 50% utilization in any given organization. There are plenty of exceptions (training jobs, for example), but low utilization is the norm.
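If you want to put a number on this for your own machines, a minimal sketch is below. It assumes NVIDIA GPUs with nvidia-smi on the PATH and simply averages instantaneous per-GPU utilization samples over a window; crude, but usually enough to see the pattern.

```python
import subprocess
import time

def average_gpu_utilization(num_samples: int = 60, interval_s: float = 1.0) -> float:
    """Average utilization (%) across all visible GPUs over a sampling window."""
    readings = []
    for _ in range(num_samples):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        # nvidia-smi prints one line per GPU, each an integer percentage.
        readings.extend(int(line) for line in out.strip().splitlines())
        time.sleep(interval_s)
    return sum(readings) / len(readings)

if __name__ == "__main__":
    print(f"Average GPU utilization: {average_gpu_utilization():.1f}%")
```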
Supply is not elastic. For companies that run their own hardware (e.g., Meta), GPUs can't be procured out of thin air; it takes time to build data centers and deploy hardware.
Scaling latency. Even in cloud environments, it's tough to scale capacity one-to-one with demand; even the best predictive, well-optimized scaling algorithms lag the demand curve (see the first sketch after this list).
Underprovisioning breaks workloads. Out-of-memory errors are notoriously hard to track down; they can seemingly come out of nowhere, and working but unoptimized code can bring down production in mysterious ways. The safe response is to provision for the peak plus a margin, which keeps average utilization low by construction (a worked example follows the list).
Organizational constraints. Resources are hard to share equitably. Some teams might have more administrative power when acquiring (and protecting) resources. The idea of an internal resource economy has been tried (there was one at Google), but it almost always devolves into a power struggle.
Software constraints. Not all software can fully utilize the hardware. Think of bin packing: even with the best algorithms, there might not be enough right-sized workloads to fill the predetermined hardware boxes, and the leftover capacity is stranded (a bin-packing sketch follows below).
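To make the scaling-latency point concrete, here is a toy simulation with invented demand numbers: a reactive scaler with a one-step provisioning delay and a 20% safety margin chases a spiky demand curve. Even though the fleet briefly falls short at the peak, its average utilization still lands well below 100%.

```python
# Toy autoscaling simulation. Demand numbers and the scaling policy are
# invented for illustration; real autoscalers are smarter, but they still
# pay provisioning latency and carry headroom.

demand = [20, 22, 25, 60, 95, 90, 40, 30, 25, 22]  # instances needed per step
headroom = 1.2                                      # 20% safety margin
current = 30                                        # starting fleet size
capacity = []

for needed in demand:
    capacity.append(current)
    # The scaler only reacts after observing this step's demand,
    # so the adjusted capacity arrives one step late.
    current = max(10, round(needed * headroom))

served = [min(d, c) for d, c in zip(demand, capacity)]
print(f"average utilization: {sum(served) / sum(capacity):.0%}")  # ~73% here
print(f"steps with unmet demand: {sum(d > c for d, c in zip(demand, capacity))}")
```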
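The underprovisioning item works the same way for memory: once out-of-memory crashes have burned a team, jobs get sized for the observed peak plus a margin, and low average utilization follows directly. A back-of-the-envelope example with invented numbers:

```python
# Hypothetical memory profile of one service: short peaks, a long steady state.
peak_gib = 58          # worst case observed under load spikes
steady_gib = 20        # typical working set
safety_margin = 1.10   # 10% headroom on top of the observed peak

provisioned_gib = peak_gib * safety_margin
typical_utilization = steady_gib / provisioned_gib

print(f"provisioned: {provisioned_gib:.0f} GiB")                 # ~64 GiB
print(f"typical memory utilization: {typical_utilization:.0%}")  # ~31%
```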
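And for the bin-packing point: even a reasonable packing heuristic strands capacity when workload shapes don't divide the machine shape evenly. Below is a first-fit-decreasing sketch (one standard heuristic, used here only for illustration) over invented memory requests packed onto identical 64 GiB machines; the 40 GiB and 36 GiB requests can't be paired with each other, so most machines end up hosting a single workload and a large slice of each box sits empty.

```python
# First-fit-decreasing bin packing of workload memory requests (GiB)
# onto identical machines. Sizes are invented for illustration.

machine_gib = 64
workloads = [40, 40, 40, 36, 36, 24]

bins = []  # each entry is the list of workloads placed on one machine
for w in sorted(workloads, reverse=True):
    for b in bins:
        if sum(b) + w <= machine_gib:
            b.append(w)
            break
    else:
        bins.append([w])  # nothing fits; start a new machine

used = sum(sum(b) for b in bins)
total = machine_gib * len(bins)
print(f"machines: {len(bins)}, fleet memory utilization: {used / total:.0%}")
for i, b in enumerate(bins):
    print(f"  machine {i}: {b} -> {machine_gib - sum(b)} GiB stranded")
```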