AI workloads are expensive regardless of how they are implemented. Third-party APIs with significant markups. Cloud-prem SaaS companies with hefty management fees. And the total cost of ownership of either renting GPUs from AWS or buying hardware outright.
In the short term, there’s a significant arbitrage available to a great DevOps team: how do you scale expensive workloads to zero when they aren’t in use, or right-size them as load rises and falls? Doing this can flip unprofitable unit economics (or stretch the money a startup has raised further).
An obvious objection: we already have serverless environments that scale to zero — Google Cloud Run, AWS Lambda, WebAssembly runtimes, and more. The problem is that these runtimes are explicitly tuned for generic workloads and aren’t made for specialized hardware (read: GPUs).
There are two elements to “scale to zero” for GPU-bound workloads.
First, the actual machines. On AWS, this means Auto Scaling groups (ASGs). As CPU utilization (or whatever metric you’re measuring) rises, the group adds instances (virtual machines), and it removes them as load falls. But ASGs on their own are rarely sufficient to scale to zero. You’re also probably bin-packing multiple workloads onto expensive GPU-powered machines: running different models at different times, or rolling out new versions of models. For this, you probably want to deploy with a different primitive than raw machine images, something like a container. And for that, you need Kubernetes.
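To make the machine half concrete, here’s a minimal sketch with boto3 (the group name `gpu-workers` and the 60% CPU target are placeholders, not a prescription): set the group’s floor to zero and attach a target-tracking policy so instances come and go with load.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Allow the group to scale all the way down to zero instances.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="gpu-workers",  # hypothetical group name
    MinSize=0,
    MaxSize=8,
)

# Target-tracking policy: add instances when average CPU exceeds the
# target, remove them as it falls. For GPU-bound inference you'd likely
# track a custom metric (queue depth, GPU utilization) instead of CPU.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="gpu-workers",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,
    },
)
```

One caveat: with zero instances running there is no instance CPU metric to react to, which is why scaling *from* zero usually keys off an external signal (queue depth, pending pods) rather than utilization.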
The second scale-to-zero mechanism is scaling the actual workload (the pods, deployments, etc.). There’s no great way to do this today; most organizations have built their own hacks. Knative provides the machinery, but it can be challenging to deploy and manage and comes with its own heavyweight dependencies (like Istio). The high-level workflow: queue incoming requests, scale the deployment up if the endpoint has no running replicas, and drain the queue once it’s ready.
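For illustration, here’s a minimal sketch of that workflow using the official kubernetes Python client. The namespace, deployment name, and in-memory queue are all stand-ins; a real version needs a durable queue, error handling, and an idle-timeout loop that patches replicas back to zero.

```python
import time
from kubernetes import client, config

config.load_kube_config()  # load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

NAMESPACE = "models"          # hypothetical namespace
DEPLOYMENT = "model-serving"  # hypothetical deployment name

pending = []  # in practice: a durable queue (SQS, Redis, etc.)

def handle_request(request):
    """Queue a request, scaling the backing deployment from zero if needed."""
    pending.append(request)
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    if not (dep.spec.replicas or 0):
        # Endpoint is scaled to zero: bring up one replica.
        apps.patch_namespaced_deployment_scale(
            DEPLOYMENT, NAMESPACE, body={"spec": {"replicas": 1}}
        )
    # Block until a pod passes its readiness probe. Model containers can
    # take minutes to pull images and load weights, so requests wait here.
    while not apps.read_namespaced_deployment(
        DEPLOYMENT, NAMESPACE
    ).status.ready_replicas:
        time.sleep(2)
    drain_queue()

def drain_queue():
    # Forward queued requests to the now-ready service endpoint.
    while pending:
        request = pending.pop(0)
        ...  # e.g., proxy to the ClusterIP service for the deployment
```

The missing half is a watcher that scales replicas back to zero after an idle timeout; Knative’s activator and autoscaler are essentially a production-hardened version of this loop.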
Scale-to-zero will probably be necessary in the near term, as organizations need to deploy either (1) on-prem models for data security or (2) custom models or infrastructure to serve a particular use case.