Introduction
This section describes how to configure KEDA (Kubernetes Event-Driven Autoscaler) to automatically scale vLLM deployments based on GPU KV cache utilization. Autoscaling ensures that resources are used efficiently and that the system can respond dynamically to changes in workload. In this setup, GPU KV cache usage serves as the primary scaling metric: as more requests occupy the cache, KEDA adds replicas, and as usage falls, it scales back down.
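As a rough illustration of the approach, the sketch below shows a KEDA `ScaledObject` that scales a vLLM Deployment using a Prometheus trigger on vLLM's KV cache utilization metric. The Deployment name, namespace, Prometheus address, and threshold are all assumptions for illustration; the metric name `vllm:gpu_cache_usage_perc` is the gauge vLLM exports for KV cache usage (a value between 0 and 1), but verify it against your vLLM version.

```yaml
# Hypothetical KEDA ScaledObject: scale a vLLM Deployment on GPU KV cache usage.
# All resource names and addresses below are placeholders for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: vllm-system          # assumed namespace
spec:
  scaleTargetRef:
    name: vllm-deployment         # assumed name of the vLLM Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:9090  # assumed Prometheus endpoint
        query: vllm:gpu_cache_usage_perc   # vLLM KV cache utilization gauge, 0-1
        threshold: "0.5"                   # scale out when average usage exceeds 50%
```

With this configuration, KEDA queries Prometheus on its polling interval and adjusts the replica count so that the average KV cache usage per replica stays near the threshold.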