Introduction

This section describes how to configure KEDA (Kubernetes Event-Driven Autoscaler) to automatically scale vLLM deployments based on GPU KV cache utilization. Autoscaling keeps resource usage efficient and lets the system adapt to changing workloads dynamically. The primary metric used for autoscaling in this setup is GPU KV cache usage.
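As a minimal sketch, a KEDA `ScaledObject` that scales a vLLM Deployment on this metric could look like the following. The Deployment name, namespace, Prometheus address, replica bounds, and threshold are placeholder assumptions for your environment; vLLM exposes KV cache usage as the Prometheus gauge `vllm:gpu_cache_usage_perc` (a value between 0 and 1).

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-kv-cache-scaler        # hypothetical name
  namespace: default                # assumed namespace
spec:
  scaleTargetRef:
    name: vllm-deployment           # hypothetical vLLM Deployment name
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        # assumed in-cluster Prometheus endpoint
        serverAddress: http://prometheus.monitoring.svc:9090
        # average KV cache usage across vLLM pods (0.0 - 1.0)
        query: avg(vllm:gpu_cache_usage_perc)
        # scale out when average usage exceeds 80%
        threshold: "0.8"
```

KEDA evaluates the Prometheus query on each polling interval and adjusts the replica count so that the per-replica metric value stays near the threshold.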