KEDA Configuration Example
Below is a sample ScaledObject definition for KEDA to scale a vLLM deployment based on GPU KV cache metrics collected from Prometheus:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
  namespace: vllm
spec:
  maxReplicaCount: 2
  minReplicaCount: 1
  cooldownPeriod: 120   # Cooldown period in seconds before scale-in
  pollingInterval: 30   # Polling interval for metrics (in seconds)
  scaleTargetRef:
    name: llama-32-3b-instruct
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://l2-jatiluhur.metric.cloudeka.ai
        threshold: '0.5'  # This is equivalent to 50%
        query: vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"}
Explanation:
minReplicaCount: Minimum number of vLLM pods (set to 1).
maxReplicaCount: Maximum number of vLLM pods (set to 2).
cooldownPeriod: Wait time before scaling in.
pollingInterval: Interval at which the metric value is checked.
threshold: The GPU KV cache usage threshold for scaling. The Prometheus metric returns values between 0 and 1, so in this example 0.5 represents 50%. Adjust this value based on your scaling policy (see the worked example below).
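As a rough illustration of how the threshold drives scaling, assuming KEDA's default AverageValue metric type for the Prometheus trigger: if the query returns 0.8 (80% KV cache usage) against a threshold of 0.5, the autoscaler targets roughly ceil(0.8 / 0.5) = 2 replicas (capped by maxReplicaCount); once usage falls back below 0.5, the deployment is scaled in toward minReplicaCount after the configured stabilization period.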
Prometheus Query Example:
vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"}
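The KEDA Prometheus scaler expects this query to return a single value. If the deployment runs more than one replica and the raw metric returns one series per pod, wrapping the query in an aggregation is a common approach; the variant below is illustrative and not part of the configuration above:
avg(vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"})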
Notes: The metrics used for scaling must be provided by the customer, and the metrics endpoint (e.g. http://<SERVICE_IP>:<SERVICE_PORT>/metrics) must be communicated to the Lintasarta operation team.
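For reference, the sketch below shows one way such an endpoint could be scraped on the Prometheus side, assuming a static target and the job name used in the query above; the actual scrape configuration is managed by the Lintasarta operation team:
scrape_configs:
  - job_name: goto-vllm-exporter                   # must match the job label in the ScaledObject query
    metrics_path: /metrics                         # vLLM metrics path
    scrape_interval: 15s                           # illustrative value
    static_configs:
      - targets: ['<SERVICE_IP>:<SERVICE_PORT>']   # metrics endpoint provided by the customer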
Recommendations
Start with a conservative threshold (e.g., 50%, i.e. 0.5 in the Prometheus metric) and monitor the scaling behavior; a sketch of a more gradual scale-in policy follows this list.
Adjust maxReplicaCount according to your GPU and cluster capacity.
Keep an eye on GPU KV cache metrics to avoid cache evictions and performance degradation.
Combine this autoscaler with proper request queueing and batching logic in vLLM to maximize GPU efficiency.
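If scale-in proves too aggressive, the ScaledObject can optionally carry HPA behavior settings through KEDA's advanced section. The snippet below is a minimal, illustrative sketch rather than a recommendation for any specific workload; note that cooldownPeriod primarily governs scaling back to zero, while scale-in between minReplicaCount and maxReplicaCount is driven by these HPA behavior settings:
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # require ~5 minutes of consistently low usage before scaling in
          policies:
            - type: Pods
              value: 1                      # remove at most one pod per period
              periodSeconds: 60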
These metrics are exposed via the /metrics endpoint on the vLLM OpenAI-compatible API server.
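For illustration, the relevant line on that endpoint looks roughly like the following; the numeric value and exact label set vary with the vLLM version and current load:
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.2-3B-Instruct"} 0.42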