KEDA Configuration Example

Below is a sample ScaledObject definition for KEDA to scale a vLLM deployment based on GPU KV cache metrics collected from Prometheus:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
  namespace: vllm
spec:
  maxReplicaCount: 2
  minReplicaCount: 1
  cooldownPeriod: 120        # Cooldown period in seconds before scale-in
  pollingInterval: 30        # Polling interval for metrics (in seconds)
  scaleTargetRef:
    name: llama-32-3b-instruct
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://l2-jatiluhur.metric.cloudeka.ai
      threshold: '0.5'   # This is equivalent to 50%
      query: vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"}
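
Once saved (e.g. as scaledobject.yaml; the file name here is illustrative), the resource can be applied and verified with kubectl. KEDA then creates and manages a Horizontal Pod Autoscaler for the target Deployment:

kubectl apply -f scaledobject.yaml
kubectl get scaledobject -n vllm     # shows the ScaledObject and its readiness
kubectl get hpa -n vllm              # the HPA created by KEDA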

Explanation of the ScaledObject fields:

  • minReplicaCount: Minimum number of vLLM pods (set to 1).

  • maxReplicaCount: Maximum number of vLLM pods (set to 2).

  • cooldownPeriod: Wait time, in seconds, before scaling back down once the metric drops below the threshold.

  • pollingInterval: How often, in seconds, KEDA polls the metric.

  • scaleTargetRef: The Deployment to scale (here, llama-32-3b-instruct; KEDA targets a Deployment by default).

  • threshold: The GPU KV cache usage at which scaling is triggered. In this example:

      • 0.5 represents 50% (the Prometheus metric returns values between 0 and 1).

      • Adjust this value to match your scaling policy, e.g. 0.8 for 80%; see the worked example below.
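
Under KEDA's default metric type for the Prometheus scaler (AverageValue), the resulting Horizontal Pod Autoscaler computes, roughly, desiredReplicas = ceil(query result / threshold). A worked example with the values above:

query result    = 0.8                   (80% GPU KV cache usage)
threshold       = 0.5
desiredReplicas = ceil(0.8 / 0.5) = 2   (then clamped between minReplicaCount and maxReplicaCount)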

Prometheus Query Example:

vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"}
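
KEDA's Prometheus scaler expects the query to return a single value. If the deployment scales past one pod and the metric is reported per pod, it is common to wrap the query in an aggregation. A sketch of such a trigger, assuming one series per pod with the labels shown above:

  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://l2-jatiluhur.metric.cloudeka.ai
      threshold: '0.5'
      # avg() collapses the per-pod series into a single value for KEDA
      query: avg(vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"})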

Note: the metrics used for scaling must be provided by the customer, and the metrics endpoint (e.g. http://<SERVICE_IP>:<SERVICE_PORT>/metrics) must be shared with the Lintasarta operations team.

These metrics are exposed via the /metrics endpoint of the vLLM OpenAI-compatible API server.
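
To confirm the metric is actually being exported before wiring it into KEDA, the endpoint can be inspected directly; a quick check, assuming the service is reachable from your shell:

curl -s http://<SERVICE_IP>:<SERVICE_PORT>/metrics | grep gpu_cache_usage_perc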
