# KEDA Configuration Example

Below is a sample `ScaledObject` definition that lets KEDA scale a vLLM deployment based on the GPU KV cache usage metric collected from Prometheus:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
  namespace: vllm
spec:
  maxReplicaCount: 2
  minReplicaCount: 1
  cooldownPeriod: 120        # Cooldown period in seconds before scale-in
  pollingInterval: 30        # Polling interval for metrics (in seconds)
  scaleTargetRef:
    name: llama-32-3b-instruct
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://l2-jatiluhur.metric.cloudeka.ai
      threshold: '0.5'   # This is equivalent to 50%
      query: vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"}
```

Explanation:

* `minReplicaCount`: Minimum number of vLLM pods (set to 1).
* `maxReplicaCount`: Maximum number of vLLM pods (set to 2).
* `cooldownPeriod`: Wait time before scaling down.
* `pollingInterval`: Interval to check the metric value.
* `threshold`: The GPU KV cache usage threshold for scaling. In this example:
  * `0.5` represents 50% (the Prometheus metric returns values between 0 and 1).
  * Adjust this value based on your scaling policy; a lower value triggers scale-out earlier.
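
To see how `threshold`, `minReplicaCount`, and `maxReplicaCount` interact, here is a minimal Python sketch of the replica calculation. It assumes KEDA's default `AverageValue` metric type and the Kubernetes HPA's default 10% tolerance. The function name `desired_replicas` is hypothetical and used only for illustration; the real calculation is performed by the HPA controller, which also applies stabilization behavior that is not modeled here.

```python
import math


def desired_replicas(metric_value, threshold, current_replicas,
                     min_replicas=1, max_replicas=2, tolerance=0.1):
    """Approximate the HPA replica calculation for an AverageValue target.

    Assumption: the query returns a single value, and the HPA compares
    metric_value / current_replicas against the threshold.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")

    # Ratio of the per-pod average to the target; 1.0 means "on target".
    usage_ratio = metric_value / (threshold * current_replicas)

    if abs(1.0 - usage_ratio) <= tolerance:
        # Within tolerance: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(metric_value / threshold)

    # Clamp to the bounds set in the ScaledObject.
    return max(min_replicas, min(max_replicas, desired))
```

With the values from this example, a cache usage of 0.8 on one replica yields two replicas, while a usage of 0.3 across two replicas scales back to one.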

Prometheus Query Example:

```promql
vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", model_name="meta-llama/Llama-3.2-3B-Instruct"}
```
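
Before wiring the query into KEDA, it can help to check that it returns exactly one series, since the Prometheus scaler expects a single-value result. The following is a minimal sketch that builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and extracts the value from a response body. The helper names `build_query_url` and `extract_single_value` are hypothetical, and authentication or TLS settings for your Prometheus server are not covered.

```python
import json
from urllib.parse import urlencode

QUERY = ('vllm:gpu_cache_usage_perc{job="goto-vllm-exporter", '
         'model_name="meta-llama/Llama-3.2-3B-Instruct"}')


def build_query_url(server_address, query):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return server_address.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


def extract_single_value(response_body):
    """Return the value of a one-element vector result as a float."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    # Each vector sample is [unix_timestamp, "value_as_string"].
    return float(result[0]["value"][1])
```

If the query matches more than one series (for example, one per pod), consider wrapping it in an aggregation such as `avg(...)` or `max(...)` so that a single value is returned.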

Note: The customer must expose the metrics used for scaling and share the metrics endpoint (e.g. `http://<SERVICE_IP>:<SERVICE_PORT>/metrics`) with the Lintasarta operations team.

{% hint style="success" %}
**Recommendations**

* Start with a conservative threshold (e.g., 50% or 0.5 in Prometheus) and monitor the behavior.
* Adjust `maxReplicaCount` according to your GPU and cluster capacity.
* Keep an eye on GPU KV cache metrics to avoid cache evictions and performance degradation.
* Combine this autoscaler with proper request queueing and batching logic in vLLM to maximize GPU efficiency.
  {% endhint %}

These metrics are exposed via the `/metrics` endpoint on the vLLM OpenAI-compatible API server.
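
As a quick check that the endpoint exposes the KV cache metric, here is a minimal Python sketch that reads the Prometheus text format and extracts the samples for `vllm:gpu_cache_usage_perc`. The function names `parse_metric` and `fetch_metric` are hypothetical, and the parser is a simplified illustration that handles only plain sample lines, not the full exposition format.

```python
from urllib.request import urlopen

METRIC_NAME = "vllm:gpu_cache_usage_perc"


def parse_metric(exposition_text, metric_name=METRIC_NAME):
    """Return the values of every sample of metric_name in Prometheus text format."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        # Sample lines look like: name{labels} value [timestamp]
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric_name:
            continue
        # Drop the label set (if any) before reading the value.
        rest = line.rsplit("}", 1)[1] if "}" in line else line[len(name):]
        values.append(float(rest.split()[0]))
    return values


def fetch_metric(metrics_url, timeout=5):
    """Fetch a /metrics endpoint and return the KV cache usage samples."""
    with urlopen(metrics_url, timeout=timeout) as response:
        return parse_metric(response.read().decode("utf-8"))
```

For example, `fetch_metric("http://<SERVICE_IP>:<SERVICE_PORT>/metrics")` (with the placeholders replaced by your service address) returns a list of values between 0 and 1, one per matching series.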


