# Create Deployment

Create a deployment to run the DeepSeek-R1 container. If you are using a Linux operating system, run the following command to create the deployment.yaml file.

```bash
nano deployment.yaml
```

If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.

<figure><img src="/files/BsjbwkBGhwTv6O7RQ0QD" alt="" width="375"><figcaption><p>Text Editor</p></figcaption></figure>

Enter the following configuration.

{% code lineNumbers="true" %}

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: vllm
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  template:
    metadata:
      labels:
        app: deepseek-r1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-H100-80GB-HBM3
                # - NVIDIA-L40S
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: deepseek-r1
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: deepseek-r1
        image: dekaregistry.cloudeka.id/cloudeka-system/vllm-openai:v0.11.2
        args: [
          "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
          "--trust-remote-code",
          "--gpu-memory-utilization", "0.9",
          "--tensor-parallel-size", "2",
          "--max-model-len", "16000",
        ]
        env:
        - name: HF_HOME
          value: /.cache/huggingface
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 64Gi
            nvidia.com/gpu: "2"
          requests:
            cpu: "8"
            memory: 32Gi
            nvidia.com/gpu: "2"
        volumeMounts:
          - name: cache-volume
            mountPath: /.cache
          - name: shm
            mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 40
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 40
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          allowPrivilegeEscalation: false
      runtimeClassName: nvidia
```

{% endcode %}
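
The probe settings in the manifest determine how long the container may take to start serving before Kubernetes restarts it. As a rough sketch (the variable names are illustrative, and the formula is an approximation that ignores probe timeouts and scheduling jitter), the budget can be worked out from the `livenessProbe` values:

```bash
# Values copied from the livenessProbe in deployment.yaml
INITIAL_DELAY=30        # initialDelaySeconds
PERIOD=30               # periodSeconds
FAILURE_THRESHOLD=40    # failureThreshold

# Approximate time before repeated probe failures trigger a restart
STARTUP_BUDGET=$(( INITIAL_DELAY + PERIOD * FAILURE_THRESHOLD ))
echo "Approximate startup budget: ${STARTUP_BUDGET} seconds"
# prints: Approximate startup budget: 1230 seconds
```

With these values the container has roughly 20 minutes to download and load the model. If the first start takes longer, for example because the model weights are not yet in the cache volume, raising `failureThreshold` is one option.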

{% hint style="warning" %}
Several lines in the configuration above need to be changed.

* Adjust the image version tag according to the model page. The image is set on line 38.

```yaml
image: dekaregistry.cloudeka.id/cloudeka-system/vllm-openai:v0.11.2
```

* The model can run on 2 GPUs, but `--max-model-len` must be reduced to 16000. As a result, each request is limited to 16000 tokens, counting both the prompt and the generated output. To do this, change the `args` section starting on line 39 of deployment.yaml as follows.

```yaml
args: [
          "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
          "--trust-remote-code",
          "--gpu-memory-utilization", "0.9",
          "--tensor-parallel-size", "2",
          "--max-model-len", "16000",
        ]
```

* Detailed parameters for `vllm serve`. The `vllm serve` command starts the vLLM server. The parameters used in the command are:

  1. `<model>`: The name of the model to serve. In this case, it is `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`.
  2. `--trust-remote-code`: Allows custom Python code from the model repository to be executed.
  3. `--gpu-memory-utilization <float>`: The fraction of GPU memory that vLLM may use. A value of 0.9 means up to 90% of each GPU's memory is used.
  4. `--tensor-parallel-size <int>`: The number of GPUs to use for tensor parallelism, which distributes the model across multiple GPUs. In this case, it is set to 2.
  5. `--max-model-len <int>`: The maximum context length, in tokens, that the model can process (prompt plus generated output). Reducing this value lowers memory use, which can help when running the model with fewer GPUs.

  {% endhint %}
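
Before applying the manifest, it can help to confirm that at least one node carries the GPU label required by the `nodeAffinity` rule, since the pod otherwise stays in `Pending`. A minimal check, assuming `kubectl` is already configured for the cluster (the variable names are illustrative):

```bash
# Label key and value taken from the nodeAffinity rule in deployment.yaml
GPU_LABEL="nvidia.com/gpu.product"
GPU_PRODUCT="NVIDIA-H100-80GB-HBM3"

# List every node with its GPU product label shown as an extra column
kubectl get nodes -L "${GPU_LABEL}"

# List only the nodes that satisfy the affinity rule
kubectl get nodes -l "${GPU_LABEL}=${GPU_PRODUCT}"
```

If the second command returns no nodes, adjust the `values` list in the affinity rule to match a product name shown by the first command.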

If you are using a **Linux** operating system, run the following command. If you are using a **Windows** operating system, save the file as deployment.yaml, then in CMD navigate to the folder that contains deployment.yaml and run the following command.

```bash
kubectl apply -f deployment.yaml
```
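
Once the manifest is applied, the rollout can be followed from the command line. The sketch below assumes the `vllm` namespace and `deepseek-r1` names from the manifest; the `--timeout` and `--tail` values are arbitrary choices, not requirements.

```bash
NAMESPACE="vllm"
DEPLOYMENT="deepseek-r1"

# Wait until the Deployment reports all replicas as ready (timeout is an assumption)
kubectl rollout status "deployment/${DEPLOYMENT}" -n "${NAMESPACE}" --timeout=25m

# Show the pod, its status, and the node it was scheduled on
kubectl get pods -n "${NAMESPACE}" -l "app=${DEPLOYMENT}" -o wide

# Print the most recent server log lines, useful while the model is loading
kubectl logs -n "${NAMESPACE}" "deployment/${DEPLOYMENT}" --tail=50
```

A pod that stays in `Pending` usually points to scheduling constraints such as the GPU affinity rule or the GPU resource request, which `kubectl describe pod` can help diagnose.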

{% hint style="warning" %}
To delete the deployment.yaml configuration that has been applied, run the following command.

```bash
kubectl delete -f deployment.yaml -n [namespace]
```

**Replace \[namespace] with the namespace you created in the sub-chapter** [**Create Namespace**](/reference/deployment-llama-3.1-70b-with-vllm-on-kubernetes/create-namespace.md)**.**
{% endhint %}
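
Deleting with `-f deployment.yaml` removes only the objects defined in that file. A quick way to confirm the result, assuming the `vllm` namespace from the manifest (substitute your own):

```bash
NAMESPACE="vllm"

# The deepseek-r1 Deployment should no longer be listed
kubectl get deployments -n "${NAMESPACE}"

# Objects created from other manifests, such as the PersistentVolumeClaim, are listed separately
kubectl get pvc -n "${NAMESPACE}"
```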

