Create Deployment

Create a Deployment to run the vLLM container that serves the DeepSeek-R1 model. If you are using a Linux operating system, run the following command to create the deployment.yaml file.

nano deployment.yaml

If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.

Enter the following configuration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1
  namespace: vllm
  labels:
    app: deepseek-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  template:
    metadata:
      labels:
        app: deepseek-r1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-H100-80GB-HBM3
                # - NVIDIA-L40S
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: deepseek-r1
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: deepseek-r1
        image: dekaregistry.cloudeka.id/cloudeka-system/vllm-openai:v0.11.2
        args: [
          "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
          "--trust-remote-code",
          "--gpu-memory-utilization", "0.9",
          "--tensor-parallel-size", "2",
          "--max-model-len", "16000",
        ]
        env:
        - name: HF_HOME
          value: /.cache/huggingface
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 64Gi
            nvidia.com/gpu: "2"
          requests:
            cpu: "8"
            memory: 32Gi
            nvidia.com/gpu: "2"
        volumeMounts:
          - name: cache-volume
            mountPath: /.cache
          - name: shm
            mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 40
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 40
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          allowPrivilegeEscalation: false
      runtimeClassName: nvidia
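
The probe settings in the manifest determine how long the container may take to load the model before Kubernetes restarts it. The following is a minimal sketch of that arithmetic, assuming the first probe fires after initialDelaySeconds and ignoring probe timeouts and scheduling jitter. The function name is illustrative and not part of any Kubernetes API.

```python
# Probe settings copied from the livenessProbe in deployment.yaml.
INITIAL_DELAY_SECONDS = 30
PERIOD_SECONDS = 30
FAILURE_THRESHOLD = 40


def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time before a container that never becomes healthy is restarted.

    The first probe runs after `initial_delay`, each later probe runs one
    `period` after the previous one, and the restart happens after
    `failure_threshold` consecutive failures.
    """
    return initial_delay + period * (failure_threshold - 1)


budget = startup_budget_seconds(INITIAL_DELAY_SECONDS, PERIOD_SECONDS, FAILURE_THRESHOLD)
print(f"Approximate startup budget: {budget} s ({budget / 60:.0f} min)")
```

If downloading and loading the model takes longer than this budget in your environment, consider raising failureThreshold or initialDelaySeconds.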

There are several lines in the configuration above that you need to adjust.

Adjust the image version tag according to the model page. The image is defined on line 38 of deployment.yaml.

image: dekaregistry.cloudeka.id/cloudeka-system/vllm-openai:v0.11.2

The model can run on 2 GPUs, but you need to reduce max-model-len to 16000. As a consequence, the model's context is limited to 16000 tokens, which covers the prompt and the generated output together. To do this, set the args section, which starts on line 39 of deployment.yaml, as follows.

args: [
  "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
  "--trust-remote-code",
  "--gpu-memory-utilization", "0.9",
  "--tensor-parallel-size", "2",
  "--max-model-len", "16000",
]

Detailed Parameters for vllm serve

The vllm serve command starts the vLLM server. The parameters used in the configuration above are:
1. <model>: The name of the model to serve. In this case, it is deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
2. --trust-remote-code: Allows executing custom Python code from the model repository.
3. --gpu-memory-utilization <float>: The fraction of each GPU's memory that vLLM may use. A value of 0.9 means up to 90% of the GPU memory is used.
4. --tensor-parallel-size <int>: The number of GPUs to use for tensor parallelism, which distributes the model across multiple GPUs. In this case, it is set to 2, matching the nvidia.com/gpu value in the resources section.
5. --max-model-len <int>: The maximum context length of the model, counting prompt tokens and generated tokens together. Reducing this value lowers the memory needed per request, which helps when running the model on fewer GPUs.
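
To see how these three parameters interact, the following back-of-the-envelope sketch compares the memory budget with the size of the model weights and the KV cache of one full-length sequence. All numbers in it are assumptions, not values from this guide: roughly 70 billion parameters stored in 16-bit precision, a Llama-70B-style layout (80 layers, 8 KV heads, head dimension 128), and 80 GiB of memory per GPU. It ignores activations and runtime overhead, so the real headroom is smaller than the estimate.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the model weights (2 bytes per parameter for 16-bit)."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV cache size for one token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def memory_budget_bytes(gpu_mem_gib: float, num_gpus: int, utilization: float) -> float:
    """Total memory vLLM may use across all tensor-parallel GPUs."""
    return gpu_mem_gib * GIB * num_gpus * utilization


# Assumed model and hardware figures (not taken from this guide).
NUM_PARAMS = 70e9
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 80, 8, 128
GPU_MEM_GIB = 80

# Values from the args section of deployment.yaml.
TENSOR_PARALLEL_SIZE = 2
GPU_MEMORY_UTILIZATION = 0.9
MAX_MODEL_LEN = 16000

weights = weight_bytes(NUM_PARAMS)
budget = memory_budget_bytes(GPU_MEM_GIB, TENSOR_PARALLEL_SIZE, GPU_MEMORY_UTILIZATION)
kv_one_sequence = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM) * MAX_MODEL_LEN

print(f"Budget:              {budget / GIB:6.1f} GiB")
print(f"Weights:             {weights / GIB:6.1f} GiB")
print(f"KV cache (1 seq):    {kv_one_sequence / GIB:6.1f} GiB")
print(f"Estimated headroom:  {(budget - weights - kv_one_sequence) / GIB:6.1f} GiB")
```

Under these assumptions the weights take most of the budget, so the KV cache, which grows linearly with max-model-len, has to fit in what remains.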

If you are using a Linux operating system, run the following command. If you are using a Windows operating system, save the file as deployment.yaml, open CMD, navigate to the folder that contains deployment.yaml, and then run the following command.

kubectl apply -f deployment.yaml
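
After the Deployment is applied, the container needs time to load the model. The sketch below checks the same /health endpoint on port 8000 that the probes in deployment.yaml use. It assumes the port has been made reachable on your machine, for example with kubectl port-forward deployment/deepseek-r1 8000:8000 -n vllm; the base URL and the function name are illustrative.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


if __name__ == "__main__":
    # Assumed local address created by the port-forward described above.
    print("healthy:", is_healthy("http://localhost:8000"))
```

A result of False right after applying the Deployment is expected while the model is still loading.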

To delete the Deployment that has been applied, run the following command.

kubectl delete -f deployment.yaml -n [namespace]

Replace [namespace] with the namespace you created in the sub-chapter Create Namespace.


