Page cover

Create Deployment

Create a deployment to run the LLaMA container. If you are using a Linux operating system, then run the following syntax to create the pvc.yaml file.

nano deployment.yaml

If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.

Text Editor

Enter the following syntax.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-31-70b-instruct
  namespace: vllm
  labels:
    app: llama-31-70b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-31-70b-instruct
  template:
    metadata:
      labels:
        app: llama-31-70b-instruct
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-H100-80GB-HBM3
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: llama-31-70b-instruct
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: llama-31-70b-instruct
        image: vllm/vllm-openai:v0.6.4 
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 32Gi
            nvidia.com/gpu: "2" 
          requests:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /.cache
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 5
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          allowPrivilegeEscalation: false
      runtimeClassName: nvidia

If you are using a Linux operating system, run the following syntax but If you are using a Windows operating system, after save the file as secret.yaml, in CMD navigate to the folder that contains the secret.yaml file and run the following syntax.

kubectl apply -f deployment.yaml
Success

Last updated