Create Deployment
Create a deployment to run the LLaMA container. If you are using a Linux operating system, run the following command to create the deployment.yaml file.
nano deployment.yaml
If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.

Enter the following manifest, which runs the model on 2 GPUs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-31-70b-instruct
  namespace: vllm
  labels:
    app: llama-31-70b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-31-70b-instruct
  template:
    metadata:
      labels:
        app: llama-31-70b-instruct
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-H100-80GB-HBM3
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: llama-31-70b-instruct
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: llama-31-70b-instruct
        image: vllm/vllm-openai:v0.6.4 
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 32Gi
            nvidia.com/gpu: "2" 
          requests:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /.cache
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 5
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          allowPrivilegeEscalation: false
      runtimeClassName: nvidia
There are several lines in the manifest above that you may need to change.
You need to adjust the image version tag according to the LLaMA model page. You can adjust it on line 37 (the image field) of deployment.yaml.
image: vllm/vllm-openai:v0.6.4
The model can run on 2 GPUs, but you need to reduce max-model-len to 16000. As a consequence, the model can only process 16000 input tokens. To do this, change the args section on line 39 of deployment.yaml as follows.
args: [
"vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
]
Detailed Parameters for vllm serve
The vllm serve command is used to start the vLLM server. Here are the detailed parameters used in the command:
<model>: The name of the model to serve. In this case, it is meta-llama/Llama-3.1-70B-Instruct.
--gpu-memory-utilization <float>: The GPU memory utilization factor. This controls how much of the available GPU memory the model uses. A value of 0.95 means 95% of the GPU memory is used.
--tensor-parallel-size <int>: The number of GPUs to use for tensor parallelism. This distributes the model across multiple GPUs. In this case, it is set to 2.
--max-model-len <int>: The maximum number of input tokens the model can process. Reducing this value can help run the model on fewer GPUs.
--enforce-eager: Enforces eager execution mode, which can improve performance for certain workloads.
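For illustration, if your nodes expose 4 GPUs instead of 2, you could raise --tensor-parallel-size and allow a longer context. The snippet below is only a sketch, not part of the original manifest: the 32000 context length is an assumption, and the nvidia.com/gpu requests and limits in deployment.yaml would also have to be raised to 4.
# Illustrative 4-GPU variant of the args section (values are assumptions)
args: [
  "vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --max-model-len 32000 --enforce-eager"
]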
If you are using a Linux operating system, run the following command. If you are using a Windows operating system, save the file as deployment.yaml, open CMD, navigate to the folder that contains the deployment.yaml file, and run the following command.
kubectl apply -f deployment.yaml
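As an optional verification step (not part of the original guide), you can check that the pod starts and that the model finishes loading. The commands below assume the vllm namespace and the llama-31-70b-instruct deployment name from the manifest above; downloading the 70B weights can take several minutes, which is why the probes use initialDelaySeconds: 240.
# Check pod status in the vllm namespace
kubectl get pods -n vllm
# Follow the vLLM startup logs of the deployment
kubectl logs -f deployment/llama-31-70b-instruct -n vllm
# Optionally, port-forward and call the /health endpoint used by the probes
kubectl port-forward deployment/llama-31-70b-instruct 8000:8000 -n vllm
curl http://localhost:8000/health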
To delete the deployment.yaml configuration that has been applied, run the following command.
kubectl delete -f deployment.yaml -n [namespace]
Replace [namespace] with the namespace you created in the sub-chapter Create Namespace.
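For example, with the vllm namespace used in the manifest above, the delete command would be:
kubectl delete -f deployment.yaml -n vllm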