Create Deployment
Create a deployment to run the LLaMA container. If you are using a Linux operating system, run the following command to create the deployment.yaml file.
nano deployment.yaml
If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.

Enter the following configuration, which runs the model on 2 GPUs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-31-70b-instruct
  namespace: vllm
  labels:
    app: llama-31-70b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-31-70b-instruct
  template:
    metadata:
      labels:
        app: llama-31-70b-instruct
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-H100-80GB-HBM3
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: llama-31-70b-instruct
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: llama-31-70b-instruct
        image: vllm/vllm-openai:v0.6.4
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 32Gi
            nvidia.com/gpu: "2"
          requests:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /.cache
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 5
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          allowPrivilegeEscalation: false
      runtimeClassName: nvidia
There are a few lines in the configuration above that you need to change.
Adjust the vLLM image version tag according to the requirements listed on the LLaMA model page. You can change it on the image line of deployment.yaml.
image: vllm/vllm-openai:v0.6.4
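Optionally, and assuming you have Docker available on a machine with internet access, you can confirm that the tag you chose exists in the registry before editing the deployment (the tag below is simply the one used in this guide).
docker pull vllm/vllm-openai:v0.6.4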
The model can run on 2 GPUs, but you need to reduce max-model-len to 16000, which means the model can only process 16000 input tokens per request. To do this, change the args section in deployment.yaml as follows.
args: [
"vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
]
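As a rough sketch only (not part of this guide's configuration): if you schedule the pod on 4 GPUs instead, you can raise max-model-len, since the KV cache is spread across more GPU memory. The value 32000 below is illustrative, and you would also need to change nvidia.com/gpu to "4" in both requests and limits.
args: [
"vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --max-model-len 32000 --enforce-eager"
]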
Detailed parameters for vllm serve. The vllm serve command starts the vLLM OpenAI-compatible server. The parameters used in the command are:
<model>: the name of the model to serve. In this case, it is meta-llama/Llama-3.1-70B-Instruct.
--gpu-memory-utilization <float>: the fraction of available GPU memory the model may use. A value of 0.95 means 95% of the GPU memory will be used.
--tensor-parallel-size <int>: the number of GPUs to use for tensor parallelism, which distributes the model across multiple GPUs. In this case, it is set to 2.
--max-model-len <int>: the maximum number of tokens the model can process per request. Reducing this value helps run the model on fewer GPUs.
--enforce-eager: runs the model in eager mode instead of capturing CUDA graphs, which lowers GPU memory overhead at some cost in throughput.
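To review every available flag, you can print the full list of server options from inside the container, or from any environment where vLLM is installed.
vllm serve --help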
If you are using a Linux operating system, run the following command. If you are using a Windows operating system, save the file as deployment.yaml, then in CMD navigate to the folder that contains deployment.yaml and run the following command.
kubectl apply -f deployment.yaml
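After applying the deployment, you can watch the pod start, follow the vLLM logs, and test the OpenAI-compatible endpoint locally. The commands below are a sketch that assumes the deployment name llama-31-70b-instruct and the vllm namespace used in this guide; run the curl command in a separate terminal while port-forward is active.
kubectl get pods -n vllm -w
kubectl logs -f deployment/llama-31-70b-instruct -n vllm
kubectl port-forward deployment/llama-31-70b-instruct 8000:8000 -n vllm
curl http://localhost:8000/v1/models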

To delete the deployment.yaml configuration that has been applied, run the following command.
kubectl delete -f deployment.yaml -n [namespace]
Replace [namespace] with the namespace you created in the sub-chapter Create Namespace.