Create Deployment
Create a deployment to run the LLaMA container. If you are using a Linux operating system, run the following command to create the deployment.yaml file.
If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.
Enter the following configuration and save the file as deployment.yaml.
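Below is a minimal sketch of what the deployment.yaml could look like. It assumes the vllm/vllm-openai container image, an OpenAI-compatible server listening on port 8000, a Hugging Face token stored in a secret named hf-token-secret, and a model cache PVC named llama-model-pvc; the deployment name, image tag, shared-memory size, and line numbering are illustrative and will differ from the file used in the original guide.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-1-70b-instruct
  labels:
    app: llama-3-1-70b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-3-1-70b-instruct
  template:
    metadata:
      labels:
        app: llama-3-1-70b-instruct
    spec:
      containers:
        - name: vllm
          # Version tag is illustrative; adjust it to match the LLaMA model page.
          image: vllm/vllm-openai:v0.6.3
          command: ["vllm", "serve"]
          args:
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--tensor-parallel-size"
            - "4"
            - "--enforce-eager"
          env:
            # Assumes a secret holding the Hugging Face token was created earlier.
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: 4
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        # Assumes the PVC created in the earlier sub-chapter is named llama-model-pvc.
        - name: model-cache
          persistentVolumeClaim:
            claimName: llama-model-pvc
        # Tensor parallelism needs extra shared memory for inter-GPU communication.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "10Gi"
```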
Running the Model on 2 GPUs
There are several lines in the configuration above that you need to change.
Adjust the image version tag according to the LLaMA model page; the tag is on line 58 of deployment.yaml.
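Line 58 refers to the original deployment.yaml, not the sketch above. The line in question is the container image line; the tag shown below is only an example:

```yaml
# Example only: replace the tag with the version recommended on the LLaMA model page.
image: vllm/vllm-openai:v0.6.3
```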
The model can also be run on 2 GPUs, but you need to reduce --max-model-len to 16000; as a consequence, the model can only process up to 16000 input tokens. To do this, change the args section on line 39 of deployment.yaml as follows.
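A sketch of how the modified args section could look for a 2-GPU setup, based on the parameters described in this section (the surrounding fields stay unchanged):

```yaml
          args:
            - "meta-llama/Llama-3.1-70B-Instruct"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--tensor-parallel-size"
            - "2"            # use 2 GPUs instead of 4
            - "--max-model-len"
            - "16000"        # limit the context length so the model fits on 2 GPUs
            - "--enforce-eager"
```

If the deployment also requests GPUs through resources (nvidia.com/gpu), lower that value to 2 as well.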
Detailed Parameters for vllm serve
The vllm serve command is used to start the vLLM server. Here are the parameters used in the command:
<model>: The name of the model to serve. In this case, it is meta-llama/Llama-3.1-70B-Instruct.
--gpu-memory-utilization <float>: The GPU memory utilization factor, which controls how much of the available GPU memory the model is allowed to use. A value of 0.95 means 95% of the GPU memory will be used.
--tensor-parallel-size <int>: The number of GPUs to use for tensor parallelism, which distributes the model across multiple GPUs. In this case, it is set to 4.
--max-model-len <int>: The maximum context length, in tokens, that the model can handle. Reducing this value helps when running the model on fewer GPUs.
--enforce-eager: Forces PyTorch eager execution instead of building CUDA graphs, which reduces GPU memory overhead at the cost of some inference performance.
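Put together, the parameters above correspond to a vllm serve invocation along these lines (shown as a plain command for readability; in the deployment it is expressed through the container's command and args fields, and --max-model-len is only added for the 2-GPU setup):

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 4 \
  --enforce-eager
```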
If you are using a Linux operating system, run the following command. If you are using a Windows operating system, after saving the file as deployment.yaml, open CMD, navigate to the folder that contains the deployment.yaml file, and run the following command.
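A minimal sketch of the apply step, assuming the manifest is saved as deployment.yaml (the exact command in the original guide may differ slightly):

```bash
kubectl apply -f deployment.yaml -n [namespace]
```

You can then check that the pod starts with kubectl get pods -n [namespace].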
To delete the deployment.yaml configuration that has been applied, run the following command.
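For example, assuming the same file name:

```bash
kubectl delete -f deployment.yaml -n [namespace]
```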
Replace [namespace] with the namespace you created in the sub-chapter Create Namespace.