Create Deployment

Create a deployment to run the LLaMA container. If you are using a Linux operating system, run the following command to create the deployment.yaml file.

nano deployment.yaml

If you are using a Windows operating system, open a text editor such as Notepad or Notepad++.

Enter the following configuration.

This manifest runs the model on 2 GPUs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-31-70b-instruct
  namespace: vllm
  labels:
    app: llama-31-70b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-31-70b-instruct
  template:
    metadata:
      labels:
        app: llama-31-70b-instruct
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-H100-80GB-HBM3
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: llama-31-70b-instruct
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: llama-31-70b-instruct
        image: vllm/vllm-openai:v0.6.4 
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 32Gi
            nvidia.com/gpu: "2" 
          requests:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /.cache
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 240
          periodSeconds: 5
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          allowPrivilegeEscalation: false
      runtimeClassName: nvidia

There are several lines in the manifest above that you need to adjust.

  • You need to adjust the image version tag to the vLLM release that supports the LLaMA 3.1 model. It is set on the following image line of the container spec.

 image: vllm/vllm-openai:v0.6.4 
  • The model can be run on 2 GPUs, but you need to reduce --max-model-len to 16000. The consequence is that the model can only process 16000 input tokens. To do this, change the args section in deployment.yaml as follows (a sketch for a 4-GPU setup is shown after this snippet).

args: [
"vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --max-model-len 16000 --enforce-eager"
]
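If your project provides 4 GPUs, the same command can serve the model without lowering --max-model-len. The snippet below is only an illustrative sketch, not a tested configuration from this guide: the GPU count is an assumption that must match your quota, and the nvidia.com/gpu values in the resources block must be raised to 4 as well.

# Illustrative sketch: assumes 4 GPUs are available in your quota and that
# the resources requests/limits for nvidia.com/gpu are raised to "4" to match.
args: [
"vllm serve meta-llama/Llama-3.1-70B-Instruct --gpu-memory-utilization 0.95 --tensor-parallel-size 4 --enforce-eager"
]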
  • Detailed Parameters for vllm serve. The vllm serve command is used to start the vLLM server. Here are the detailed parameters used in the command:

    1. <model>: The name of the model to serve. In this case, it is meta-llama/Llama-3.1-70B-Instruct.

    2. --gpu-memory-utilization <float>: The GPU memory utilization factor. This controls how much of the available GPU memory is used by the model. A value of 0.95 means 95% of the GPU memory will be used.

    3. --tensor-parallel-size <int>: The number of GPUs to use for tensor parallelism. This helps in distributing the model across multiple GPUs. In this command, it is set to 2.

    4. --max-model-len <int>: The maximum length of the input tokens the model can process. Reducing this value can help in running the model with fewer GPUs.

    5. --enforce-eager: Forces eager-mode execution instead of CUDA graph capture, which reduces GPU memory overhead at the cost of some throughput.
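
The manifest also pins scheduling to a specific GPU product (NVIDIA-H100-80GB-HBM3) through nodeAffinity. If your cluster offers a different GPU type, check the label value exposed on your nodes and update that value to match. The command below is a minimal sketch, assuming your nodes carry the nvidia.com/gpu.product label (set by NVIDIA GPU feature discovery).

# Show each node together with its nvidia.com/gpu.product label value
kubectl get nodes -L nvidia.com/gpu.product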

If you are using a Linux operating system, run the following command. If you are using a Windows operating system, save the file as deployment.yaml, open CMD, navigate to the folder that contains the deployment.yaml file, and run the following command.

kubectl apply -f deployment.yaml
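
After applying the manifest, you can do a quick check that the pod has been scheduled and follow the model download and startup logs. The commands below are a minimal sketch, assuming the vllm namespace from the Create Namespace sub-chapter; complete checks are covered in the Verify Deployment sub-chapter.

# List the pods in the vllm namespace and stream the deployment's logs
kubectl get pods -n vllm
kubectl logs -f deployment/llama-31-70b-instruct -n vllm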

To delete the deployment.yaml configuration that has been applied, run the following command.

kubectl delete -f deployment.yaml -n [namespace]

Replace [namespace] with the namespace you created in the Create Namespace sub-chapter.
