
Deploy NKP Clusters

This section will take you through installing NKP (Kubernetes) on a Nutanix cluster, as we will be deploying AI applications on these Kubernetes clusters.

We will use the CAPI-based deployment of NKP. This automatically deploys the required infrastructure VMs for the cluster by connecting to the Nutanix cluster APIs. There is no requirement to use Terraform or other IaC tools to deploy NKP.

Over time, this section will expand to cover other Kubernetes implementations available on Nutanix.

stateDiagram-v2
    direction LR

    state DeployNKP {
        [*] --> CreateNkpMachineImage
        CreateNkpMachineImage --> CreateNKPCluster
        CreateNKPCluster --> GenerateLicense
        GenerateLicense --> InstallLicense
        InstallLicense --> DeployGpuNodePool
        DeployGpuNodePool --> EnableGpuOperator
        EnableGpuOperator --> [*]
    }

    PrepWorkstation --> DeployJumpHost 
    DeployJumpHost --> DeployNKP 
    DeployNKP --> DeployNai : Next section

Deploying NKP Cluster

This lab will focus on deploying NKP to host NAI workloads. However, the steps can also be used to deploy a custom NKP cluster if that is your aim.

Consider using the NKP The Hard Way section to create a customized version of your NKP cluster.

Once you have determined the resource requirements for a custom NKP deployment, modify the environment variables and values in the .env file to suit your resource needs for your NKP cluster.

NKP High Level Cluster Design

The nkpdev cluster will host the LLM model serving endpoints and the AI application stack. This cluster will require a dedicated GPU node pool.

Sizing Requirements

Below are the sizing requirements needed to successfully deploy NAI on an NKP cluster (labeled nkpdev) and subsequently deploy a single LLM inferencing endpoint on NAI using the meta-llama/Meta-Llama-3-8B-Instruct model.

Calculating GPU Resources Tips

The calculations below assume that you're already aware of how much memory is required to load the target LLM model.

For a general example:

  • To host an 8-billion (8B) parameter model, multiply the parameter count (in billions) by 2 to get the minimum GPU memory requirement, e.g., 16 GB of GPU memory is required for an 8B parameter model.

So in the case of the meta-llama/Meta-Llama-3-8B-Instruct model, you will need a minimum of 16 GiB of GPU vRAM available.
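As a quick worked check of the rule above (a minimal sketch; it does not account for context length or serving overhead, which is why the demo below asks for additional headroom):

    # parameters (in billions) x 2 ~= minimum GPU memory in GB
    MODEL_PARAMS_B=8    # meta-llama/Meta-Llama-3-8B-Instruct
    echo "$((MODEL_PARAMS_B * 2)) GB minimum GPU memory"    # -> 16 GB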

Below are additional "rule of thumb" sizing considerations for calculating minimum GPU node resources:

  • Each GPU node starts with 8 CPU cores, 24 GB of memory, and 300 GB of disk space.
  • For each GPU attached to the node, add 16 GiB of memory.
  • For each endpoint attached to the node, add 8 CPU cores.
  • If a model needs multiple GPUs, ensure all GPUs are attached to the same worker node
  • For resiliency, while running multiple instances of the same endpoint, ensure that the GPUs are on different worker nodes.

Since we will be testing with the meta-llama/Meta-Llama-3-8B-Instruct HuggingFace model, we will require a GPU with a minimum of 24 GiB of vRAM available to support this demo.

Note

The GPU should have a minimum of 24 GB of vRAM, such as the NVIDIA L4.

Below are minimum requirements for deploying NAI on the NKP Demo Cluster.

Role            No. of Nodes (VM)   vCPU per Node   Memory per Node   Storage per Node   Total vCPU   Total Memory
Control plane   3                   4               16 GB             150 GB             12           48 GB
Worker          4                   8               32 GB             150 GB             32           128 GB
GPU             1                   16              40 GB             300 GB             16           40 GB
Totals          -                   -               -                 -                  60           216 GB

Pre-requisites for NKP Deployment

  1. Existing Ubuntu Linux jumphost VM. See here for jumphost installation steps.
  2. Docker or Podman installed on the jumphost VM
  3. Nutanix PC (Prism Central) version 2024.1 or later
  4. Nutanix AOS version 6.5 or 6.8+
  5. Download and install nkp binary from Nutanix Portal
  6. Find and reserve 3 IPs for control plane and MetalLB access from the AHV network
  7. Find GPU details from the Nutanix cluster
  8. Create a base image to use with NKP nodes using the nkp command

Install NKP Binaries

  1. Login to Nutanix Portal using your credentials
  2. Go to Downloads > Nutanix Kubernetes Platform (NKP)
  3. Select NKP for Linux and copy the download link to the .tar.gz file
  4. If you haven't already done so, open a new VSCode window on your jumphost VM

  5. In VSCode Explorer pane, click on existing $HOME folder

  6. Click on New Folder and name it: nkp

  7. In the VSCode Explorer pane, click the $HOME/nkp folder

  8. On VSCode menu, select Terminal > New Terminal

  9. Browse to nkp directory

    cd $HOME/nkp
    
  10. Download and extract the NKP binary from the link you copied earlier

    Paste the download URL within double quotes
    curl -o nkp_v2.14.0_linux_amd64.tar.gz "_paste_download_URL_here"
    
    curl -o nkp_v2.14.0_linux_amd64.tar.gz "https://download.nutanix.com/downloads/nkp/v2.14.0/nkp_v2.14.0_linux_amd64.tar.gz?Expires=1729016864&........"
    
    tar xvfz nkp_v2.14.0_linux_amd64.tar.gz
    
  11. Move the nkp binary to a directory that is included in your PATH environment variable

    sudo cp nkp /usr/local/bin/
    
  12. Verify the nkp binary is installed correctly. Ensure the version is the latest

    Note

    At the time of writing this lab, the nkp version is v2.14.0

    nkp version
    
    $ nkp version
    diagnose: v0.10.1
    imagebuilder: v0.22.3
    kommander: v2.14.0
    konvoy: v2.14.0
    mindthegap: v1.16.0
    nkp: v2.14.0
    

Setup Docker on Jumphost

If not already done, follow the steps in Setup Docker on Jumphost section.
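To quickly confirm Docker is working on the jumphost before proceeding, a simple check such as the following is enough:

    docker --version
    docker run --rm hello-world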

Reserve Control Plane and MetalLB IP

The Nutanix AHV IPAM network allows you to blacklist IPs that need to be reserved for specific application endpoints. We will use this feature to find and reserve three IPs.

We will reserve a total of three IPs for the following:

Cluster Role   Cluster Name   NKP   NAI
Dev            nkpdev         2     1
  1. Get the CIDR range for the AHV network (subnet) where the application will be deployed

    CIDR example for your Nutanix cluster
    10.x.x.0/24
    
  2. From VSC, logon to your jumphost VM and open a Terminal

  3. Install nmap tool (if not already done)

    cd $HOME/sol-cnai-infra
    devbox add nmap
    
  4. Find three unused static IP addresses in the subnet

    nmap -v -sn  <your CIDR>
    
    nmap -v -sn 10.x.x.0/24
    
    Sample output - choose the first three consecutive IPs
    Nmap scan report for 10.x.x.214 [host down]
    Nmap scan report for 10.x.x.215 [host down]
    Nmap scan report for 10.x.x.216 [host down]
    Nmap scan report for 10.x.x.217
    Host is up (-0.098s latency).
    
  5. Logon to any CVM in your Nutanix cluster and execute the following to add chosen static IPs to the AHV IPAM network

    • Username: nutanix
    • Password: your Prism Element password
    acli net.add_to_ip_blacklist <your-ipam-ahv-network> \
    ip_list=10.x.x.214,10.x.x.215,10.x.x.216
    
    acli net.add_to_ip_blacklist User1 \
    ip_list=10.x.x.214,10.x.x.215,10.x.x.216
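    You can optionally confirm the reservation from the same CVM session; acli net.get should display the network configuration, including the blacklisted IPs (network name assumed from the example above):

    acli net.get User1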
    

Reservation of IPs

Reserve the first IP for the NKP control plane. Reserve the second and third IPs for the MetalLB distributed load balancer; we will use one of these IPs for NAI.

The third IP (the last in the MetalLB range) is the one we will use for NAI. We will use the NAI IP in the next NAI section to assign the FQDN and install the SSL certificate.

Component               IP                      FQDN
NKP Control Plane VIP   10.x.x.214              -
NKP MetalLB IP Range    10.x.x.215-10.x.x.216   -
NAI                     10.x.x.216              nai.10.x.x.216.nip.io

Create Base Image for NKP

About NKP Base Image OS Version on Nutanix Cluster

The base image for NKP is a minimal image that contains the required packages and tools to run the Kubernetes cluster. The base image is used to create the worker node VMs and the control plane VMs.

The NKP base image can be a Rocky Linux 9.4 image, which is included with the NKP Starter license. This image is maintained and supported by Nutanix and is updated regularly to include the latest security patches and bug fixes. Customers should not modify the base image.

The NKP Pro license also offers the choice of an Ubuntu 22.04 base image for GPU-based workload deployments.

In this section, we will go through creating a base image for all the control plane and worker node VMs on Nutanix. We will use the Ubuntu 22.04 image as the base image since we will need GPU support for AI applications. NVIDIA GPU drivers are not yet available for the Rocky Linux 9.4 base image.

NKP Cloud Support

For information about other supported operating systems for Nutanix Kubernetes Platform (NKP), see NKP Cloud Support.

  1. In the VSC Explorer pane, click on New Folder

  2. Name the folder nkp under the $HOME directory (skip this step if the folder already exists from the binary installation)

  3. In the nkp folder, click on New File and create a new file with the following name:

    .env
    
  4. Run the following command to generate a new RSA key pair on the jumphost VM. This SSH key pair will be used for authentication between the jumphost and the NKP K8s cluster nodes.

    Do you have an existing SSH key pair?

    If so, copy the key pair from your workstation (PC/Mac) to the ~/.ssh/ directory on your jumphost VM:

    mac/pc $ scp ~/.ssh/id_rsa.pub ubuntu@10.x.x.171:~/.ssh/id_rsa.pub
    mac/pc $ scp ~/.ssh/id_rsa ubuntu@10.x.x.171:~/.ssh/id_rsa
    
    ssh-keygen -t rsa
    

    Accept the default file location as ~/.ssh/id_rsa

    The SSH key pair will be stored in the following locations:

    ~/.ssh/id_rsa.pub 
    ~/.ssh/id_rsa
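    If you copied an existing key pair from your workstation, also make sure the key file permissions are restrictive, for example:

    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/id_rsa
    chmod 644 ~/.ssh/id_rsa.pub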
    
  5. Fill the following values inside the .env file

    export NUTANIX_USER=_your_nutanix_username
    export NUTANIX_PASSWORD=_your_nutanix_password
    export NUTANIX_ENDPOINT=_your_prism_central_fqdn
    export NUTANIX_CLUSTER=_your_prism_element_cluster_name
    export NUTANIX_SUBNET_NAME=_your_ahv_ipam_network_name
    export STORAGE_CONTAINER=_your_storage_container_name
    export SSH_PUBLIC_KEY=_path_to_ssh_pub_key_on_jumphost_vm
    export NKP_CLUSTER_NAME=_your_nkp_cluster_name
    export CONTROLPLANE_VIP=_your_nkp_cluster_controlplane_ip
    export LB_IP_RANGE=_your_range_of_two_ips
    
    export NUTANIX_USER=admin
    export NUTANIX_PASSWORD=xxxxxxxx
    export NUTANIX_ENDPOINT=pc.example.com
    export NUTANIX_CLUSTER=pe
    export NUTANIX_SUBNET_NAME=User1
    export STORAGE_CONTAINER=default
    export SSH_PUBLIC_KEY=$HOME/.ssh/id_rsa.pub
    export NKP_CLUSTER_NAME=nkpdev
    export CONTROLPLANE_VIP=10.x.x.214
    export LB_IP_RANGE=10.x.x.215-10.x.x.216
    
  6. Using the VSC terminal, load the environment variables and their values

    source $HOME/nkp/.env
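    Optionally, verify that the key variables resolved to non-empty values before building the image (a minimal bash sketch):

    for v in NUTANIX_ENDPOINT NUTANIX_CLUSTER NUTANIX_SUBNET_NAME NUTANIX_USER; do
      echo "$v=${!v}"
    done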
    
  7. Create the base image and upload it to Prism Central using the following command.

    Note

    Image creation will take up to 5 minutes.

    nkp create image nutanix ubuntu-22.04 \
      --endpoint ${NUTANIX_ENDPOINT} --cluster ${NUTANIX_CLUSTER} \
      --subnet ${NUTANIX_SUBNET_NAME} --insecure
    
    nkp create image nutanix ubuntu-22.04 \
      --endpoint pc.example.com --cluster pe \
      --subnet User1 --insecure
    
    nkp create image nutanix ubuntu-22.04 \
      --endpoint pc.example.com --cluster pe \
      --subnet User1 --insecure
    
    > Provisioning and configuring image
    Manifest files extracted to $HOME/nkp/.nkp-image-builder-3243021807
    nutanix.kib_image: output will be in this color.
    
    ==> nutanix.kib_image: Creating Packer Builder virtual machine...
        nutanix.kib_image: Virtual machine nkp-ubuntu-22.04-1.29.6-20240717082720 created
        nutanix.kib_image: Found IP for virtual machine: 10.x.x.234
    ==> nutanix.kib_image: Running post-processor: packer-manifest (type manifest)
    
    ---> 100%
    Build 'nutanix.kib_image' finished after 4 minutes 55 seconds.
    ==> Wait completed after 4 minutes 55 seconds
    
    ==> Builds finished. The artifacts of successful builds are:
    --> nutanix.kib_image: nkp-ubuntu-22.04-1.29.6-20240717082720
    --> nutanix.kib_image: nkp-ubuntu-22.04-1.29.6-20240717082720
    

    Image name - This will be different in your environment

    Note the image name from the previous nkp create image command output:

    ==> Builds finished. The artifacts of successful builds are:
    --> nutanix.kib_image: nkp-ubuntu-22.04-1.31.4-20250320042646
    

    Warning

    Make sure to use the image name generated in your environment for the next steps.

  8. Populate the .env file with the NKP image name by adding (appending) the following environment variable, and save the file

    export NKP_IMAGE=nkp-image-name
    
    export NKP_IMAGE=nkp-ubuntu-22.04-1.31.4-20250320042646
    

We are now ready to install the workload nkpdev cluster.

Create NKP Workload Cluster

  1. Open the .env file in VSC, add (append) the following environment variables to it, and save the file

    export CONTROL_PLANE_REPLICAS=_no_of_control_plane_replicas
    export CONTROL_PLANE_VCPUS=_no_of_control_plane_vcpus
    export CONTROL_PLANE_CORES_PER_VCPU=_no_of_control_plane_cores_per_vcpu
    export CONTROL_PLANE_MEMORY_GIB=_no_of_control_plane_memory_gib
    export WORKER_REPLICAS=_no_of_worker_replicas
    export WORKER_VCPUS=_no_of_worker_vcpus
    export WORKER_CORES_PER_VCPU=_no_of_worker_cores_per_vcpu
    export WORKER_MEMORY_GIB=_no_of_worker_memory_gib
    export CSI_FILESYSTEM=_preferred_filesystem_ext4/xfs
    export CSI_HYPERVISOR_ATTACHED=_true/false
    export DOCKER_USERNAME=_your_docker_username
    export DOCKER_PASSWORD=_your_docker_password
    export NUTANIX_PROJECT_NAME=_your_pc_project_name
    
    export CONTROL_PLANE_REPLICAS=3
    export CONTROL_PLANE_VCPUS=4
    export CONTROL_PLANE_CORES_PER_VCPU=1
    export CONTROL_PLANE_MEMORY_GIB=16
    export WORKER_REPLICAS=4
    export WORKER_VCPUS=8 
    export WORKER_CORES_PER_VCPU=1
    export WORKER_MEMORY_GIB=32
    export CSI_FILESYSTEM=ext4
    export CSI_HYPERVISOR_ATTACHED=true
    export DOCKER_USERNAME=_your_docker_username
    export DOCKER_PASSWORD=_your_docker_password
    export NUTANIX_PROJECT_NAME=dev-lab
    
  2. Source the new variables and values to the environment

    source $HOME/nkp/.env
    
  3. In VSC, open Terminal, enter the following command to create the workload cluster

    Check your command for correct argument values

    Run the following command to verify your nkp command and associated environment variables and values.

    echo "nkp create cluster nutanix -c ${NKP_CLUSTER_NAME} \
            --control-plane-endpoint-ip ${CONTROLPLANE_VIP} \
            --control-plane-prism-element-cluster ${NUTANIX_CLUSTER} \
            --control-plane-subnets ${NUTANIX_SUBNET_NAME} \
            --control-plane-vm-image ${NKP_IMAGE} \
            --csi-storage-container ${STORAGE_CONTAINER} \
            --endpoint https://${NUTANIX_ENDPOINT}:9440 \
            --worker-prism-element-cluster ${NUTANIX_CLUSTER} \
            --worker-subnets ${NUTANIX_SUBNET_NAME} \
            --worker-vm-image ${NKP_IMAGE} \
            --ssh-public-key-file ${SSH_PUBLIC_KEY} \
            --kubernetes-service-load-balancer-ip-range ${LB_IP_RANGE} \
            --control-plane-disk-size 150 \
            --control-plane-memory ${CONTROL_PLANE_MEMORY_GIB} \
            --control-plane-vcpus ${CONTROL_PLANE_VCPUS} \
            --control-plane-cores-per-vcpu ${CONTROL_PLANE_CORES_PER_VCPU} \
            --worker-disk-size 150 \
            --worker-memory ${WORKER_MEMORY_GIB} \
            --worker-vcpus ${WORKER_VCPUS} \
            --worker-cores-per-vcpu ${WORKER_CORES_PER_VCPU} \
            --csi-file-system ${CSI_FILESYSTEM} \
            --csi-hypervisor-attached-volumes=${CSI_HYPERVISOR_ATTACHED} \
            --registry-mirror-url "https://registry-1.docker.io" \
            --registry-mirror-username ${DOCKER_USERNAME} \
            --registry-mirror-password ${DOCKER_PASSWORD} \
            --control-plane-pc-project ${NUTANIX_PROJECT_NAME} \
            --worker-pc-project ${NUTANIX_PROJECT_NAME} \
            --self-managed \
            --insecure"
    

    If the values are incorrect, add the correct values to .env and source the file again by running the following command

    source $HOME/nkp/.env
    

    Then rerun the echo nkp command to verify the values again before running the nkp create cluster nutanix command.

    nkp create cluster nutanix -c ${NKP_CLUSTER_NAME} \
        --control-plane-endpoint-ip ${CONTROLPLANE_VIP} \
        --control-plane-prism-element-cluster ${NUTANIX_CLUSTER} \
        --control-plane-subnets ${NUTANIX_SUBNET_NAME} \
        --control-plane-vm-image ${NKP_IMAGE} \
        --csi-storage-container ${STORAGE_CONTAINER} \
        --endpoint https://${NUTANIX_ENDPOINT}:9440 \
        --worker-prism-element-cluster ${NUTANIX_CLUSTER} \
        --worker-subnets ${NUTANIX_SUBNET_NAME} \
        --worker-vm-image ${NKP_IMAGE} \
        --ssh-public-key-file ${SSH_PUBLIC_KEY} \
        --kubernetes-service-load-balancer-ip-range ${LB_IP_RANGE} \
        --control-plane-disk-size 150 \
        --control-plane-memory ${CONTROL_PLANE_MEMORY_GIB} \
        --control-plane-vcpus ${CONTROL_PLANE_VCPUS} \
        --control-plane-cores-per-vcpu ${CONTROL_PLANE_CORES_PER_VCPU} \
        --worker-disk-size 150 \
        --worker-memory ${WORKER_MEMORY_GIB} \
        --worker-vcpus ${WORKER_VCPUS} \
        --worker-cores-per-vcpu ${WORKER_CORES_PER_VCPU} \
        --csi-file-system ${CSI_FILESYSTEM} \
        --csi-hypervisor-attached-volumes=${CSI_HYPERVISOR_ATTACHED} \
        --registry-mirror-url "https://registry-1.docker.io" \
        --registry-mirror-username ${DOCKER_USERNAME} \
        --registry-mirror-password ${DOCKER_PASSWORD} \
        --control-plane-pc-project ${NUTANIX_PROJECT_NAME} \
        --worker-pc-project ${NUTANIX_PROJECT_NAME} \
        --self-managed \
        --insecure
    
    > βœ“ Creating a bootstrap cluster 
    βœ“ Upgrading CAPI components 
    βœ“ Waiting for CAPI components to be upgraded 
    βœ“ Initializing new CAPI components 
    βœ“ Creating ClusterClass resources 
    βœ“ Creating ClusterClass resources
    > Generating cluster resources
    cluster.cluster.x-k8s.io/nkpdev created
    secret/nkpdev-pc-credentials created
    secret/nkpdev-pc-credentials-for-csi created
    secret/nkpdev-image-registry-credentials created
    βœ“ Waiting for cluster infrastructure to be ready 
    βœ“ Waiting for cluster control-planes to be ready 
    βœ“ Waiting for machines to be ready
    βœ“ Initializing new CAPI components 
    βœ“ Creating ClusterClass resources 
    βœ“ Moving cluster resources
    
    > You can now view resources in the moved cluster by using the --kubeconfig flag with kubectl.
    For example: kubectl --kubeconfig="$HOME/nkp/nkpdev.conf" get nodes
    
    > βœ“ Deleting bootstrap cluster 
    
    Cluster default/nkpdev kubeconfig was written to to the filesystem.
    You can now view resources in the new cluster by using the --kubeconfig flag with kubectl.
    For example: kubectl --kubeconfig="$HOME/nkp/nkpdev.conf" get nodes
    
    > Starting kommander installation
    βœ“ Deploying Flux 
    βœ“ Deploying Ingress certificate 
    βœ“ Creating kommander-overrides ConfigMap
    βœ“ Deploying Git Operator 
    βœ“ Creating GitClaim for management GitRepository 
    βœ“ Creating GitClaimUser for accessing management GitRepository 
    βœ“ Creating HTTP Proxy configuration
    βœ“ Deploying Flux configuration
    βœ“ Deploying Kommander Operator 
    βœ“ Creating KommanderCore resource 
    βœ“ Cleaning up kommander bootstrap resources
    βœ“ Deploying Substitution variables
    βœ“ Deploying Flux configuration 
    βœ“ Deploying Gatekeeper 
    βœ“ Deploying Kommander AppManagement 
    βœ“ Creating Core AppDeployments 
    βœ“ 4 out of 12 core applications have been installed (waiting for dex, dex-k8s-authenticator and 6 more) 
    βœ“ 5 out of 12 core applications have been installed (waiting for dex-k8s-authenticator, kommander and 5 more) 
    βœ“ 7 out of 12 core applications have been installed (waiting for dex-k8s-authenticator, kommander and 3 more) 
    βœ“ 8 out of 12 core applications have been installed (waiting for dex-k8s-authenticator, kommander-ui and 2 more) 
    βœ“ 9 out of 12 core applications have been installed (waiting for dex-k8s-authenticator, kommander-ui and 1 more) 
    βœ“ 10 out of 12 core applications have been installed (waiting for dex-k8s-authenticator, traefik-forward-auth-mgmt) 
    βœ“ 11 out of 12 core applications have been installed (waiting for traefik-forward-auth-mgmt) 
    βœ“ Creating cluster-admin credentials
    
    > Cluster was created successfully! Get the dashboard details with:
    > nkp get dashboard --kubeconfig="$HOME/nkp/nkpdev.conf"
    

    What is a Self-Managed Cluster?

    The --self-managed argument of the nkp create cluster nutanix command deploys the bootstrap cluster and the Kommander management components automatically.

    The appendix section has information on how to deploy a cluster without using the --self-managed option.

    Deploying without --self-managed is usually preferred by customer DevOps teams that want more control over the deployment process. This way the customer can do the following:

    • Deploy bootstrap (kind) cluster
    • Deploy NKP Management cluster
    • Choose to migrate the CAPI components over to NKP Management cluster
    • Choose to customize the Kommander Management component installation
    • Choose to deploy workload clusters from the NKP Kommander GUI, or
    • Choose to deploy workload clusters using scripts if they wish to automate the process

    See the NKP the Hard Way section for more information on customizable NKP cluster deployments.

  4. Observe the events in the shell and in Prism Central events

  5. Store the kubeconfig file for the bootstrap cluster

    kind get kubeconfig --name konvoy-capi-bootstrapper > bs.cfg
    export KUBECONFIG=bs.cfg
    
  6. Store the kubeconfig file for the workload cluster

    nkp get kubeconfig -c ${NKP_CLUSTER_NAME} > ${NKP_CLUSTER_NAME}.cfg
    
  7. Combine the bootstrap and workload cluster kubeconfig files so that we can use the kubectx command to change context between clusters

    export KUBECONFIG=bs.cfg:${NKP_CLUSTER_NAME}.cfg
    kubectl config view --flatten > all-in-one-kubeconfig.yaml
    export KUBECONFIG=all-in-one-kubeconfig.yaml
    
  8. Run the following command to check the Kubernetes node status of the nkpdev cluster

    kubectx ${NKP_CLUSTER_NAME}-admin@${NKP_CLUSTER_NAME} 
    kubectl get nodes
    
    $ kubectl get nodes
    
    NAME                                  STATUS   ROLES           AGE     VERSION
    nkpdev-md-0-x948v-hvxtj-9r698           Ready    <none>          4h49m   v1.29.6
    nkpdev-md-0-x948v-hvxtj-fb75c           Ready    <none>          4h50m   v1.29.6
    nkpdev-md-0-x948v-hvxtj-mdckn           Ready    <none>          4h49m   v1.29.6
    nkpdev-md-0-x948v-hvxtj-shxc8           Ready    <none>          4h49m   v1.29.6
    nkpdev-r4fwl-8q4ch                      Ready    control-plane   4h50m   v1.29.6
    nkpdev-r4fwl-jf2s8                      Ready    control-plane   4h51m   v1.29.6
    nkpdev-r4fwl-q888c                      Ready    control-plane   4h49m   v1.29.6
    

Add NKP GPU Workload Pool

Are you just deploying NKP?

If you are doing this lab only to deploy NKP, then you can skip this GPU section.

The steps below cover the following:

  • Retrieving and applying the NKP Pro license
  • Identifying the GPU device name
  • Deploying the GPU nodepool
  • Enabling the NVIDIA GPU Operator

Note

To enable the GPU Operator afterwards using the NKP Marketplace, an NKP Pro license is required at a minimum.

Find GPU Device Details

As we will be deploying Nutanix Enterprise AI (NAI) in the next section, we need to find the GPU details beforehand.

Find the details of GPU on the Nutanix cluster while still connected to Prism Central (PC).

  1. Logon to Prism Central GUI
  2. On the general search, type GPUs
  3. Click on the GPUs result

  4. Lovelace 40S is the GPU available for use

  5. Use Lovelace 40S in the environment variables in the next section.

Create NKP GPU Workload Pool

In this section, we will create a nodepool with a GPU to host the AI apps.

  1. Open the .env file in VSC, add (append) the following environment variables to it, and save the file

    export GPU_NAME=_name_of_gpu_device_
    export GPU_REPLICA_COUNT=_no_of_gpu_worker_nodes
    export GPU_POOL=_name_of_gpu_pool
    export GPU_NODE_VCPUS=_no_of_gpu_node_vcpus
    export GPU_NODE_CORES_PER_VCPU=_per_gpu_node_cores_per_vcpu
    export GPU_NODE_MEMORY_GIB=_per_gpu_node_memory_gib
    export GPU_NODE_DISK_SIZE_GIB=_per_gpu_node_disk_size_gib
    
    export GPU_NAME="Lovelace 40S"
    export GPU_REPLICA_COUNT=1
    export GPU_POOL=gpu-nodepool
    export GPU_NODE_VCPUS=16
    export GPU_NODE_CORES_PER_VCPU=1
    export GPU_NODE_MEMORY_GIB=40
    export GPU_NODE_DISK_SIZE_GIB=200
    
  2. Source the new variables and values to the environment

    source $HOME/nkp/.env
    
  3. Run the following command to create a GPU nodepool manifest

    nkp create nodepool nutanix \
        --cluster-name ${NKP_CLUSTER_NAME} \
        --prism-element-cluster ${NUTANIX_CLUSTER} \
        --pc-project ${NUTANIX_PROJECT_NAME} \
        --subnets ${NUTANIX_SUBNET_NAME} \
        --vm-image ${NKP_IMAGE} \
        --disk-size ${GPU_NODE_DISK_SIZE_GIB} \
        --memory ${GPU_NODE_MEMORY_GIB} \
        --vcpus ${GPU_NODE_VCPUS} \
        --cores-per-vcpu ${GPU_NODE_CORES_PER_VCPU} \
        --replicas ${GPU_REPLICA_COUNT} \
        --wait \
        ${GPU_POOL} --dry-run -o yaml > gpu-nodepool.yaml
    

    Note

    Right now there is no GPU switch in the nkp command. We need to do a dry run, write the output to a file, and then add the necessary GPU specifications.

  4. Add the necessary gpus section to our new gpu-nodepool.yaml using the yq command

    yq e '(.spec.topology.workers.machineDeployments[] | select(.name == "gpu-nodepool").variables.overrides[] | select(.name == "workerConfig").value.nutanix.machineDetails) += {"gpus": [{"type": "name", "name": strenv(GPU_NAME)}]}' -i gpu-nodepool.yaml
    
    Successful addition of GPU specs?

    You should be able to see the added gpus section at the end of the gpu-nodepool.yaml file:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    
    <snip>
    
      name: gpu-nodepool
      variables:
        overrides:
          - name: workerConfig
            value:
              nutanix:
                machineDetails:
                  bootType: legacy
                  cluster:
                    name: romanticism
                    type: name
                  image:
                    name: nkp-ubuntu-22.04-1.29.6-20240718055804
                    type: name
                  memorySize: 40Gi
                  subnets:
                    - name: User1
                      type: name
                  systemDiskSize: 200Gi
                  vcpuSockets: 16
                  vcpusPerSocket: 1
                  gpus:
                    - type: name
                      name: Lovelace 40S
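    To print just the added gpus block for a quick check, a yq query along these lines should work (paths taken from the sample above):

    yq e '.spec.topology.workers.machineDeployments[]
      | select(.name == "gpu-nodepool")
      | .variables.overrides[]
      | select(.name == "workerConfig")
      | .value.nutanix.machineDetails.gpus' gpu-nodepool.yaml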
    
  5. Monitor the Cluster API resources to ensure the GPU machine is created successfully

    watch kubectl get cluster-api
    
    NAME                                                          CLUSTER   REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE       AGE    VERSION
    machinedeployment.cluster.x-k8s.io/nkplb-gpu-nodepool-mpr4d   nkplb     1                  1         1             ScalingUp   12s    v1.31.4
    machinedeployment.cluster.x-k8s.io/nkplb-md-0-d6cm7           nkplb     4          4       4         0             Running     159m   v1.31.4
    
  6. Apply the gpu-nodepool.yaml file to the workload cluster

    kubectl apply -f gpu-nodepool.yaml
    
  7. Monitor the progress of the command and check Prism Central events for creation of the GPU worker node

    Change to workload nkpdev cluster context

    kubectx ${NKP_CLUSTER_NAME}-admin@${NKP_CLUSTER_NAME}
    
  8. Check the node status in the workload nkpdev cluster and note the GPU worker node

    kubectl get nodes -w
    
    $ kubectl get nodes
    
    NAME                                   STATUS   ROLES           AGE     VERSION
    nkpdev-gpu-nodepool-7g4jt-2p7l7-49wvd   Ready    <none>          5m57s   v1.29.6
    nkpdev-md-0-q679c-khl2n-9k7jk           Ready    <none>          74m     v1.29.6
    nkpdev-md-0-q679c-khl2n-9nk6h           Ready    <none>          74m     v1.29.6
    nkpdev-md-0-q679c-khl2n-nf9p6           Ready    <none>          73m     v1.29.6
    nkpdev-md-0-q679c-khl2n-qgxp9           Ready    <none>          74m     v1.29.6
    nkpdev-ncnww-2dg7h                      Ready    control-plane   73m     v1.29.6
    nkpdev-ncnww-bbm4s                      Ready    control-plane   72m     v1.29.6
    nkpdev-ncnww-hldm9                      Ready    control-plane   75m     v1.29.6
    

Licensing

We need to generate a license for the NKP cluster based on the total number of vCPUs used by the worker nodes.

For example, in the Sizing Requirements section, the NKP Demo Cluster total vCPU count is 60, whereas the worker nodes alone account for only 48 vCPUs.
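As a quick cross-check against the Sizing Requirements table:

    # Worker pool: 4 workers x 8 vCPUs, GPU pool: 1 node x 16 vCPUs
    echo $((4*8 + 1*16))          # 48 worker vCPUs
    # Adding the control plane (3 nodes x 4 vCPUs) gives the cluster total
    echo $((4*8 + 1*16 + 3*4))    # 60 total vCPUs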

Generate NKP Pro License

To generate an NKP Pro license for the NKP cluster:

Note

Nutanix internal users should log on using Nutanix SSO.

Nutanix partners/customers should log on to the Portal using their Nutanix Portal account credentials.

  1. Login to Nutanix Portal using your credentials
  2. Go to Licensing > License Summary
  3. Click on the small drop down arrow on Manage Licenses and choose Nutanix Kubernetes Platform (NKP)
  4. Input the NKP cluster name
  5. Click on the plus icon
  6. Click on Next in the bottom right corner
  7. Select NKP Pro License
  8. Select Apply to cluster
  9. Choose Non-production license and Save
  10. Select the cluster name and click on Next
  11. Input the number of vCPU (60) from our calculations in the previous section
  12. Click on Save
  13. Download the csv file and store it in a safe place

Applying NKP Pro License to NKP Cluster

  1. Log in to the Kommander URL for the nkpdev cluster with the credentials that were generated in the previous section. The following command will give you the credentials and URL.

    nkp get dashboard
    
    nkp get dashboard
    
    Username: recursing_xxxxxxxxx
    Password: YHbPsslIDB7p7rqwnfxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    URL: https://10.x.x.215/dkp/kommander/dashboard
    
  2. Go to Licensing and click on Remove License to remove the Starter license

  3. Type nutanix-license in the confirmation box and click on Remove License
  4. Click on Add License, choose Nutanix platform and paste the license key from the previous section
  5. Click on Save
  6. Confirm the license is applied to the cluster by checking the License Status in the License menu
  7. The license will be applied to the cluster and the license status will reflect NKP Pro in the top right corner of the dashboard

Enable NKP Operators

Enable these NKP operators from the NKP GUI.

Note

In this lab, we will be using the Management Cluster Workspace to deploy our Nutanix Enterprise AI (NAI)

However, in a customer environment, it is recommended to use a separate workload NKP cluster.

  1. In the NKP GUI, Go to Clusters
  2. Click on Management Cluster Workspace
  3. Go to Applications
  4. Search for and enable the following operators in this order to avoid dependency issues:

    • Prometheus Monitoring: version 69.1.2 or later
    • Prometheus Adapter: version v4.11.0 or later
    • Istio Service Mesh: version 1.20.8 or later
    • Knative-serving: version 1.13.1 or later

Note

It may take a few minutes for each application to be up and running. Monitor the deployment to make sure that these applications are running before moving on to the next section.
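If you prefer to watch progress from the terminal, these applications are reconciled by Flux, so a query along these lines should show their status (a cluster-wide query is used here since namespace placement can vary by workspace):

    kubectl get helmreleases -A | grep -iE 'prometheus|istio|knative'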

GPU Operator

We will need to enable the GPU Operator for deploying the NAI application.

  1. In the NKP GUI, Go to Clusters
  2. Click on Management Cluster Workspace
  3. Go to Applications
  4. Search for NVIDIA GPU Operator
  5. Click on Enable
  6. Click on Configuration tab
  7. Click on Workspace Application Configuration Override and paste the following yaml content

    driver:
      enabled: true
    

    As shown here:

  8. Click on Enable in the top right-hand corner to enable the GPU driver on the Ubuntu GPU nodes

  9. Check GPU operator resources and make sure they are running

    kubectl get po -A | grep -i nvidia
    
    kubectl get po -A | grep -i nvidia
    
    nvidia-container-toolkit-daemonset-fjzbt                          1/1     Running     0          28m
    nvidia-cuda-validator-f5dpt                                       0/1     Completed   0          26m
    nvidia-dcgm-exporter-9f77d                                        1/1     Running     0          28m
    nvidia-dcgm-szqnx                                                 1/1     Running     0          28m
    nvidia-device-plugin-daemonset-gzpdq                              1/1     Running     0          28m
    nvidia-driver-daemonset-dzf55                                     1/1     Running     0          28m
    nvidia-operator-validator-w48ms                                   1/1     Running     0          28m
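    With the operator pods running, the GPU should also show up as an allocatable resource on the GPU worker node; one quick way to check:

    kubectl describe nodes | grep -i "nvidia.com/gpu"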
    
  10. Run a sample GPU workload to confirm GPU operations

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: k8s.gcr.io/cuda-vector-add:v0.1
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF
    
    pod/cuda-vector-add created
    
  11. Follow the logs to check if the GPU operations are successful

    kubectl logs _gpu_workload_pod_name
    
    kubectl logs cuda-vector-add
    
    kubectl logs cuda-vector-add
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
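    Optionally, delete the test pod once you have confirmed the result:

    kubectl delete pod cuda-vector-add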
    

Now we are ready to deploy our AI workloads.