Deploying Nutanix Enterprise AI (NAI) NVD Reference Application
Version 2.0.0
This version of the NAI deployment is based on the Nutanix Enterprise AI (NAI) v2.0.0
release.
```mermaid
stateDiagram-v2
    direction LR

    state DeployNAI {
        [*] --> DeployNAIAdmin
        DeployNAIAdmin --> InstallSSLCert
        InstallSSLCert --> DownloadModel
        DownloadModel --> CreateNAI
        CreateNAI --> [*]
    }

    [*] --> PreRequisites
    PreRequisites --> DeployNAI
    DeployNAI --> TestNAI : next section
    TestNAI --> [*]
```
Prepare for NAI Deployment
- Login to VSCode on the jumphost VM, append the following environment variables to the `$HOME/airgap-nai/.env` file, and save it:
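The exact variable list is environment-specific; the following is a hedged sketch based on values that appear later in this guide. Entries marked as placeholders or assumptions are not from the original and must be replaced with your own values.

```bash
# Internal registry hosting the mirrored charts and images (placeholder values)
export REGISTRY_HOST=harbor.10.x.x.111.nip.io/nkp
export REGISTRY_URL=https://harbor.10.x.x.111.nip.io/nkp
export REGISTRY_USERNAME=admin
export REGISTRY_PASSWORD=xxxxxxx
export INTERNAL_REPO=<internal-helm-chart-repo-url>   # placeholder: chart repo in the internal registry

# Component versions referenced by the deployment scripts
export ISTIO_VERSION=1.20.8                # from the Istio deploy step below
export KNATIVE_VERSION=<knative-version>   # placeholder
export KSERVE_VERSION=v0.14.0              # assumption, matching the HF server tag below
export NAI_CORE_VERSION=2.0.0              # assumption, matching the NAI release
export NAI_API_VERSION=v2.0.0
export NAI_KSERVE_HF_SERVER_VERSION=v0.14.0
export NAI_TGI_RUNTIME_VERSION=2.3.1-825f39d
export NAI_PROMETHEUS_VERSION=v2.53.0

# Name prefix for the generated helm values file (nkp-values.yaml)
export ENVIRONMENT=nkp
```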
- In VSCode, go to Terminal and run the following commands to source the environment variables:
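A minimal sketch, assuming the file uses `export` statements as above:

```bash
source $HOME/airgap-nai/.env
```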
- In VSCode, under the newly created `airgap-nai` folder, click on New File and create a shell script with the following content:
```bash
#!/usr/bin/env bash
set -ex
set -o pipefail

## Deploy Istio 1.20.8
helm --insecure-skip-tls-verify=true upgrade --install istio-base base \
  --repo ${INTERNAL_REPO} --version=${ISTIO_VERSION} \
  -n istio-system --create-namespace --wait

helm --insecure-skip-tls-verify=true upgrade --install istiod istiod \
  --repo ${INTERNAL_REPO} --version=${ISTIO_VERSION} -n istio-system \
  --set gateways.securityContext.runAsUser=0 \
  --set gateways.securityContext.runAsGroup=0 --wait

helm --insecure-skip-tls-verify=true upgrade --install istio-ingressgateway gateway \
  --repo ${INTERNAL_REPO} --version=${ISTIO_VERSION} -n istio-system \
  --set securityContext.runAsUser=0 --set securityContext.runAsGroup=0 \
  --set containerSecurityContext.runAsUser=0 --set containerSecurityContext.runAsGroup=0 --wait

## Deploy Knative
helm --insecure-skip-tls-verify=true upgrade --install knative-serving-crds nai-knative-serving-crds \
  --repo ${INTERNAL_REPO} --version=${KNATIVE_VERSION} \
  -n knative-serving --create-namespace --wait

helm --insecure-skip-tls-verify=true upgrade --install knative-serving nai-knative-serving \
  --repo ${INTERNAL_REPO} -n knative-serving --version=${KNATIVE_VERSION} --wait

helm --insecure-skip-tls-verify=true upgrade --install knative-istio-controller nai-knative-istio-controller \
  --repo ${INTERNAL_REPO} -n knative-serving --version=${KNATIVE_VERSION} --wait

# Patch configurations stored in configmaps
kubectl patch configmap config-features -n knative-serving \
  -p '{"data":{"kubernetes.podspec-nodeselector":"enabled"}}'
kubectl patch configmap config-autoscaler -n knative-serving \
  -p '{"data":{"enable-scale-to-zero":"false"}}'
kubectl patch configmap config-domain -n knative-serving --type merge \
  -p '{"data":{"example.com":""}}'

# This patch of the config-deployment config map is necessary in an
# air-gapped environment: kserve will skip image tag checks for the
# self-hosted registry if the following is configured.
# Note: double quotes are used so that ${REGISTRY_HOST} is expanded.
kubectl patch configmap config-deployment -n knative-serving --type merge \
  -p "{\"data\":{\"registries-skipping-tag-resolving\":\"${REGISTRY_HOST}\"}}"

## Deploy Kserve
helm --insecure-skip-tls-verify=true upgrade --install kserve-crd kserve-crd \
  --repo ${INTERNAL_REPO} --version=${KSERVE_VERSION} -n kserve --create-namespace

helm --insecure-skip-tls-verify=true upgrade --install kserve kserve \
  --repo ${INTERNAL_REPO} --version=${KSERVE_VERSION} -n kserve \
  --set kserve.modelmesh.enabled=false \
  --set kserve.controller.image="${REGISTRY_HOST}/nutanix/nai-kserve-controller" \
  --set kserve.controller.tag=${KSERVE_VERSION} --wait
```
- Run the script from the Terminal:
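For example, assuming the script above was saved as `deploy-prereqs.sh` (a hypothetical name):

```bash
# Make the script executable and run it
chmod +x $HOME/airgap-nai/deploy-prereqs.sh
$HOME/airgap-nai/deploy-prereqs.sh
```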
Release "istiod" has been upgraded. Happy Helming! NAME: istiod LAST DEPLOYED: Tue Oct 15 02:01:58 2024 NAMESPACE: istio-system STATUS: deployed REVISION: 2 TEST SUITE: None NOTES: "istiod" successfully installed! NAME: istio-ingressgateway LAST DEPLOYED: Tue Oct 15 02:02:01 2024 NAMESPACE: istio-system STATUS: deployed NAME: knative-serving-crds LAST DEPLOYED: Tue Oct 15 02:02:03 2024 NAMESPACE: knative-serving STATUS: deployed NAME: knative-serving LAST DEPLOYED: Tue Oct 15 02:02:05 2024 NAMESPACE: knative-serving STATUS: deployed NAME: kserve-crd LAST DEPLOYED: Tue Oct 15 02:02:16 2024 NAMESPACE: kserve STATUS: deployed NAME: kserve LAST DEPLOYED: Tue Oct 15 02:02:19 2024 NAMESPACE: kserve STATUS: deployed
- Validate that the resources are running in the `istio-system`, `knative-serving`, and `kserve` namespaces:
```
$ k get po -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-6675867d85-qzrpq   1/1     Running   0          26d
istiod-6d96569c9b-2dww4                 1/1     Running   0          26d

$ k get po -n kserve
NAME                                         READY   STATUS    RESTARTS   AGE
kserve-controller-manager-6654f69d5c-45n64   2/2     Running   0          26d

$ k get po -n knative-serving
NAME                                   READY   STATUS    RESTARTS   AGE
activator-58db57894b-g2nx8             1/1     Running   0          26d
autoscaler-76f95fff78-c8q9m            1/1     Running   0          26d
controller-7dd875844b-4clqb            1/1     Running   0          26d
net-istio-controller-57486f879-85vml   1/1     Running   0          26d
net-istio-webhook-7ccdbcb557-54dn5     1/1     Running   0          26d
webhook-d8674645d-mscsc                1/1     Running   0          26d
```
Deploy NAI
- Source the environment variables (if not done already)
- In the VSCode Explorer pane, browse to the `$HOME/airgap-nai` folder
- Run the following command to create a Helm values file:
```bash
cat << EOF > ${ENVIRONMENT}-values.yaml
## Image pull secret. This is required for the huggingface image check by the
## Inference pod, as that does not go via the kubelet and does a direct check.
imagePullSecret:
  ## Name of the image pull secret
  name: nai-iep-secret
  ## Image registry credentials
  credentials:
    registry: ${REGISTRY_URL}
    username: ${REGISTRY_USERNAME}
    password: ${REGISTRY_PASSWORD}
    email: ${REGISTRY_USERNAME}@foobar.com

naiApi:
  naiApiImage:
    image: ${REGISTRY_HOST}/nutanix/nai-api
    tag: ${NAI_API_VERSION}
  supportedRuntimeImage: ${REGISTRY_HOST}/nutanix/nai-kserve-huggingfaceserver:${NAI_KSERVE_HF_SERVER_VERSION}
  supportedTGIImage: ${REGISTRY_HOST}/nutanix/nai-tgi
  supportedTGIImageTag: ${NAI_TGI_RUNTIME_VERSION}

naiIepOperator:
  iepOperatorImage:
    image: ${REGISTRY_HOST}/nutanix/nai-iep-operator
    tag: ${NAI_API_VERSION}
  modelProcessorImage:
    image: ${REGISTRY_HOST}/nutanix/nai-model-processor
    tag: ${NAI_API_VERSION}

naiInferenceUi:
  naiUiImage:
    image: ${REGISTRY_HOST}/nutanix/nai-inference-ui
    tag: ${NAI_API_VERSION}

naiDatabase:
  naiDbImage:
    image: ${REGISTRY_HOST}/nutanix/nai-postgres:16.1-alpine

naiMonitoring:
  prometheus:
    image:
      registry: ${REGISTRY_HOST}
      repository: prometheus/prometheus
      tag: ${NAI_PROMETHEUS_VERSION}

## nai-monitoring stack values for nai-monitoring stack deployment in NKE environment
naiMonitoring:
  ## Component scraping node exporter
  nodeExporter:
    serviceMonitor:
      enabled: true
      endpoint:
        port: http-metrics
        scheme: http
        targetPort: 9100
      namespaceSelector:
        matchNames:
          - kommander
      serviceSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus-node-exporter
          app.kubernetes.io/component: metrics
          app.kubernetes.io/version: 1.8.1
  ## Component scraping dcgm exporter
  dcgmExporter:
    podLevelMetrics: true
    serviceMonitor:
      enabled: true
      endpoint:
        targetPort: 9400
      namespaceSelector:
        matchNames:
          - kommander
      serviceSelector:
        matchLabels:
          app: nvidia-dcgm-exporter
EOF
```
The resulting values file (here for the `nkp` environment) will look similar to the following:

```yaml
## Image pull secret. This is required for the huggingface image check by the
## Inference pod, as that does not go via the kubelet and does a direct check.
imagePullSecret:
  ## Name of the image pull secret
  name: nai-iep-secret
  ## Image registry credentials
  credentials:
    registry: https://harbor.10.x.x.111.nip.io/nkp
    username: admin
    password: xxxxxxx
    email: admin@foobar.com

naiApi:
  naiApiImage:
    image: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-api
    tag: v2.0.0
  supportedRuntimeImage: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-kserve-huggingfaceserver:v0.14.0
  supportedTGIImage: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-tgi
  supportedTGIImageTag: "2.3.1-825f39d"

naiIepOperator:
  iepOperatorImage:
    image: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-iep-operator
    tag: v2.0.0
  modelProcessorImage:
    image: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-model-processor
    tag: v2.0.0

naiInferenceUi:
  naiUiImage:
    image: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-inference-ui
    tag: v2.0.0

naiDatabase:
  naiDbImage:
    image: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-postgres:16.1-alpine

naiMonitoring:
  prometheus:
    image:
      registry: harbor.10.x.x.111.nip.io/nkp
      repository: prometheus/prometheus
      tag: v2.53.0

# nai-monitoring stack values for nai-monitoring stack deployment in NKE environment
naiMonitoring:
  ## Component scraping node exporter
  nodeExporter:
    serviceMonitor:
      enabled: true
      endpoint:
        port: http-metrics
        scheme: http
        targetPort: 9100
      namespaceSelector:
        matchNames:
          - kommander
      serviceSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus-node-exporter
          app.kubernetes.io/component: metrics
          app.kubernetes.io/version: 1.8.1
  ## Component scraping dcgm exporter
  dcgmExporter:
    podLevelMetrics: true
    serviceMonitor:
      enabled: true
      endpoint:
        targetPort: 9400
      namespaceSelector:
        matchNames:
          - kommander
      serviceSelector:
        matchLabels:
          app: nvidia-dcgm-exporter
```
- In VSCode, under the `$HOME/airgap-nai` folder, click on New File and create a file named `nai-deploy.sh` (the name used when the script is run below) with the following content:
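The script body below is reconstructed from the command echoed in the deployment output further down; treat it as a sketch (the `--set` image tag override visible in that output is omitted here, and flags may differ in your environment):

```bash
#!/usr/bin/env bash
set -ex
set -o pipefail

# Add and refresh the Nutanix helm chart repository
helm repo add ntnx-charts https://nutanix.github.io/helm-releases
helm repo update ntnx-charts

# Deploy nai-core with the environment-specific values file
helm upgrade --install nai-core ntnx-charts/nai-core \
  --version=${NAI_CORE_VERSION} \
  -n nai-system --create-namespace --wait \
  --insecure-skip-tls-verify \
  -f ${ENVIRONMENT}-values.yaml
```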
- Run the following command to deploy NAI:
```
$HOME/airgap-nai/nai-deploy.sh
+ set -o pipefail
+ helm repo add ntnx-charts https://nutanix.github.io/helm-releases
"ntnx-charts" already exists with the same configuration, skipping
+ helm repo update ntnx-charts
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ntnx-charts" chart repository
Update Complete. ⎈Happy Helming!⎈
helm upgrade --install nai-core ntnx-charts/nai-core --version=$NAI_CORE_VERSION -n nai-system --create-namespace --wait \
  --set naiApi.naiApiImage.tag=v1.0.0-rc2 \
  --insecure-skip-tls-verify \
  -f nkp-values.yaml
Release "nai-core" has been upgraded. Happy Helming!
NAME: nai-core
LAST DEPLOYED: Mon Sep 16 22:07:24 2024
NAMESPACE: nai-system
STATUS: deployed
REVISION: 7
TEST SUITE: None
```
- Verify that the NAI Core Pods are running and healthy:
```
$ kubens nai-system
✔ Active namespace is "nai-system"

$ kubectl get po,deploy
NAME                                            READY   STATUS      RESTARTS   AGE
pod/nai-api-55c665dd67-746b9                    1/1     Running     0          5d1h
pod/nai-api-db-migrate-fdz96-xtmxk              0/1     Completed   0          40h
pod/nai-db-789945b4df-lb4sd                     1/1     Running     0          43h
pod/nai-iep-model-controller-84ff5b5b87-6jst9   1/1     Running     0          5d8h
pod/nai-ui-7fc65fc6ff-clcjl                     1/1     Running     0          5d8h
pod/prometheus-nai-0                            2/2     Running     0          43h

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nai-api                    1/1     1            1           5d8h
deployment.apps/nai-db                     1/1     1            1           5d8h
deployment.apps/nai-iep-model-controller   1/1     1            1           5d8h
deployment.apps/nai-ui                     1/1     1            1           5d8h
```
Install SSL Certificate
In this section we will install an SSL certificate to access the NAI UI. This is required because the endpoint only works over HTTPS with a valid certificate.
The NAI UI is accessible through the Ingress Gateway.
The following steps show how cert-manager can be used to generate a self-signed certificate using the default `selfsigned-issuer` present in the cluster.
If you are using a Public Certificate Authority (CA) for the NAI SSL Certificate
If your organization generates certificates using a different mechanism, obtain the certificate and key and create a Kubernetes secret manually using the following command:
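A hedged example, assuming the certificate and key are saved locally as `cert.pem` and `key.pem` (hypothetical file names); the secret name and namespace match those used by the self-signed certificate below:

```bash
# Create the TLS secret that the ingress gateway will reference
kubectl create secret tls nai-cert \
  --cert=cert.pem \
  --key=key.pem \
  -n istio-system
```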
In that case, skip the steps in this section that create a self-signed certificate resource.
- Get the Ingress host using the following command:
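A common way to do this (a sketch, assuming the `istio-ingressgateway` service exposes a LoadBalancer IP):

```bash
export INGRESS_HOST=$(kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```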
- Get the value of the `INGRESS_HOST` environment variable:
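```bash
$ echo ${INGRESS_HOST}
10.x.x.216
```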
- We will use the command output (e.g., `10.x.x.216`) as the IP address for NAI, as reserved in this section.
- Construct the FQDN of the NAI UI using nip.io; we will use this FQDN as the certificate's Common Name (CN).
- Create the ingress certificate resource using the following command:
```bash
cat << EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nai-cert
  namespace: istio-system
spec:
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
  secretName: nai-cert
  commonName: nai.${INGRESS_HOST}.nip.io
  dnsNames:
    - nai.${INGRESS_HOST}.nip.io
  ipAddresses:
    - ${INGRESS_HOST}
EOF
```
- Verify that the certificate has been created:
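A hedged check using standard cert-manager resources:

```bash
# The Certificate should report READY=True, and the TLS secret should exist
kubectl get certificate nai-cert -n istio-system
kubectl get secret nai-cert -n istio-system
```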
- Patch the ingress gateway to use the certificate.
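A sketch of such a patch, assuming the Knative default gateway `knative-ingress-gateway` in the `knative-serving` namespace (gateway name and namespace may differ in your cluster):

```bash
# Point the HTTPS server of the gateway at the nai-cert TLS secret
kubectl patch gateway knative-ingress-gateway -n knative-serving --type merge -p '{
  "spec": {
    "servers": [{
      "hosts": ["*"],
      "port": {"name": "https", "number": 443, "protocol": "HTTPS"},
      "tls": {"mode": "SIMPLE", "credentialName": "nai-cert"}
    }]
  }
}'
```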
Accessing the UI
- In a browser, open the following URL to connect to the NAI UI:
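Based on the nip.io FQDN constructed earlier:

```
https://nai.${INGRESS_HOST}.nip.io
```

e.g., `https://nai.10.x.x.216.nip.io`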
- Change the password for the `admin` user.
- Login using the `admin` user and password.
Download Model
We will download and use the Llama 3.1 8B model, which we sized for in the previous section.
- In the NAI GUI, go to Models
- Click on Import Model from Hugging Face
- Choose the `meta-llama/Meta-Llama-3.1-8B-Instruct` model
- Input your Hugging Face token that was created in the previous section and click Import
- Provide the Model Instance Name as `Meta-Llama-3.1-8B-Instruct` and click Import
- Go to the VSCode Terminal to monitor the download
Get jobs in the nai-admin namespace:

```
$ kubens nai-admin
✔ Active namespace is "nai-admin"

$ kubectl get jobs
NAME                                       COMPLETIONS   DURATION   AGE
nai-c0d6ca61-1629-43d2-b57a-9f-model-job   0/1           4m56s      4m56s
```
Validate creation of pods and PVC:

```
$ kubectl get po,pvc
NAME                                             READY   STATUS    RESTARTS   AGE
nai-c0d6ca61-1629-43d2-b57a-9f-model-job-9nmff   1/1     Running   0          4m49s

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
nai-c0d6ca61-1629-43d2-b57a-9f-pvc-claim   Bound    pvc-a63d27a4-2541-4293-b680-514b8b890fe0   28Gi       RWX            nai-nfs-storage   <unset>                 2d
```
Verify download of the model using the pod logs:

```
$ kubectl logs -f nai-c0d6ca61-1629-43d2-b57a-9f-model-job-9nmff
/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py:983: UserWarning: Not enough free disk space to download the file. The expected file size is: 0.05 MB. The target location /data/model-files only has 0.00 MB free disk space.
  warnings.warn(
tokenizer_config.json: 100%|██████████| 51.0k/51.0k [00:00<00:00, 3.26MB/s]
tokenizer.json: 100%|██████████| 9.09M/9.09M [00:00<00:00, 35.0MB/s]
model-00004-of-00004.safetensors: 100%|██████████| 1.17G/1.17G [00:12<00:00, 94.1MB/s]
model-00001-of-00004.safetensors: 100%|██████████| 4.98G/4.98G [04:23<00:00, 18.9MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 4.92G/4.92G [04:33<00:00, 18.0MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 5.00G/5.00G [04:47<00:00, 17.4MB/s]
Fetching 16 files: 100%|██████████| 16/16 [05:42<00:00, 21.43s/it]
## Successfully downloaded model_files
Deleting directory : /data/hf_cache
```
- Optional: verify the events in the namespace for the PVC creation:
```
$ k get events | awk '{print $1, $3}'
3m43s Scheduled
3m43s SuccessfulAttachVolume
3m36s Pulling
3m29s Pulled
3m29s Created
3m29s Started
3m43s SuccessfulCreate
90s   Completed
3m53s Provisioning
3m53s ExternalProvisioning
3m45s ProvisioningSucceeded
3m53s PvcCreateSuccessful
3m48s PvcNotBound
3m43s ModelProcessorJobActive
90s   ModelProcessorJobComplete
```
The model is downloaded to the Nutanix Files PVC volume.
After a successful model import, you will see it in Active status in the NAI UI under the Models menu.
Create and Test Inference Endpoint
In this section we will create an inference endpoint using the downloaded model.
- Navigate to the Inference Endpoints menu and click on the Create Endpoint button
- Fill in the following details:
    - Endpoint Name: `llama-8b`
    - Model Instance Name: `Meta-Llama-3.1-8B-Instruct`
    - Use GPUs for running the models: Checked
    - No of GPUs (per instance):
    - GPU Card: `NVIDIA-L40S` (or other available GPU)
    - No of Instances: `1`
    - API Keys: Create a new API key or use an existing one
- Click on Create
- Monitor the `nai-admin` namespace to check if the services are coming up:
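One way to watch the pods come up (a sketch):

```bash
kubectl get po -n nai-admin -w
```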
Check the events in the
nai-admin
namespace for resource usage to make sure all$ kubectl get events -n nai-admin --sort-by='.lastTimestamp' | awk '{print $1, $3, $5}' 110s FinalizerUpdate Updated 110s FinalizerUpdate Updated 110s RevisionReady Revision 110s ConfigurationReady Configuration 110s LatestReadyUpdate LatestReadyRevisionName 110s Created Created 110s Created Created 110s Created Created 110s InferenceServiceReady InferenceService 110s Created Created
- Once the services are running, check the status of the inference service:
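A hedged check using the KServe InferenceService resource (the `isvc` short name):

```bash
kubectl get isvc -n nai-admin
```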
Troubleshooting Endpoint ISVC
TGI Image and Self-signed Certificates
Only follow this procedure if the `isvc` is not starting up.
Knative Serving Image Tag Checking
From testing, we have identified that the KServe module ensures there are no container image tag discrepancies by pulling images using their SHA digests. This is done to avoid pulling images that were updated without a corresponding tag change.
We have avoided this behavior by patching the `config-deployment` config map in the `knative-serving` namespace to skip image tag checking. Check the Prepare for NAI Deployment section for more details.
```bash
# Double quotes are used so that ${REGISTRY_HOST} is expanded by the shell
kubectl patch configmap config-deployment -n knative-serving --type merge \
  -p "{\"data\":{\"registries-skipping-tag-resolving\":\"${REGISTRY_HOST}\"}}"
```
If this procedure was not followed, the `isvc` will not start up.
- If the `isvc` is not coming up, explore the events in the `nai-admin` namespace:

```
$ kubectl get isvc
NAME      URL                                          READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
llama8b   http://llama8b.nai-admin.svc.cluster.local   False

$ kubectl get events --sort-by='.lastTimestamp'
Warning   InternalError   revision/llama8b-predictor-00001   Unable to fetch image "harbor.10.x.x.111.nip.io/nkp/nutanix/nai-tgi:2.3.1-825f39d": failed to resolve image to digest: Get "https://harbor.10.x.x.111.nip.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
```
The temporary workaround is to use the TGI image's SHA digest from the container registry.
This site will be updated with resolutions for the above issues in the future.
- Note the above TGI image SHA digest from the container registry:
```
$ docker pull harbor.10.x.x.111.nip.io/nkp/nutanix/nai-tgi:2.3.1-825f39d
2.3.1-825f39d: Pulling from nkp/nutanix/nai-tgi
Digest: sha256:2df9fab2cf86ab54c2e42959f23e6cfc5f2822a014d7105369aa6ddd0de33006
Status: Image is up to date for harbor.10.x.x.111.nip.io/nkp/nutanix/nai-tgi:2.3.1-825f39d
harbor.10.x.x.111.nip.io/nkp/nutanix/nai-tgi:2.3.1-825f39d
```
- The SHA digest will look like the following:
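From the pull output above:

```
sha256:2df9fab2cf86ab54c2e42959f23e6cfc5f2822a014d7105369aa6ddd0de33006
```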
- Create a copy of the `isvc` manifest:
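A sketch, assuming the endpoint created earlier is named `llama8b`:

```bash
# Save a backup copy of the InferenceService manifest before editing
kubectl get isvc llama8b -n nai-admin -o yaml > llama8b-isvc.yaml
```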
- Edit the `isvc`:
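A sketch, assuming the same endpoint name:

```bash
kubectl edit isvc llama8b -n nai-admin
```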
- Search and replace the `image` tag with the SHA digest from the TGI image.
- After replacing the image's SHA digest, the image value should look as follows:
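Based on the digest pulled above (note that a digest reference uses `@sha256:` instead of a `:tag`):

```yaml
image: harbor.10.x.x.111.nip.io/nkp/nutanix/nai-tgi@sha256:2df9fab2cf86ab54c2e42959f23e6cfc5f2822a014d7105369aa6ddd0de33006
```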
- Save the `isvc` configuration by writing the changes to the file and exiting the vi editor using the `:wq!` key combination.
key combination. -
Verify that the
isvc
is running
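A hedged check, assuming the same endpoint name; the `READY` column should eventually report `True`:

```bash
kubectl get isvc llama8b -n nai-admin
```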
This should resolve the issue with the TGI image.
Report Other Issues
If you are facing any other issues, please report them on the NAI LLM GitHub Repo Issues page.