Nutanix Enterprise AI LLM Scheduling Simulator - v1

Visualize how LLM model deployments are scheduled across Kubernetes node pools based on compute and GPU requirements

LLM Model Configuration

Model Source:
Model Source Options Filter models by their source repository:
  • HuggingFace: Models from HuggingFace Hub with standard implementations
  • NVIDIA: Optimized models from NVIDIA NGC catalog, often with specialized containers and NIM runtime
Pre-validated Models When enabled, only shows models that have been pre-validated for production deployments on Kubernetes.
Pre-validated models have undergone comprehensive testing for:
  • Stability under various load patterns
  • Performance optimization for target hardware
  • Resource usage accuracy
  • Compatibility with Kubernetes scheduling
Model Selection Choose the LLM model to deploy. Each model has different resource requirements, capabilities, and hardware compatibility.
Key model attributes:
  • Parameter size (B = billion)
  • Storage requirements
  • Context length (maximum tokens)
  • Required GPU types and counts
Model sources:
  • HF icon: Available from HuggingFace
  • NGC icon: Available from NVIDIA NGC
  • ✓ mark: Pre-validated for production
Inference Engine Software framework that executes the model and handles inference requests.
Available engines:
  • TGI: Text Generation Inference - Good stability, supports advanced features like LoRA
  • vLLM: Optimized for high throughput and continuous batching with PagedAttention
  • NVIDIA NIM: NVIDIA's optimized inference microservices for NVIDIA models
Each engine has different resource needs, throughput characteristics, and feature support.
GPU Device Type The GPU hardware used for model inference. Different models have different compatibility and performance characteristics across GPU types.
Available GPU types:
  • A100-80G: High performance, 80GB VRAM, good for most models up to 70B
  • H100-80G: Latest generation, 80GB VRAM, best performance for all models
  • L40S-48G: Mid-tier, 48GB VRAM, economical for models up to 13B
  • H100-NVL-94G: Enhanced H100 with 94GB VRAM, highest capacity for large models
Larger models require more GPU memory. The list shows only GPUs compatible with your selected model.
GPU Count Override Manually set the number of GPUs used for model deployment, overriding the model's default recommendation.
Key considerations:
  • Higher GPU counts improve throughput for high concurrency workloads
  • Some models require minimum GPU counts based on model size
  • Using more GPUs than needed wastes resources
  • Using fewer GPUs than recommended can cause OOM errors
The dropdown shows only GPU counts that are technically feasible for the selected model and GPU type.
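To see why a minimum GPU count follows from model size, here is a rough sketch; the fp16 assumption and the omission of KV-cache and runtime overhead are simplifications for illustration, not the simulator's actual rule:

  import math

  # Hypothetical sketch: estimate the minimum GPU count needed just to hold a
  # model's weights in VRAM. Assumes fp16 weights (~2 GB per billion
  # parameters) and ignores KV-cache and runtime overhead.
  GPU_VRAM_GB = {"A100-80G": 80, "H100-80G": 80, "L40S-48G": 48, "H100-NVL-94G": 94}

  def min_gpus(params_billion: float, gpu_type: str) -> int:
      weights_gb = params_billion * 2
      return math.ceil(weights_gb / GPU_VRAM_GB[gpu_type])

  print(min_gpus(70, "H100-80G"))  # 2 -- a 70B model needs at least two 80 GB GPUs
  print(min_gpus(7, "L40S-48G"))   # 1 -- a 7B model fits on a single L40S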

Engine Resource Multipliers

Engine Resource Multipliers Fine-tune how resource requirements are calculated based on the inference engine.
How it works: Base Resource × Engine Multiplier = Final Resource Requirement
Example: If a model needs 10 CPU cores and you use vLLM with a 1.2 multiplier, the final requirement will be 12 CPU cores.
Typical multiplier ranges:
  • TGI: CPU: 0.8-1.0, Memory: 1.0-1.2
  • vLLM: CPU: 1.0-1.5, Memory: 1.2-1.8
  • NVIDIA NIM: CPU: 0.7-1.0, Memory: 0.7-1.0
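A minimal sketch of how such a multiplier table might be applied in code, using illustrative midpoint values from the ranges above:

  # Illustrative midpoint multipliers drawn from the ranges listed above.
  ENGINE_MULTIPLIERS = {
      "TGI":        {"cpu": 0.9,  "memory": 1.1},
      "vLLM":       {"cpu": 1.2,  "memory": 1.5},
      "NVIDIA NIM": {"cpu": 0.85, "memory": 0.85},
  }

  def apply_multipliers(base_cpu: float, base_mem_gb: float, engine: str):
      m = ENGINE_MULTIPLIERS[engine]
      return base_cpu * m["cpu"], base_mem_gb * m["memory"]

  # The worked example above: 10 base CPU cores with a 1.2x vLLM CPU multiplier -> 12 cores
  print(apply_multipliers(10, 40, "vLLM"))  # (12.0, 60.0)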
TGI
TGI Resource Multipliers Text Generation Inference typically has moderate resource requirements.
Recommended settings:
  • CPU: 0.8-1.0 (less CPU-intensive)
  • Memory: 1.0-1.2 (standard memory usage)
vLLM
vLLM Resource Multipliers vLLM uses PagedAttention and requires more resources for optimal performance.
Recommended settings:
  • CPU: 1.2-1.5 (higher for KV cache management)
  • Memory: 1.3-1.8 (needs extra memory for paging)
NVIDIA NIM (NVIDIA models only)
NVIDIA NIM Resource Multipliers NVIDIA's optimized inference microservices often require fewer CPU and memory resources.
Recommended settings:
  • CPU: 0.7-0.9 (highly optimized CPU usage)
  • Memory: 0.7-0.9 (memory-efficient implementations)
Only available for NVIDIA-sourced models from NGC catalog.
Number of Pods (Replicas) The number of identical model instances to deploy across the cluster.
When to use multiple pods:
  • Horizontal scaling for high throughput
  • High availability and fault tolerance
  • Geographic distribution (when simulating multiple clusters)
Performance impact: Each pod requires its own dedicated resources. Multiple pods provide linear throughput scaling but don't improve single-request latency.
Deploy Model Deploy the configured model to the Kubernetes cluster with the specified resources.
The deployment process:
  • Calculate exact resource requirements
  • Create Kubernetes pod specifications
  • Schedule pods based on available node resources
  • Track deployment status and resource allocation
Pods will be placed on nodes with sufficient resources and compatible GPU types. If resources are insufficient, pods will remain in a Pending state.
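For orientation, a single replica produced by such a deployment would request resources roughly like the sketch below; this is a hand-written illustration of a Kubernetes pod manifest expressed as a Python dict, and the name, labels, and image are examples rather than the simulator's output:

  # Illustrative pod spec for one replica of a large model on 2 GPUs.
  # GPU capacity is requested via the nvidia.com/gpu extended resource, which
  # the Kubernetes scheduler accounts for just like CPU and memory.
  pod_spec = {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "llama-3-1-70b-replica-0", "labels": {"app": "llm-inference"}},
      "spec": {
          "containers": [{
              "name": "inference-engine",
              "image": "vllm/vllm-openai:latest",  # example image
              "resources": {
                  "requests": {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "2"},
                  "limits":   {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "2"},
              },
          }],
      },
  }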

Pod Resource Requirements:

Resource Calculation Method
Base requirements are determined by:
  • Model size and architecture
  • Selected GPU type compatibility
  • Inference engine optimization profile
Final requirements:
  • CPU: Base CPU × Engine Multiplier
  • Memory: Base Memory × Engine Multiplier
  • GPUs: Either the model's default for the selected GPU type, or the override value if set
  • Storage: Fixed requirement based on model size
These requirements determine where pods can be scheduled in the cluster and how many can run simultaneously.
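A sketch of how these pieces might combine into a per-pod requirement record; the base figures, multiplier values, and field names here are placeholders, not the simulator's internal tables:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class PodRequirements:
      cpu_cores: float
      memory_gb: float
      gpu_count: int
      gpu_type: str
      storage_gb: int

  def pod_requirements(base_cpu, base_mem_gb, cpu_mult, mem_mult,
                       default_gpus, gpu_type, model_storage_gb,
                       gpu_override: Optional[int] = None) -> PodRequirements:
      return PodRequirements(
          cpu_cores=base_cpu * cpu_mult,            # CPU: base x engine multiplier
          memory_gb=base_mem_gb * mem_mult,         # Memory: base x engine multiplier
          gpu_count=gpu_override or default_gpus,   # override wins when set
          gpu_type=gpu_type,
          storage_gb=model_storage_gb,              # fixed by model size
      )

  # Placeholder example: base 12 cores / 48 GB with vLLM-style multipliers
  print(pod_requirements(12, 48, 1.3, 1.4, 2, "H100-80G", 290))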

  • CPU: 16 cores
  • Memory: 64 GB
  • GPU: 2 × NVIDIA H100-80G
  • Storage: 290 GB
  • Engine: vLLM
  • Context Length: 128K

Cluster Visualization

Node Pool Management
Node Pool Management Configure groups of similar compute nodes (node pools) to create your Kubernetes cluster infrastructure.
Node pools allow you to:
  • Create heterogeneous clusters with different hardware types
  • Separate workloads based on resource requirements
  • Simulate real-world Kubernetes deployments
Each node pool defines a set of identical nodes with the same CPU, memory, GPU type, and count.
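Conceptually, each pool is a small record that expands into identical nodes, along the lines of this sketch; the field names and the GPU type in the example are chosen for illustration only:

  from dataclasses import dataclass

  @dataclass
  class NodePool:
      name: str
      node_count: int
      gpus_per_node: int
      gpu_type: str          # e.g. "H100-80G", or "CPU-Only" for no GPUs
      cpu_cores_per_node: int
      memory_gb_per_node: int

      def totals(self) -> dict:
          # Aggregate capacity this pool contributes to the cluster
          return {
              "nodes": self.node_count,
              "gpus": self.node_count * self.gpus_per_node,
              "cpu_cores": self.node_count * self.cpu_cores_per_node,
              "memory_gb": self.node_count * self.memory_gb_per_node,
          }

  # Reproduces the "Total" summary shown for Node Pool 1 below:
  print(NodePool("node-pool-1", 3, 2, "H100-80G", 32, 128).totals())
  # {'nodes': 3, 'gpus': 6, 'cpu_cores': 96, 'memory_gb': 384}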

Node Pool 1

Default compute node pool

Number of Nodes Total count of identical nodes in this pool.
Each node represents a physical or virtual machine with its own CPU, memory, and GPU resources. Higher node counts provide more total cluster capacity for running model deployments.
GPUs per Node Number of GPU devices available on each node in this pool.
Common configurations:
  • 0: CPU-only nodes
  • 1-2: Entry-level GPU nodes
  • 4-8: High-performance computing nodes
The scheduler will ensure models requiring N GPUs are placed on nodes with at least N available GPUs.
GPU Device Type The specific GPU hardware model for nodes in this pool.
Options:
  • A100-80G: High performance, 80GB VRAM
  • H100-80G: Latest generation, 80GB VRAM
  • L40S-48G: Mid-tier, 48GB VRAM
  • H100-NVL-94G: Enhanced H100 with 94GB VRAM
  • CPU-Only: No GPUs attached
The scheduler ensures models are only placed on nodes with compatible GPU types.
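A simplified sketch of the fit check implied here; real Kubernetes scheduling also weighs taints, affinity rules, and scoring, which are omitted:

  # Simplified fit check: a pod can land on a node only if the node has a
  # compatible GPU type and enough free CPU, memory, and GPUs remaining.
  def fits(node_free: dict, node_gpu_type: str, pod_req: dict) -> bool:
      if pod_req["gpu_count"] > 0 and node_gpu_type != pod_req["gpu_type"]:
          return False                       # incompatible GPU hardware
      return (node_free["cpu"]    >= pod_req["cpu"] and
              node_free["memory"] >= pod_req["memory"] and
              node_free["gpus"]   >= pod_req["gpu_count"])

  node = {"cpu": 64, "memory": 256, "gpus": 4}
  pod = {"cpu": 16, "memory": 64, "gpu_count": 2, "gpu_type": "H100-80G"}
  print(fits(node, "H100-80G", pod))   # True
  print(fits(node, "L40S-48G", pod))   # False -- wrong GPU type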
CPU Cores per Node Number of CPU cores available on each node in this pool.
Recommendations:
  • CPU-only nodes: 16-64 cores
  • GPU nodes: 32-128 cores
Even on GPU nodes, sufficient CPU cores are needed for preprocessing, postprocessing, and managing concurrent requests; on CPU-only nodes, the CPUs also perform inference itself.
Memory per Node Amount of RAM available on each node in this pool, measured in gigabytes.
Recommendations:
  • CPU-only nodes: 32-256 GB
  • GPU nodes: 64-512 GB
While GPU memory holds the model weights, system RAM is needed for input/output processing, batching, and the inference engine runtime.
Node Pool Label Optional label to identify this node pool for node selection.
Common labels:
  • gpu-pool: General GPU nodes
  • cpu-pool: CPU-only nodes
  • high-memory: Memory-optimized nodes
  • a100-pool: For specific GPU type
Labels can be used with node affinity/anti-affinity rules to control which workloads run on which nodes.
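In an actual cluster, a pool label typically becomes a nodeSelector (or node affinity term) on the pod, roughly as in this sketch; the label key and value are examples only:

  # Illustrative: constrain a pod to nodes carrying the "a100-pool" label.
  pod_spec_fragment = {
      "spec": {
          "nodeSelector": {"pool": "a100-pool"},  # must match a label on the target nodes
      }
  }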
Total: 3 nodes, 6 GPUs, 96 CPU cores, 384 GB memory
Cluster Visualization This area shows all nodes in the cluster and their current resource usage.
Node components:
  • Node ID and hardware type
  • Resource utilization bars (CPU, Memory, GPU)
  • Deployed pods with provider and resource information
Interactions:
  • Click on a pod's × button to delete it
  • Deploy models using the configuration panel
  • Reset the cluster to clear all pods

Console Output
Console Output Displays log messages and scheduling events from the simulator.
Output modes:
  • Standard Mode: Chronological event logs with timestamps
  • Text UI Mode: Structured textual view of cluster state
Use the console to track scheduling decisions, resource allocations, and error conditions.

Testing Examples

Click a test to view its details or apply it

CPU-Only Model

Gemma 2B on CPU nodes

Large GPU Model

Llama 70B on H100 GPUs

GPU Compatibility

Hardware compatibility testing

MoE Model

Mixtral 8x7B with advanced settings

Mixed Workload

Multiple model types and node pools

Resource Constraints

Testing scheduler with limited resources

NVIDIA NIM RAG Pipeline

Pre-validated NIM 70B RAG stack

Large GPU Model (Llama 70B)

Tests deployment of a large model (Meta Llama 3.1 70B) requiring multiple H100 GPUs per pod.

Configuration

  • Model: Meta Llama 3.1 70B (290GB)
  • GPU Type: H100-80G
  • Inference Engine: vLLM
  • Replicas: 2

Recommended Node Pool

  • 2 nodes with H100-80G GPUs
  • 4 GPUs per node
  • 64 CPU cores per node
  • 256 GB memory per node

Expected Outcome

Each pod requires 2 H100 GPUs, 16 CPU cores, and 64 GB memory. With the recommended node pool, each node can run 2 pods, allowing both replicas to be scheduled.
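The pods-per-node figure follows from straightforward capacity division, as in this sketch:

  # Per-node capacity of the recommended pool vs. per-pod requirements.
  node = {"gpus": 4, "cpu_cores": 64, "memory_gb": 256}
  pod  = {"gpus": 2, "cpu_cores": 16, "memory_gb": 64}

  pods_per_node = min(node[k] // pod[k] for k in pod)  # min(2, 4, 4) -> GPUs are the limit
  print(pods_per_node)      # 2 pods per node
  print(2 * pods_per_node)  # 4 pod slots across 2 nodes, so both replicas schedule easily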

Key Learning Points

  • Large models (> 65B parameters) typically require multiple GPUs per pod
  • High-end GPUs like H100s provide better performance for large models
  • Pod resource requirements scale with model size
  • Node capacity determines how many pods can be scheduled per node
  • The vLLM inference engine optimizes memory usage and throughput