Nutanix Enterprise AI LLM Scheduling Simulator - v1

Visualize how LLM model deployments are scheduled across Kubernetes node pools based on compute and GPU requirements

LLM Model Configuration

Model Source:
Model Source Options Filter models by their source repository:
  • HuggingFace: Models from HuggingFace Hub with standard implementations
  • NVIDIA: Optimized models from NVIDIA NGC catalog, often with specialized containers and NIM runtime
Pre-validated Models When enabled, only shows models that have been pre-validated for production deployments on Kubernetes.
Pre-validated models have undergone comprehensive testing for:
  • Stability under various load patterns
  • Performance optimization for target hardware
  • Resource usage accuracy
  • Compatibility with Kubernetes scheduling
Model Selection Choose the LLM model to deploy. Each model has different resource requirements, capabilities, and hardware compatibility.
Key model attributes:
  • Parameter size (B = billion)
  • Storage requirements
  • Context length (maximum tokens)
  • Required GPU types and counts
Model sources:
  • HF icon: Available from HuggingFace
  • NGC icon: Available from NVIDIA NGC
  • ✓ mark: Pre-validated for production
Inference Engine Software framework that executes the model and handles inference requests.
Available engines:
  • TGI: Text Generation Inference - Good stability, supports advanced features like LoRA
  • vLLM: Optimized for high throughput and continuous batching with PagedAttention
  • NVIDIA NIM: NVIDIA's optimized inference microservices for NVIDIA models
Each engine has different resource needs, throughput characteristics, and feature support.
GPU Device Type The GPU hardware used for model inference. Different models have different compatibility and performance characteristics across GPU types.
Available GPU types:
  • A100-80G: High performance, 80GB VRAM, good for most models up to 70B
  • H100-80G: Latest generation, 80GB VRAM, best performance for all models
  • L40S-48G: Mid-tier, 48GB VRAM, economical for models up to 13B
  • H100-NVL-94G: Enhanced H100 with 94GB VRAM, highest capacity for large models
Larger models require more GPU memory. The list shows only GPUs compatible with your selected model.
GPU Count Override Manually set the number of GPUs used for model deployment, overriding the model's default recommendation.
Key considerations:
  • Higher GPU counts improve throughput for high concurrency workloads
  • Some models require minimum GPU counts based on model size
  • Using more GPUs than needed wastes resources
  • Using fewer GPUs than recommended can cause OOM errors
The dropdown shows only GPU counts that are technically feasible for the selected model and GPU type.
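To see why a minimum GPU count follows from model size, here is a rough sketch; the fp16 assumption and the omission of KV-cache and runtime overhead are simplifications for illustration, not the simulator's actual rule:

  import math

  # Hypothetical sketch: estimate the minimum GPU count needed just to hold a
  # model's weights in VRAM. Assumes fp16 weights (~2 GB per billion
  # parameters) and ignores KV-cache and runtime overhead.
  GPU_VRAM_GB = {"A100-80G": 80, "H100-80G": 80, "L40S-48G": 48, "H100-NVL-94G": 94}

  def min_gpus(params_billion: float, gpu_type: str) -> int:
      weights_gb = params_billion * 2
      return math.ceil(weights_gb / GPU_VRAM_GB[gpu_type])

  print(min_gpus(70, "H100-80G"))  # 2 -- a 70B model needs at least two 80 GB GPUs
  print(min_gpus(7, "L40S-48G"))   # 1 -- a 7B model fits on a single L40S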

Engine Resource Multipliers

Engine Resource Multipliers Fine-tune how resource requirements are calculated based on the inference engine.
How it works: Base Resource × Engine Multiplier = Final Resource Requirement
Example: If a model needs 10 CPU cores and you use vLLM with a 1.2 multiplier, the final requirement will be 12 CPU cores.
Typical multiplier ranges:
  • TGI: CPU: 0.8-1.0, Memory: 1.0-1.2
  • vLLM: CPU: 1.0-1.5, Memory: 1.2-1.8
  • NVIDIA NIM: CPU: 0.7-1.0, Memory: 0.7-1.0
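A minimal sketch of how such a multiplier table might be applied in code, using illustrative midpoint values from the ranges above:

  # Illustrative midpoint multipliers drawn from the ranges listed above.
  ENGINE_MULTIPLIERS = {
      "TGI":        {"cpu": 0.9,  "memory": 1.1},
      "vLLM":       {"cpu": 1.2,  "memory": 1.5},
      "NVIDIA NIM": {"cpu": 0.85, "memory": 0.85},
  }

  def apply_multipliers(base_cpu: float, base_mem_gb: float, engine: str):
      m = ENGINE_MULTIPLIERS[engine]
      return base_cpu * m["cpu"], base_mem_gb * m["memory"]

  # The worked example above: 10 base CPU cores with a 1.2x vLLM CPU multiplier -> 12 cores
  print(apply_multipliers(10, 40, "vLLM"))  # (12.0, 60.0)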
TGI
TGI Resource Multipliers Text Generation Inference typically has moderate resource requirements.
Recommended settings:
  • CPU: 0.8-1.0 (less CPU-intensive)
  • Memory: 1.0-1.2 (standard memory usage)
vLLM
vLLM Resource Multipliers vLLM uses PagedAttention and requires more resources for optimal performance.
Recommended settings:
  • CPU: 1.2-1.5 (higher for KV cache management)
  • Memory: 1.3-1.8 (needs extra memory for paging)
NVIDIA NIM (NVIDIA models only)
NVIDIA NIM Resource Multipliers NVIDIA's optimized inference microservices often require fewer CPU and memory resources.
Recommended settings:
  • CPU: 0.7-0.9 (highly optimized CPU usage)
  • Memory: 0.7-0.9 (memory-efficient implementations)
Only available for NVIDIA-sourced models from NGC catalog.
Number of Pods (Replicas) The number of identical model instances to deploy across the cluster.
When to use multiple pods:
  • Horizontal scaling for high throughput
  • High availability and fault tolerance
  • Geographic distribution (when simulating multiple clusters)
Performance impact: Each pod requires its own dedicated resources. Multiple pods provide linear throughput scaling but don't improve single-request latency.
Deploy Model Deploy the configured model to the Kubernetes cluster with the specified resources.
The deployment process:
  • Calculate exact resource requirements
  • Create Kubernetes pod specifications
  • Schedule pods based on available node resources
  • Track deployment status and resource allocation
Pods will be placed on nodes with sufficient resources and compatible GPU types. If resources are insufficient, pods will remain in a Pending state.
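For orientation, a single replica produced by such a deployment would request resources roughly like the sketch below; this is a hand-written illustration of a Kubernetes pod manifest expressed as a Python dict, and the name, labels, and image are examples rather than the simulator's output:

  # Illustrative pod spec for one replica of a large model on 2 GPUs.
  # GPU capacity is requested via the nvidia.com/gpu extended resource, which
  # the Kubernetes scheduler accounts for just like CPU and memory.
  pod_spec = {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "llama-3-1-70b-replica-0", "labels": {"app": "llm-inference"}},
      "spec": {
          "containers": [{
              "name": "inference-engine",
              "image": "vllm/vllm-openai:latest",  # example image
              "resources": {
                  "requests": {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "2"},
                  "limits":   {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "2"},
              },
          }],
      },
  }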

Pod Resource Requirements:

Resource Calculation Method
Base requirements are determined by:
  • Model size and architecture
  • Selected GPU type compatibility
  • Inference engine optimization profile
Final requirements:
  • CPU: Base CPU × Engine Multiplier
  • Memory: Base Memory × Engine Multiplier
  • GPUs: Either the model's default for the selected GPU type, or the override value if set
  • Storage: Fixed requirement based on model size
These requirements determine where pods can be scheduled in the cluster and how many can run simultaneously.
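A sketch of how these pieces might combine into a per-pod requirement record; the base figures, multiplier values, and field names here are placeholders, not the simulator's internal tables:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class PodRequirements:
      cpu_cores: float
      memory_gb: float
      gpu_count: int
      gpu_type: str
      storage_gb: int

  def pod_requirements(base_cpu, base_mem_gb, cpu_mult, mem_mult,
                       default_gpus, gpu_type, model_storage_gb,
                       gpu_override: Optional[int] = None) -> PodRequirements:
      return PodRequirements(
          cpu_cores=base_cpu * cpu_mult,            # CPU: base x engine multiplier
          memory_gb=base_mem_gb * mem_mult,         # Memory: base x engine multiplier
          gpu_count=gpu_override or default_gpus,   # override wins when set
          gpu_type=gpu_type,
          storage_gb=model_storage_gb,              # fixed by model size
      )

  # Placeholder example: base 12 cores / 48 GB with vLLM-style multipliers
  print(pod_requirements(12, 48, 1.3, 1.4, 2, "H100-80G", 290))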

  • CPU: 16 cores
  • Memory: 64 GB
  • GPU: 2 × NVIDIA H100-80G
  • Storage: 290 GB
  • Engine: vLLM
  • Context Length: 128K

Cluster Visualization

Node Pool Management
Node Pool Management Configure groups of similar compute nodes (node pools) to create your Kubernetes cluster infrastructure.
Node pools allow you to:
  • Create heterogeneous clusters with different hardware types
  • Separate workloads based on resource requirements
  • Simulate real-world Kubernetes deployments
Each node pool defines a set of identical nodes with the same CPU, memory, GPU type, and count.
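Conceptually, each pool is a small record that expands into identical nodes, along the lines of this sketch; the field names and the GPU type in the example are chosen for illustration only:

  from dataclasses import dataclass

  @dataclass
  class NodePool:
      name: str
      node_count: int
      gpus_per_node: int
      gpu_type: str          # e.g. "H100-80G", or "CPU-Only" for no GPUs
      cpu_cores_per_node: int
      memory_gb_per_node: int

      def totals(self) -> dict:
          # Aggregate capacity this pool contributes to the cluster
          return {
              "nodes": self.node_count,
              "gpus": self.node_count * self.gpus_per_node,
              "cpu_cores": self.node_count * self.cpu_cores_per_node,
              "memory_gb": self.node_count * self.memory_gb_per_node,
          }

  # Reproduces the "Total" summary shown for Node Pool 1 below:
  print(NodePool("node-pool-1", 3, 2, "H100-80G", 32, 128).totals())
  # {'nodes': 3, 'gpus': 6, 'cpu_cores': 96, 'memory_gb': 384}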

Node Pool 1

Default compute node pool

Number of Nodes Total count of identical nodes in this pool.
Each node represents a physical or virtual machine with its own CPU, memory, and GPU resources. Higher node counts provide more total cluster capacity for running model deployments.
GPUs per Node Number of GPU devices available on each node in this pool.
Common configurations:
  • 0: CPU-only nodes
  • 1-2: Entry-level GPU nodes
  • 4-8: High-performance computing nodes
The scheduler will ensure models requiring N GPUs are placed on nodes with at least N available GPUs.
GPU Device Type The specific GPU hardware model for nodes in this pool.
Options:
  • A100-80G: High performance, 80GB VRAM
  • H100-80G: Latest generation, 80GB VRAM
  • L40S-48G: Mid-tier, 48GB VRAM
  • H100-NVL-94G: Enhanced H100 with 94GB VRAM
  • CPU-Only: No GPUs attached
The scheduler ensures models are only placed on nodes with compatible GPU types.
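A simplified sketch of the fit check implied here; real Kubernetes scheduling also weighs taints, affinity rules, and scoring, which are omitted:

  # Simplified fit check: a pod can land on a node only if the node has a
  # compatible GPU type and enough free CPU, memory, and GPUs remaining.
  def fits(node_free: dict, node_gpu_type: str, pod_req: dict) -> bool:
      if pod_req["gpu_count"] > 0 and node_gpu_type != pod_req["gpu_type"]:
          return False                       # incompatible GPU hardware
      return (node_free["cpu"]    >= pod_req["cpu"] and
              node_free["memory"] >= pod_req["memory"] and
              node_free["gpus"]   >= pod_req["gpu_count"])

  node = {"cpu": 64, "memory": 256, "gpus": 4}
  pod = {"cpu": 16, "memory": 64, "gpu_count": 2, "gpu_type": "H100-80G"}
  print(fits(node, "H100-80G", pod))   # True
  print(fits(node, "L40S-48G", pod))   # False -- wrong GPU type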
CPU Cores per Node Number of CPU cores available on each node in this pool.
Recommendations:
  • CPU-only nodes: 16-64 cores
  • GPU nodes: 32-128 cores
Even on GPU nodes, sufficient CPU cores are needed for preprocessing, postprocessing, and managing concurrent requests; on CPU-only nodes, the CPUs also perform inference itself.
Memory per Node Amount of RAM available on each node in this pool, measured in gigabytes.
Recommendations:
  • CPU-only nodes: 32-256 GB
  • GPU nodes: 64-512 GB
While GPU memory holds the model weights, system RAM is needed for input/output processing, batching, and the inference engine runtime.
Node Pool Label Optional label to identify this node pool for node selection.
Common labels:
  • gpu-pool: General GPU nodes
  • cpu-pool: CPU-only nodes
  • high-memory: Memory-optimized nodes
  • a100-pool: For specific GPU type
Labels can be used with node affinity/anti-affinity rules to control which workloads run on which nodes.
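In an actual cluster, a pool label typically becomes a nodeSelector (or node affinity term) on the pod, roughly as in this sketch; the label key and value are examples only:

  # Illustrative: constrain a pod to nodes carrying the "a100-pool" label.
  pod_spec_fragment = {
      "spec": {
          "nodeSelector": {"pool": "a100-pool"},  # must match a label on the target nodes
      }
  }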
Total: 3 nodes, 6 GPUs, 96 CPU cores, 384 GB memory
Cluster Visualization This area shows all nodes in the cluster and their current resource usage.
Node components:
  • Node ID and hardware type
  • Resource utilization bars (CPU, Memory, GPU)
  • Deployed pods with provider and resource information
Interactions:
  • Click on a pod's × button to delete it
  • Deploy models using the configuration panel
  • Reset the cluster to clear all pods

Console Output
Console Output Displays log messages and scheduling events from the simulator.
Output modes:
  • Standard Mode: Chronological event logs with timestamps
  • Text UI Mode: Structured textual view of cluster state
Use the console to track scheduling decisions, resource allocations, and error conditions.

Testing Examples

Click a test to view its details or apply it

CPU-Only Model

Gemma 2B on CPU nodes

Large GPU Model

Llama 70B on H100 GPUs

GPU Compatibility

Hardware compatibility testing

MoE Model

Mixtral 8x7B with advanced settings

Mixed Workload

Multiple model types and node pools

Resource Constraints

Testing scheduler with limited resources

NVIDIA NIM RAG Pipeline

Pre-validated NIM 70B RAG stack

Large GPU Model (Llama 70B)

Tests deployment of a large model (Meta Llama 3.1 70B) requiring multiple H100 GPUs per pod.

Configuration

  • Model: Meta Llama 3.1 70B (290GB)
  • GPU Type: H100-80G
  • Inference Engine: vLLM
  • Replicas: 2

Recommended Node Pool

  • 2 nodes with H100-80G GPUs
  • 4 GPUs per node
  • 64 CPU cores per node
  • 256 GB memory per node

Expected Outcome

Each pod requires 2 H100 GPUs, 16 CPU cores, and 64 GB memory. With the recommended node pool, each node can run 2 pods, allowing both replicas to be scheduled.
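The pods-per-node figure follows from straightforward capacity division, as in this sketch:

  # Per-node capacity of the recommended pool vs. per-pod requirements.
  node = {"gpus": 4, "cpu_cores": 64, "memory_gb": 256}
  pod  = {"gpus": 2, "cpu_cores": 16, "memory_gb": 64}

  pods_per_node = min(node[k] // pod[k] for k in pod)  # min(2, 4, 4) -> GPUs are the limit
  print(pods_per_node)      # 2 pods per node
  print(2 * pods_per_node)  # 4 pod slots across 2 nodes, so both replicas schedule easily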

Key Learning Points

  • Large models (> 65B parameters) typically require multiple GPUs per pod
  • High-end GPUs like H100s provide better performance for large models
  • Pod resource requirements scale with model size
  • Node capacity determines how many pods can be scheduled per node
  • The vLLM inference engine optimizes memory usage and throughput