AI Infrastructure Engineer

Become an AI Infrastructure Engineer


Program Plan (Total: 34 weeks) — with Books per Phase & NVIDIA Certs in Phase 3

1) 6 weeks

Linux performance and troubleshooting:

Linux Under the Hood, 2nd Edition
Study guide: RH342 Red Hat Enterprise Linux Diagnostics and Troubleshooting
Study guide: Red Hat Certified Specialist in Performance Tuning (EX442)
EX442 exam - TAKE

Books for this phase:

  • Systems Performance, 2nd Edition - Brendan Gregg
    • Chapter 2. Methodologies
    • Chapter 3. Operating Systems
    • Chapter 4. Observability Tools
    • Chapter 5. Applications
    • Chapter 6. CPUs
    • Chapter 7. Memory
    • Chapter 8. File Systems
    • Chapter 9. Disks
    • Chapter 10. Network
    • Chapter 12. Benchmarking
    • Chapter 13. perf
    • Chapter 14. Ftrace
    • Chapter 15. BPF
  • Linux Kernel Development, 3rd Edition - Robert Love
    • Chapter 3. Process Management
    • Chapter 4. Process Scheduling
    • Chapter 5. System Calls
    • Chapter 11. Timers and Time Management
    • Chapter 12. Memory Management
    • Chapter 13. The Virtual Filesystem
    • Chapter 14. The Block I/O Layer
    • Chapter 15. The Process Address Space

2) 8 weeks

Containers, Kubernetes, and observability:

Red Hat Certified Specialist in Containers exam | EX188 - TAKE
Docker containerization specifically for machine learning applications.
Kubernetes:

  • CKA - STUDY ONLY
  • Kubernetes fundamentals, with a focus on deploying and managing ML workloads.

Observability:

  • Monitoring Systems and Services with Prometheus (LFS241)
  • Monitoring systems that track infrastructure health, application performance, and ML model behavior in real time.
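Prometheus's core idea, rates over scraped counters, can be tried without a cluster. A minimal sketch with hypothetical samples of a counter such as `http_requests_total` (real `rate()` also handles counter resets, which this skips):

```shell
# Two scrapes of a counter metric, 60 s apart (hypothetical values):
t0=1700000000; v0=8200
t1=1700000060; v1=9100
# PromQL rate() over this window = (v1 - v0) / (t1 - t0)
awk -v dv="$((v1 - v0))" -v dt="$((t1 - t0))" 'BEGIN { printf "%.1f req/s\n", dv / dt }'
```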

Books for this phase:

  • Docker Deep Dive
    • 1: Containers from 30,000 feet
    • 2: Docker and container-related standards and projects
    • 3: Getting Docker
    • 4: The big picture
    • 5: The Docker Engine
    • 6: Working with Images
    • 7: Working with containers
    • 8: Containerizing an app
    • 9: Multi-container apps with Compose
    • 10: Docker and AI
    • 13: Docker Networking
    • 14: Docker overlay networking
    • 15: Volumes and persistent data
    • 16: Docker security
  • The Kubernetes Book
    • 1: Kubernetes primer
    • 2: Kubernetes principles of operation
    • 3: Getting Kubernetes
    • 4: Working with Pods
    • 5: Virtual clusters with Namespaces
    • 6: Kubernetes Deployments
    • 7: Kubernetes Services
    • 8: Ingress
    • 10: Service discovery deep dive
    • 11: Kubernetes storage
    • 12: ConfigMaps and Secrets
    • 13: StatefulSets
    • 14: API security and RBAC
    • 15: The Kubernetes API
    • 16: Threat modeling Kubernetes
    • 17: Real-world Kubernetes security
  • (Observability focus from) Systems Performance, 2nd Edition - Brendan Gregg
    • Chapter 4. Observability Tools
    • Chapter 12. Benchmarking
    • Chapter 13. perf
    • Chapter 14. Ftrace
    • Chapter 15. BPF

3) 8 weeks

GPU infrastructure and NVIDIA certification study:

Books for this phase:

  • AI Systems Performance Engineering
    • Chapter 2. AI System Hardware Overview
    • Chapter 3. OS, Docker, and Kubernetes Tuning for GPU-Based Environments
    • Chapter 4. Tuning Distributed Networking Communication
    • Chapter 5. GPU-Based Storage I/O Optimizations
    • Chapter 6. GPU Architecture, CUDA Programming, and Maximizing Occupancy
    • Chapter 7. Profiling and Tuning GPU Memory Access Patterns
  • Generative AI on Kubernetes
    • Chapter 4. Kubernetes and GPUs
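The Kubernetes-and-GPUs chapter reduces to one key manifest idea: GPUs are requested like any other extended resource. A sketch, assuming the NVIDIA device plugin is running on the cluster; the pod name and image tag are illustrative, and the manifest is written to a file here so the request can be inspected.

```shell
# Write the manifest locally; on a cluster: kubectl apply -f gpu-pod.yaml
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # one whole GPU via the NVIDIA device plugin
EOF
grep -c 'nvidia.com/gpu' gpu-pod.yaml
```

Running `kubectl logs cuda-smoke-test` after apply should show the `nvidia-smi` table if scheduling and the device plugin are healthy.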

NVIDIA-Certified Professional (NCP) — Topic Outlines

NVIDIA-Certified Professional: AI Infrastructure (NCP-AII)

System and Server Bring-up — 31%

  • Describe sequence of events for deployment and validation.
  • Describe network topologies for AI factories.
  • Perform initial configuration of BMC, OOB, and TPM.
  • Perform firmware upgrades (including on HGX™) and fault detection.
  • Validate power and cooling parameters.
  • Install GPU-based servers (SMI).
  • Validate installed hardware.
  • Describe and validate cable types and transceivers.
  • Install physical GPUs.
  • Validate hardware operation for workloads.
  • Configure initial parameters for third-party storage.

Physical Layer Management — 5%

  • Configure and manage a BlueField® network platform.
  • Configure MIG (AI and HPC).
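MIG configuration is easier to reason about with the slice arithmetic written out. A sketch: the commented commands are the usual `nvidia-smi mig` workflow, and the geometry assumes an A100-class GPU with 7 compute slices (profile IDs vary by GPU; check `nvidia-smi mig -lgip`).

```shell
# Typical MIG workflow on a supported GPU (root required, GPU must be idle):
#   nvidia-smi -i 0 -mig 1            # enable MIG mode on GPU 0
#   nvidia-smi mig -lgip              # list GPU instance profiles and their IDs
#   nvidia-smi mig -cgi 19,19,19 -C   # create three 1g instances + compute instances
# Slice arithmetic, assuming 7 compute slices on the GPU:
total_slices=7
slices_per_1g=1
echo "max 1g instances: $((total_slices / slices_per_1g))"
```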

Control Plane Installation and Configuration — 19%

  • Install Base Command™ Manager (BCM), configure and verify HA.
  • Install OS.
  • Install Cluster (configure category, configure interfaces, install Slurm/Enroot/Pyxis).
  • Install/update/remove NVIDIA GPU and DOCA™ drivers.
  • Install the NVIDIA container toolkit.
  • Demonstrate how to use NVIDIA GPUs with Docker.
  • Install NGC™ CLI on hosts.
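Once the container toolkit is installed, GPU access in Docker comes down to the `--gpus` flag. A sketch; the real runs (commented) need a GPU host, so the runnable part only composes the command string, and `<image>` is a placeholder.

```shell
# With the NVIDIA Container Toolkit installed and Docker configured for it:
#   docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
#   docker run --rm --gpus '"device=0,1"' <image> nvidia-smi    # pin GPUs 0 and 1
# Composing the flag from an inventory variable (runnable without a GPU):
gpus="0,1"
echo "docker run --rm --gpus \"device=${gpus}\" <image> nvidia-smi"
```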

Cluster Test and Verification — 33%

  • Perform a single-node stress test.
  • Execute HPL (High-Performance Linpack).
  • Perform single-node NCCL (including verifying NVLink™ Switch).
  • Validate cables by verifying signal quality.
  • Confirm cabling is correct.
  • Confirm FW/SW on switches.
  • Confirm FW/SW on BlueField-3.
  • Confirm FW on transceivers.
  • Run ClusterKit to perform a multifaceted node assessment.
  • Run NCCL to verify E/W fabric bandwidth.
  • Perform NCCL burn-in.
  • Perform HPL burn-in.
  • Perform NeMo™ burn-in.
  • Test storage.
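The NCCL runs above report results as bus bandwidth, and knowing the conversion helps judge a burn-in. For all-reduce, nccl-tests computes busbw = algbw × 2(n−1)/n over n ranks; the sample figures below are hypothetical.

```shell
# A typical single-node run from nccl-tests (needs 8 GPUs):
#   ./all_reduce_perf -b 8 -e 8G -f 2 -g 8
# busbw = algbw * 2*(n-1)/n for all-reduce; sample numbers are hypothetical:
n=8            # ranks (GPUs)
algbw=240      # measured algorithm bandwidth, GB/s
awk -v n="$n" -v a="$algbw" 'BEGIN { printf "busbw: %.0f GB/s\n", a * 2 * (n - 1) / n }'
```

Comparing busbw (not algbw) against the link's rated bandwidth is what makes single-node and multi-node results directly comparable.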

Troubleshoot and Optimize — 12%

  • Identify and troubleshoot hardware faults (e.g., GPU, fan, network card).
  • Identify faulty cards, GPUs, and power supplies.
  • Replace faulty cards, GPUs, and power supplies.
  • Execute performance optimization for AMD and Intel servers.
  • Optimize storage.

NVIDIA-Certified Professional: AI Operations

Installation and Deployment — 31%

  • Describe the Mission Control toolkit.
  • Use BCM’s Base View interface to monitor cluster performance, resource utilization, and node health in real time.
  • Manage job scheduling and resource allocation using BCM’s workload manager (e.g., Slurm or Kubernetes).
  • Apply patches, update firmware, and synchronize software images across cluster nodes using BCM.
  • Administer user accounts, roles, and permissions to ensure secure access to the cluster using BCM.
  • Configure and monitor network settings for cluster nodes, DPUs, and switches using BCM.
  • Diagnose and resolve cluster issues, such as job failures, node outages, or resource bottlenecks, using BCM.
  • Use BCM to organize and configure compute nodes into categories based on hardware or workload requirements.
  • Using BCM, maintain documentation and generate reports on cluster usage, performance, and issues.
  • Install and initialize Kubernetes on NVIDIA hosts using BCM.
  • Deploy DOCA Services on the DPU Arm cores.
  • Install Run:ai.
  • Install Slurm.

Administration — 23%

  • Administer Slurm cluster.
  • Describe data center architecture for AI workloads.
  • Administer Run:ai.
  • Administer Kubernetes.
  • Configure MIG.

Workload Management — 23%

  • Deploy inference workloads with Kubernetes.
  • Deploy inference workloads with Run:ai.
  • Deploy training workloads with Slurm.
  • Deploy training workloads with Run:ai.
  • Use system management tools to troubleshoot issues.
  • Allocate resources between teams with Run:ai, Slurm, and Kubernetes.
  • Deploy containers from NGC.

Troubleshooting and Optimization — 23%

  • Troubleshoot Docker.
  • Troubleshoot the fabric manager service for NVLink and NVSwitch systems.
  • Troubleshoot Base Command Manager.
  • Troubleshoot Magnum IO components.
  • Troubleshoot storage performance.
  • Troubleshoot the deployment of a container from NGC.

NVIDIA-Certified Professional: AI Networking

AI Data Center Design and Optimization — 5%

  • Describe an AI factory networking architecture and its components (e.g., GPUs, BlueField, Scalable Unit (SU), switches).
  • Describe rail-optimized topologies for high-performance AI workloads.
  • Describe GPU-to-GPU communications.

NVIDIA Spectrum Networking — 30%

  • Configure NVIDIA Spectrum-X switches for RoCE (RDMA over Converged Ethernet) to enable high-speed, low-latency communication.
  • Enable and verify quality of service (QoS, ECN, PFC), advanced features (like adaptive routing), and telemetry.
  • Configure multi-tenancy BGP-EVPN to isolate tenant workloads.
  • Use NVIDIA Air to simulate network environments and identify potential issues.
  • Diagnose congestion or packet loss using in-band telemetry and NVIDIA® What Just Happened® (WJH) services.
  • Use NetQ™ for real-time network monitoring, including congestion detection and latency measurements.
  • Install DOCA™.
  • Configure SuperNIC™ functionality for advanced packet processing and congestion control.

NVIDIA InfiniBand Networking — 30%

  • Perform initial configuration and provisioning, including high availability (HA).
  • Configure partition keys (PKeys) to ensure secure multi-tenancy in InfiniBand networks.
  • Configure QoS and adaptive routing to dynamically adjust paths based on congestion.
  • Use UFM to monitor InfiniBand link status and bandwidth utilization.

Kubernetes Integration — 5%

  • Deploy the NVIDIA Network Operator to manage RDMA interfaces and InfiniBand networks within Kubernetes clusters.
  • Verify NVIDIA Network Operator functionality.

Troubleshooting Tools — 20%

  • Use tools (like cl-resource-query) to check resource allocation in Spectrum-X environments.
  • Use What Just Happened (WJH) services for real-time event analysis.
  • Verify low-latency interconnects between GPUs, CPUs, and storage systems.
  • Use UFM system health to diagnose IB issues.
  • Use commands like ib_write_lat, ib_write_bw, ibping, ibstat, ibdiagnet, ibnodes, and iblinkinfo for diagnosing connectivity issues.
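A typical point-to-point RDMA check using the tools above follows a server/client pattern. The commands (commented) need two IB-connected hosts; SERVER_ADDR is a placeholder. The runnable line is just the line-rate sanity arithmetic for a 4x HDR link.

```shell
# Pre-checks on each host:
#   ibstat                    # port State should be Active, physical state LinkUp
#   ibdiagnet                 # fabric-wide diagnostics sweep
# Point-to-point bandwidth test between two hosts:
#   server$ ib_write_bw
#   client$ ib_write_bw SERVER_ADDR     # SERVER_ADDR is a placeholder
# Sanity arithmetic: a 4x HDR link is 4 lanes x 50 Gb/s per lane
lanes=4; gbps_per_lane=50
echo "4x HDR line rate: $((lanes * gbps_per_lane)) Gb/s"
```

Measured `ib_write_bw` results well below the computed line rate usually point at a degraded cable, a mis-seated transceiver, or a link negotiated at a lower speed.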

Automation and Configuration — 10%

  • Manage Spectrum-X switch configurations through NVUE templates.
  • Write Ansible playbooks to automate network setup tasks like VLAN creation or RoCE configuration.
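Such a playbook can stay small if it drives the NVUE CLI directly over SSH. A hedged sketch, assuming Cumulus Linux switches and an inventory group named `leaf_switches` (an assumption, as is driving `nv` via `ansible.builtin.command` rather than the nvidia.nvue collection); interface and VLAN values are illustrative.

```shell
# Written to a file here for inspection; against real switches:
#   ansible-playbook -i inventory vlan.yml
cat > vlan.yml <<'EOF'
---
- name: Put swp1 into VLAN 10 on NVUE-managed switches (illustrative values)
  hosts: leaf_switches
  gather_facts: false
  tasks:
    - name: Set swp1 as an access port in VLAN 10
      ansible.builtin.command: nv set interface swp1 bridge domain br_default access 10
    - name: Apply the pending NVUE configuration
      ansible.builtin.command: nv config apply -y
EOF
grep -c 'ansible.builtin.command' vlan.yml
```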

4) 4 weeks

Google Cloud Platform for ML Infrastructure

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

5) 2 weeks

Infrastructure as Code (IaC):

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

6) 4 weeks

LLM Infrastructure:

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

7) 2 weeks

Project:

  • Project 01: Basic Model Serving System
  • Project 02: LLM Deployment Platform

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

Certifications — Summary

  • Red Hat

    • EX442 — TAKE (Phase 1)
    • EX188 — TAKE (Phase 2)
    • CKA — STUDY ONLY (Phase 2)
  • NVIDIA-Certified Professional (NCP) (Phase 3 — STUDY ONLY)

    • NCP-AII: AI Infrastructure — STUDY ONLY
    • NCP: AI Operations — STUDY ONLY
    • NCP: AI Networking — STUDY ONLY