Program Plan (Total: 34 weeks) — with Books per Phase & NVIDIA Certs in Phase 3
1) 6 weeks
Linux Under the Hood, 2nd Edition
Study guide: RH342 Red Hat Enterprise Linux Diagnostics and Troubleshooting
Study guide: Red Hat Certified Specialist in Performance Tuning (EX442)
EX442 exam - TAKE
Books for this phase:
- Systems Performance, 2nd Edition - Brendan Gregg
- Chapter 2. Methodologies
- Chapter 3. Operating Systems
- Chapter 4. Observability Tools
- Chapter 5. Applications
- Chapter 6. CPUs
- Chapter 7. Memory
- Chapter 8. File Systems
- Chapter 9. Disks
- Chapter 10. Network
- Chapter 12. Benchmarking
- Chapter 13. perf
- Chapter 14. Ftrace
- Chapter 15. BPF
- Linux Kernel Development - Robert Love
- Process Management
- Process Scheduling
- System Calls
- Timers and Time Management
- Memory Management
- The Virtual Filesystem
- The Block I/O Layer
- The Process Address Space
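The memory chapters above come down to reading the counters the kernel already exposes. As a minimal sketch (stdlib only, assuming a Linux host with procfs; the helper name `read_meminfo` is illustrative):

```python
import os

def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into {field: value-in-kB}."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])  # kernel reports kB
    return info

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    mem = read_meminfo()
    print("MemTotal:    ", mem["MemTotal"], "kB")
    print("MemAvailable:", mem["MemAvailable"], "kB")
    print("In use:      ", mem["MemTotal"] - mem["MemAvailable"], "kB")
```

The same parse-a-procfs-file pattern applies to /proc/stat, /proc/diskstats, and the other sources the observability chapters cover.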
2) 8 weeks
Red Hat Certified Specialist in Containers (EX188) - TAKE
Focus: Docker containerization for machine learning applications.
Kubernetes:
- CKA - STUDY ONLY
- Kubernetes fundamentals with a focus on deploying and managing ML workloads.
Observability:
- Monitoring Systems and Services with Prometheus (LFS241)
- Build monitoring systems that track infrastructure health, application performance, and ML model behavior in real time.
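The core of the monitoring item above is exposing metrics in the Prometheus text exposition format. A minimal sketch, stdlib only (the function and metric names are illustrative; a real exporter would use the official prometheus_client library):

```python
def render_metrics(metrics):
    """Render {name: (help_text, value)} as Prometheus gauge metrics."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical ML-infrastructure metrics.
    print(render_metrics({
        "ml_inference_latency_seconds": ("Last inference latency.", 0.042),
        "node_gpu_utilization_ratio": ("GPU utilization, 0-1.", 0.87),
    }))
```

Serving this string over HTTP at /metrics is all a Prometheus scrape target needs.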
Books for this phase:
- Docker Deep Dive
- 1: Containers from 30,000 feet
- 2: Docker and container-related standards and projects
- 3: Getting Docker
- 4: The big picture
- 5: The Docker Engine
- 6: Working with Images
- 7: Working with containers
- 8: Containerizing an app
- 9: Multi-container apps with Compose
- 10: Docker and AI
- 13: Docker Networking
- 14: Docker overlay networking
- 15: Volumes and persistent data
- 16: Docker security
- The Kubernetes Book
- 1: Kubernetes primer
- 2: Kubernetes principles of operation
- 3: Getting Kubernetes
- 4: Working with Pods
- 5: Virtual clusters with Namespaces
- 6: Kubernetes Deployments
- 7: Kubernetes Services
- 8: Ingress
- 10: Service discovery deep dive
- 11: Kubernetes storage
- 12: ConfigMaps and Secrets
- 13: StatefulSets
- 14: API security and RBAC
- 15: The Kubernetes API
- 16: Threat modeling Kubernetes
- 17: Real-world Kubernetes security
- (Observability focus from) Systems Performance, 2nd Edition - Brendan Gregg
- Chapter 4. Observability Tools
- Chapter 12. Benchmarking
- Chapter 13. perf
- Chapter 14. Ftrace
- Chapter 15. BPF
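To make the Deployments material in The Kubernetes Book concrete for ML workloads, a minimal GPU-backed Deployment manifest might look like this (image name and resource values are placeholders; the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: server
        image: registry.example.com/ml-inference:latest  # placeholder
        resources:
          limits:
            nvidia.com/gpu: 1  # requires the NVIDIA device plugin
```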
3) 8 weeks
GPU:
- NVIDIA-Certified Professional: AI Infrastructure (NCP-AII) - STUDY ONLY
- NVIDIA-Certified Professional: AI Operations - STUDY ONLY
- NVIDIA-Certified Professional: AI Networking - STUDY ONLY
- https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/tree/main/lessons/mod-107-gpu-computing
Books for this phase:
- AI Systems Performance Engineering
- Chapter 2. AI System Hardware Overview
- Chapter 3. OS, Docker, and Kubernetes Tuning for GPU-Based Environments
- Chapter 4. Tuning Distributed Networking Communication
- Chapter 5. GPU-Based Storage I/O Optimizations
- Chapter 6. GPU Architecture, CUDA Programming, and Maximizing Occupancy
- Chapter 7. Profiling and Tuning GPU Memory Access Patterns
- Generative AI on Kubernetes
- Chapter 4. Kubernetes and GPUs
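The occupancy topic in the GPU-architecture chapter reduces to simple arithmetic: active warps per SM divided by the hardware maximum. A back-of-the-envelope sketch (deliberately simplified: it only models the warp-count limit, ignoring register and shared-memory pressure):

```python
WARP_SIZE = 32

def occupancy(threads_per_block, blocks_per_sm, max_warps_per_sm=64):
    """Achieved warps per SM divided by the hardware maximum."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    active_warps = min(warps_per_block * blocks_per_sm, max_warps_per_sm)
    return active_warps / max_warps_per_sm

if __name__ == "__main__":
    # 256 threads/block = 8 warps; 8 resident blocks -> 64 warps (full)
    print(f"{occupancy(256, 8):.0%}")
    # 96 threads/block = 3 warps; 4 resident blocks -> 12/64 warps (~19%)
    print(f"{occupancy(96, 4):.0%}")
```

Real calculators (e.g., the one in Nsight Compute) also account for registers per thread and shared memory per block.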
NVIDIA-Certified Professional (NCP) — Topic Outlines
NVIDIA-Certified Professional: AI Infrastructure (NCP-AII)
System and Server Bring-up — 31%
- Describe sequence of events for deployment and validation.
- Describe network topologies for AI factories.
- Perform initial configuration of BMC, OOB, and TPM.
- Perform firmware upgrades (including on HGX™) and fault detection.
- Validate power and cooling parameters.
- Install GPU-based servers (SMI).
- Validate installed hardware.
- Describe and validate cable types and transceivers.
- Install physical GPUs.
- Validate hardware operation for workloads.
- Configure initial parameters for third-party storage.
Physical Layer Management — 5%
- Configure and manage a BlueField® network platform.
- Configure MIG (AI and HPC).
Control Plane Installation and Configuration — 19%
- Install Base Command™ Manager (BCM), configure and verify HA.
- Install OS.
- Install Cluster (configure category, configure interfaces, install Slurm/Enroot/Pyxis).
- Install/update/remove NVIDIA GPU and DOCA™ drivers.
- Install the NVIDIA container toolkit.
- Demonstrate how to use NVIDIA GPUs with Docker.
- Install NGC™ CLI on hosts.
Cluster Test and Verification — 33%
- Perform a single-node stress test.
- Execute HPL (High-Performance Linpack).
- Perform single-node NCCL (including verifying NVLink™ Switch).
- Validate cables by verifying signal quality.
- Confirm cabling is correct.
- Confirm FW/SW on switches.
- Confirm FW/SW on BlueField-3.
- Confirm FW on transceivers.
- Run ClusterKit to perform a multifaceted node assessment.
- Run NCCL to verify E/W fabric bandwidth.
- Perform NCCL burn-in.
- Perform HPL burn-in.
- Perform NeMo™ burn-in.
- Test storage.
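The NCCL bandwidth checks above report "bus bandwidth": for all-reduce, nccl-tests scales the algorithmic bandwidth (bytes/time) by 2*(n-1)/n. A small sketch of that arithmetic (the function name and example numbers are illustrative):

```python
def allreduce_busbw(nbytes, seconds, n_ranks):
    """Bus bandwidth in GB/s for an all-reduce across n_ranks GPUs,
    per the nccl-tests convention: busbw = algbw * 2*(n-1)/n."""
    algbw = nbytes / seconds / 1e9  # algorithmic bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks

if __name__ == "__main__":
    # Hypothetical run: 1 GiB all-reduce over 8 GPUs in 12 ms.
    print(f"{allreduce_busbw(1 << 30, 0.012, 8):.1f} GB/s")
```

Comparing this number against the link's rated bandwidth is what "verify E/W fabric bandwidth" means in practice.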
Troubleshoot and Optimize — 12%
- Identify and troubleshoot hardware faults (e.g., GPU, fan, network card).
- Identify faulty cards, GPUs, and power supplies.
- Replace faulty cards, GPUs, and power supplies.
- Execute performance optimization for AMD and Intel servers.
- Optimize storage.
NVIDIA-Certified Professional: AI Operations
Installation and Deployment — 31%
- Describe the Mission Control toolkit.
- Use BCM's Base View interface to monitor cluster performance, resource utilization, and node health in real time.
- Manage job scheduling and resource allocation using BCM's workload manager (e.g., Slurm or Kubernetes).
- Apply patches, update firmware, and synchronize software images across cluster nodes using BCM.
- Administer user accounts, roles, and permissions to ensure secure access to the cluster using BCM.
- Configure and monitor network settings for cluster nodes, DPUs, and switches using BCM.
- Diagnose and resolve cluster issues, such as job failures, node outages, or resource bottlenecks, using BCM.
- Use BCM to organize and configure compute nodes into categories based on hardware or workload requirements.
- Maintain documentation and generate reports on cluster usage, performance, and issues using BCM.
- Install and initialize Kubernetes on NVIDIA hosts using BCM.
- Deploy DOCA Services on the DPU Arm cores.
- Install Run:ai.
- Install Slurm.
Administration — 23%
- Administer a Slurm cluster.
- Describe data center architecture for AI workloads.
- Administer Run:ai.
- Administer Kubernetes.
- Configure MIG.
Workload Management — 23%
- Deploy inference workloads with Kubernetes.
- Deploy inference workloads with Run:ai.
- Deploy training workloads with Slurm.
- Deploy training workloads with Run:ai.
- Use system management tools to troubleshoot issues.
- Allocate resources between teams with Run:ai, Slurm, and Kubernetes.
- Deploy containers from NGC.
Troubleshooting and Optimization — 23%
- Troubleshoot Docker.
- Troubleshoot the fabric manager service for NVLink and NVSwitch systems.
- Troubleshoot Base Command Manager.
- Troubleshoot Magnum IO components.
- Troubleshoot storage performance.
- Troubleshoot the deployment of a container from NGC.
NVIDIA-Certified Professional: AI Networking
AI Data Center Design and Optimization — 5%
- Describe an AI factory networking architecture and its components (e.g., GPUs, BlueField, Scalable Unit (SU), switches).
- Describe rail-optimized topologies for high-performance AI workloads.
- Describe GPU-to-GPU communications.
NVIDIA Spectrum Networking — 30%
- Configure NVIDIA Spectrum-X switches for RoCE (RDMA over Converged Ethernet) to enable high-speed, low-latency communication.
- Enable and verify quality of service (QoS, ECN, PFC), advanced features (like adaptive routing), and telemetry.
- Configure multi-tenancy BGP-EVPN to isolate tenant workloads.
- Use NVIDIA Air to simulate network environments and identify potential issues.
- Diagnose congestion or packet loss using in-band telemetry and NVIDIA® What Just Happened® (WJH) services.
- Use NetQ™ for real-time network monitoring, including congestion detection and latency measurements.
- Install DOCA™.
- Configure SuperNIC™ functionality for advanced packet processing and congestion control.
NVIDIA InfiniBand Networking — 30%
- Perform initial configuration and provisioning, including high availability (HA).
- Configure partition keys (PKeys) to ensure secure multi-tenancy in InfiniBand networks.
- Configure QoS and adaptive routing to dynamically adjust paths based on congestion.
- Use UFM to monitor InfiniBand link status and bandwidth utilization.
Kubernetes Integration — 5%
- Deploy the NVIDIA Network Operator to manage RDMA interfaces and InfiniBand networks within Kubernetes clusters.
- Verify NVIDIA Network Operator functionality.
Troubleshooting Tools — 20%
- Use tools (like cl-resource-query) to check resource allocation in Spectrum-X environments.
- Use What Just Happened (WJH) services for real-time event analysis.
- Verify low-latency interconnects between GPUs, CPUs, and storage systems.
- Use UFM system health to diagnose IB issues.
- Use commands like ib_write_lat, ib_write_bw, ibping, ibstat, ibdiagnet, ibnodes, and iblinkinfo for diagnosing connectivity issues.
Automation and Configuration — 10%
- Manage Spectrum-X switch configurations through NVUE templates.
- Write Ansible playbooks to automate network setup tasks like VLAN creation or RoCE configuration.
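The Ansible item above can be sketched as a short playbook. This is illustrative only: the host group is a placeholder, and the NVUE commands shown are indicative; check the NVUE CLI reference for the exact syntax on your Cumulus Linux release (a dedicated NVUE Ansible collection also exists):

```yaml
# Illustrative playbook: add a VLAN on Spectrum switches via NVUE.
- name: Create VLAN 100 on Spectrum switches
  hosts: spectrum_switches   # placeholder inventory group
  gather_facts: false
  tasks:
    - name: Add VLAN 100 to the default bridge
      ansible.builtin.command: nv set bridge domain br_default vlan 100

    - name: Apply the pending NVUE configuration
      ansible.builtin.command: nv config apply --assume-yes
```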
4) 4 weeks
Google Cloud Platform for ML Infrastructure
Books for this phase:
- (No specific books assigned from your list for this phase.)
5) 2 weeks
Infrastructure as Code (IaC):
- Terraform
- Ansible
- https://github.com/ai-infra-curriculum/ai-infra-engineer-learning/tree/main/lessons/mod-109-infrastructure-as-code
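As a starting point for the Terraform item, a minimal GPU VM definition on GCP might look like this (a sketch: project, zone, image, and accelerator type are placeholders to adapt):

```hcl
provider "google" {
  project = "my-ml-project"   # placeholder
  region  = "us-central1"
}

resource "google_compute_instance" "gpu_node" {
  name         = "ml-gpu-node"
  machine_type = "n1-standard-8"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"
    count = 1
  }

  scheduling {
    on_host_maintenance = "TERMINATE"  # required for GPU instances
  }
}
```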
Books for this phase:
- (No specific books assigned from your list for this phase.)
6) 4 weeks
LLM Infrastructure:
Books for this phase:
- (No specific books assigned from your list for this phase.)
7) 2 weeks
Project:
- Project 01: Basic Model Serving System
- Project 02: LLM Deployment Platform
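A stdlib-only skeleton for Project 01 might look like the following. The "model" is a stand-in linear function and all names are hypothetical; a real serving system would load a trained artifact and add batching, health checks, and metrics:

```python
import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Placeholder model: a fixed linear function of the inputs."""
    weights = [0.5, -0.25, 1.0]  # stand-in parameters
    return sum(w * x for w, x in zip(weights, features))

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"features": [1.0, 2.0, 3.0]}.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        payload = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__" and "--serve" in sys.argv:
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

Run with `--serve` and POST feature vectors to exercise it; containerizing this (Phase 2) and putting it behind a Deployment/Service is the natural extension.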
Books for this phase:
- (No specific books assigned from your list for this phase.)
Certifications — Summary
Red Hat
- EX442 — TAKE (Phase 1)
- EX188 — TAKE (Phase 2)
CNCF
- CKA — STUDY ONLY (Phase 2)
NVIDIA-Certified Professional (NCP) (Phase 3 — STUDY ONLY)
- NCP-AII: AI Infrastructure — STUDY ONLY
- NCP: AI Operations — STUDY ONLY
- NCP: AI Networking — STUDY ONLY