AI Infrastructure Engineer

Become an AI Infrastructure Engineer


Program Plan (Total: 34 weeks) — with Books per Phase & NVIDIA Certs in Phase 3

1) 6 weeks

Linux performance and troubleshooting:

Linux Under the Hood, 2nd Edition
Study guide: RH342 Red Hat Enterprise Linux Diagnostics and Troubleshooting
Study guide: Red Hat Certified Specialist in Performance Tuning (EX442)
EX442 exam - TAKE

Books for this phase:

  • Systems Performance, 2nd Edition - Brendan Gregg
    • Chapter 2. Methodologies
    • Chapter 3. Operating Systems
    • Chapter 4. Observability Tools
    • Chapter 5. Applications
    • Chapter 6. CPUs
    • Chapter 7. Memory
    • Chapter 8. File Systems
    • Chapter 9. Disks
    • Chapter 10. Network
    • Chapter 12. Benchmarking
    • Chapter 13. perf
    • Chapter 14. Ftrace
    • Chapter 15. BPF
  • Linux Kernel Development, 3rd Edition - Robert Love
    • Chapter 3. Process Management
    • Chapter 4. Process Scheduling
    • Chapter 5. System Calls
    • Chapter 11. Timers and Time Management
    • Chapter 12. Memory Management
    • Chapter 13. The Virtual Filesystem
    • Chapter 14. The Block I/O Layer
    • Chapter 15. The Process Address Space

2) 8 weeks

Containers, Kubernetes, and observability:

Red Hat Certified Specialist in Containers exam | EX188 - TAKE
Docker containerization specifically for machine learning applications.
Kubernetes:

  • CKA - STUDY ONLY
  • Kubernetes fundamentals, with a focus on deploying and managing ML workloads.

Observability:

  • Monitoring Systems and Services with Prometheus (LFS241)
  • Monitoring systems that track infrastructure health, application performance, and ML model behavior in real time.
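Prometheus's core idea, rates over scraped counters, can be tried without a cluster. A minimal sketch with hypothetical samples of a counter such as `http_requests_total` (real `rate()` also handles counter resets, which this skips):

```shell
# Two scrapes of a counter metric, 60 s apart (hypothetical values):
t0=1700000000; v0=8200
t1=1700000060; v1=9100
# PromQL rate() over this window = (v1 - v0) / (t1 - t0)
awk -v dv="$((v1 - v0))" -v dt="$((t1 - t0))" 'BEGIN { printf "%.1f req/s\n", dv / dt }'
```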

Books for this phase:

  • Docker Deep Dive
    • 1: Containers from 30,000 feet
    • 2: Docker and container-related standards and projects
    • 3: Getting Docker
    • 4: The big picture
    • 5: The Docker Engine
    • 6: Working with Images
    • 7: Working with containers
    • 8: Containerizing an app
    • 9: Multi-container apps with Compose
    • 10: Docker and AI
    • 13: Docker Networking
    • 14: Docker overlay networking
    • 15: Volumes and persistent data
    • 16: Docker security
  • The Kubernetes Book
    • 1: Kubernetes primer
    • 2: Kubernetes principles of operation
    • 3: Getting Kubernetes
    • 4: Working with Pods
    • 5: Virtual clusters with Namespaces
    • 6: Kubernetes Deployments
    • 7: Kubernetes Services
    • 8: Ingress
    • 10: Service discovery deep dive
    • 11: Kubernetes storage
    • 12: ConfigMaps and Secrets
    • 13: StatefulSets
    • 14: API security and RBAC
    • 15: The Kubernetes API
    • 16: Threat modeling Kubernetes
    • 17: Real-world Kubernetes security
  • (Observability focus from) Systems Performance, 2nd Edition - Brendan Gregg
    • Chapter 4. Observability Tools
    • Chapter 12. Benchmarking
    • Chapter 13. perf
    • Chapter 14. Ftrace
    • Chapter 15. BPF

3) 8 weeks

GPU infrastructure and NVIDIA certification study:

Books for this phase:

  • AI Systems Performance Engineering
    • Chapter 2. AI System Hardware Overview
    • Chapter 3. OS, Docker, and Kubernetes Tuning for GPU-Based Environments
    • Chapter 4. Tuning Distributed Networking Communication
    • Chapter 5. GPU-Based Storage I/O Optimizations
    • Chapter 6. GPU Architecture, CUDA Programming, and Maximizing Occupancy
    • Chapter 7. Profiling and Tuning GPU Memory Access Patterns
  • Generative AI on Kubernetes
    • Chapter 4. Kubernetes and GPUs
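The Kubernetes-and-GPUs chapter reduces to one key manifest idea: GPUs are requested like any other extended resource. A sketch, assuming the NVIDIA device plugin is running on the cluster; the pod name and image tag are illustrative, and the manifest is written to a file here so the request can be inspected.

```shell
# Write the manifest locally; on a cluster: kubectl apply -f gpu-pod.yaml
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # one whole GPU via the NVIDIA device plugin
EOF
grep -c 'nvidia.com/gpu' gpu-pod.yaml
```

Running `kubectl logs cuda-smoke-test` after apply should show the `nvidia-smi` table if scheduling and the device plugin are healthy.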

NVIDIA-Certified Professional (NCP) — Topic Outlines

NVIDIA-Certified Professional: AI Infrastructure (NCP-AII)

System and Server Bring-up — 31%

  • Describe sequence of events for deployment and validation.
  • Describe network topologies for AI factories.
  • Perform initial configuration of BMC, OOB, and TPM.
  • Perform firmware upgrades (including on HGX™) and fault detection.
  • Validate power and cooling parameters.
  • Install GPU-based servers (SMI).
  • Validate installed hardware.
  • Describe and validate cable types and transceivers.
  • Install physical GPUs.
  • Validate hardware operation for workloads.
  • Configure initial parameters for third-party storage.

Physical Layer Management — 5%

  • Configure and manage a BlueField® network platform.
  • Configure MIG (AI and HPC).
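MIG configuration is easier to reason about with the slice arithmetic written out. A sketch: the commented commands are the usual `nvidia-smi mig` workflow, and the geometry assumes an A100-class GPU with 7 compute slices (profile IDs vary by GPU; check `nvidia-smi mig -lgip`).

```shell
# Typical MIG workflow on a supported GPU (root required, GPU must be idle):
#   nvidia-smi -i 0 -mig 1            # enable MIG mode on GPU 0
#   nvidia-smi mig -lgip              # list GPU instance profiles and their IDs
#   nvidia-smi mig -cgi 19,19,19 -C   # create three 1g instances + compute instances
# Slice arithmetic, assuming 7 compute slices on the GPU:
total_slices=7
slices_per_1g=1
echo "max 1g instances: $((total_slices / slices_per_1g))"
```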

Control Plane Installation and Configuration — 19%

  • Install Base Command™ Manager (BCM), configure and verify HA.
  • Install OS.
  • Install Cluster (configure category, configure interfaces, install Slurm/Enroot/Pyxis).
  • Install/update/remove NVIDIA GPU and DOCA™ drivers.
  • Install the NVIDIA container toolkit.
  • Demonstrate how to use NVIDIA GPUs with Docker.
  • Install NGC™ CLI on hosts.
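Once the container toolkit is installed, GPU access in Docker comes down to the `--gpus` flag. A sketch; the real runs (commented) need a GPU host, so the runnable part only composes the command string, and `<image>` is a placeholder.

```shell
# With the NVIDIA Container Toolkit installed and Docker configured for it:
#   docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
#   docker run --rm --gpus '"device=0,1"' <image> nvidia-smi    # pin GPUs 0 and 1
# Composing the flag from an inventory variable (runnable without a GPU):
gpus="0,1"
echo "docker run --rm --gpus \"device=${gpus}\" <image> nvidia-smi"
```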

Cluster Test and Verification — 33%

  • Perform a single-node stress test.
  • Execute HPL (High-Performance Linpack).
  • Perform single-node NCCL (including verifying NVLink™ Switch).
  • Validate cables by verifying signal quality.
  • Confirm cabling is correct.
  • Confirm FW/SW on switches.
  • Confirm FW/SW on BlueField-3.
  • Confirm FW on transceivers.
  • Run ClusterKit to perform a multifaceted node assessment.
  • Run NCCL to verify E/W fabric bandwidth.
  • Perform NCCL burn-in.
  • Perform HPL burn-in.
  • Perform NeMo™ burn-in.
  • Test storage.
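The NCCL runs above report results as bus bandwidth, and knowing the conversion helps judge a burn-in. For all-reduce, nccl-tests computes busbw = algbw × 2(n−1)/n over n ranks; the sample figures below are hypothetical.

```shell
# A typical single-node run from nccl-tests (needs 8 GPUs):
#   ./all_reduce_perf -b 8 -e 8G -f 2 -g 8
# busbw = algbw * 2*(n-1)/n for all-reduce; sample numbers are hypothetical:
n=8            # ranks (GPUs)
algbw=240      # measured algorithm bandwidth, GB/s
awk -v n="$n" -v a="$algbw" 'BEGIN { printf "busbw: %.0f GB/s\n", a * 2 * (n - 1) / n }'
```

Comparing busbw (not algbw) against the link's rated bandwidth is what makes single-node and multi-node results directly comparable.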

Troubleshoot and Optimize — 12%

  • Identify and troubleshoot hardware faults (e.g., GPU, fan, network card).
  • Identify faulty cards, GPUs, and power supplies.
  • Replace faulty cards, GPUs, and power supplies.
  • Execute performance optimization for AMD and Intel servers.
  • Optimize storage.

NVIDIA-Certified Professional: AI Operations

Installation and Deployment — 31%

  • Describe the Mission Control toolkit.
  • Use BCM’s Base View interface to monitor cluster performance, resource utilization, and node health in real time.
  • Manage job scheduling and resource allocation using BCM’s workload manager (e.g., Slurm or Kubernetes).
  • Apply patches, update firmware, and synchronize software images across cluster nodes using BCM.
  • Administer user accounts, roles, and permissions to ensure secure access to the cluster using BCM.
  • Configure and monitor network settings for cluster nodes, DPUs, and switches using BCM.
  • Diagnose and resolve cluster issues, such as job failures, node outages, or resource bottlenecks, using BCM.
  • Use BCM to organize and configure compute nodes into categories based on hardware or workload requirements.
  • Using BCM, maintain documentation and generate reports on cluster usage, performance, and issues.
  • Install and initialize Kubernetes on NVIDIA hosts using BCM.
  • Deploy DOCA Services on the DPU Arm cores.
  • Install Run:ai.
  • Install Slurm.

Administration — 23%

  • Administer Slurm cluster.
  • Describe data center architecture for AI workloads.
  • Administer Run:ai.
  • Administer Kubernetes.
  • Configure MIG.

Workload Management — 23%

  • Deploy inference workloads with Kubernetes.
  • Deploy inference workloads with Run:ai.
  • Deploy training workloads with Slurm.
  • Deploy training workloads with Run:ai.
  • Use system management tools to troubleshoot issues.
  • Allocate resources between teams with Run:ai, Slurm, and Kubernetes.
  • Deploy containers from NGC.

Troubleshooting and Optimization — 23%

  • Troubleshoot Docker.
  • Troubleshoot the fabric manager service for NVLink and NVSwitch systems.
  • Troubleshoot Base Command Manager.
  • Troubleshoot Magnum IO components.
  • Troubleshoot storage performance.
  • Troubleshoot the deployment of a container from NGC.

NVIDIA-Certified Professional: AI Networking

AI Data Center Design and Optimization — 5%

  • Describe an AI factory networking architecture and its components (e.g., GPUs, BlueField, Scalable Unit (SU), switches).
  • Describe rail-optimized topologies for high-performance AI workloads.
  • Describe GPU-to-GPU communications.

NVIDIA Spectrum Networking — 30%

  • Configure NVIDIA Spectrum-X switches for RoCE (RDMA over Converged Ethernet) to enable high-speed, low-latency communication.
  • Enable and verify quality of service (QoS, ECN, PFC), advanced features (like adaptive routing), and telemetry.
  • Configure multi-tenancy BGP-EVPN to isolate tenant workloads.
  • Use NVIDIA Air to simulate network environments and identify potential issues.
  • Diagnose congestion or packet loss using in-band telemetry and NVIDIA® What Just Happened® (WJH) services.
  • Use NetQ™ for real-time network monitoring, including congestion detection and latency measurements.
  • Install DOCA™.
  • Configure SuperNIC™ functionality for advanced packet processing and congestion control.

NVIDIA InfiniBand Networking — 30%

  • Perform initial configuration and provisioning, including high availability (HA).
  • Configure partition keys (PKeys) to ensure secure multi-tenancy in InfiniBand networks.
  • Configure QoS and adaptive routing to dynamically adjust paths based on congestion.
  • Use UFM to monitor InfiniBand link status and bandwidth utilization.

Kubernetes Integration — 5%

  • Deploy the NVIDIA Network Operator to manage RDMA interfaces and InfiniBand networks within Kubernetes clusters.
  • Verify NVIDIA Network Operator functionality.

Troubleshooting Tools — 20%

  • Use tools (like cl-resource-query) to check resource allocation in Spectrum-X environments.
  • Use What Just Happened (WJH) services for real-time event analysis.
  • Verify low-latency interconnects between GPUs, CPUs, and storage systems.
  • Use UFM system health to diagnose IB issues.
  • Use commands like ib_write_lat, ib_write_bw, ibping, ibstat, ibdiagnet, ibnodes, and iblinkinfo for diagnosing connectivity issues.
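A typical point-to-point RDMA check using the tools above follows a server/client pattern. The commands (commented) need two IB-connected hosts; SERVER_ADDR is a placeholder. The runnable line is just the line-rate sanity arithmetic for a 4x HDR link.

```shell
# Pre-checks on each host:
#   ibstat                    # port State should be Active, physical state LinkUp
#   ibdiagnet                 # fabric-wide diagnostics sweep
# Point-to-point bandwidth test between two hosts:
#   server$ ib_write_bw
#   client$ ib_write_bw SERVER_ADDR     # SERVER_ADDR is a placeholder
# Sanity arithmetic: a 4x HDR link is 4 lanes x 50 Gb/s per lane
lanes=4; gbps_per_lane=50
echo "4x HDR line rate: $((lanes * gbps_per_lane)) Gb/s"
```

Measured `ib_write_bw` results well below the computed line rate usually point at a degraded cable, a mis-seated transceiver, or a link negotiated at a lower speed.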

Automation and Configuration — 10%

  • Manage Spectrum-X switch configurations through NVUE templates.
  • Write Ansible playbooks to automate network setup tasks like VLAN creation or RoCE configuration.
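Such a playbook can stay small if it drives the NVUE CLI directly over SSH. A hedged sketch, assuming Cumulus Linux switches and an inventory group named `leaf_switches` (an assumption, as is driving `nv` via `ansible.builtin.command` rather than the nvidia.nvue collection); interface and VLAN values are illustrative.

```shell
# Written to a file here for inspection; against real switches:
#   ansible-playbook -i inventory vlan.yml
cat > vlan.yml <<'EOF'
---
- name: Put swp1 into VLAN 10 on NVUE-managed switches (illustrative values)
  hosts: leaf_switches
  gather_facts: false
  tasks:
    - name: Set swp1 as an access port in VLAN 10
      ansible.builtin.command: nv set interface swp1 bridge domain br_default access 10
    - name: Apply the pending NVUE configuration
      ansible.builtin.command: nv config apply -y
EOF
grep -c 'ansible.builtin.command' vlan.yml
```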

4) 4 weeks

Google Cloud Platform for ML Infrastructure

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

5) 2 weeks

Infrastructure as Code (IaC):

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

6) 4 weeks

LLM Infrastructure:

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

7) 2 weeks

Project:

  • Project 01: Basic Model Serving System
  • Project 02: LLM Deployment Platform

Books for this phase:

  • (No books from the reading list are assigned to this phase.)

Certifications — Summary

  • Red Hat

    • EX442 — TAKE (Phase 1)
    • EX188 — TAKE (Phase 2)
    • CKA — STUDY ONLY (Phase 2)
  • NVIDIA-Certified Professional (NCP) (Phase 3 — STUDY ONLY)

    • NCP-AII: AI Infrastructure — STUDY ONLY
    • NCP: AI Operations — STUDY ONLY
    • NCP: AI Networking — STUDY ONLY