From Silicon to Softmax
A structured curriculum that runs from bare-metal systems programming to distributed GPU clusters: the path from writing apps to making $100B data centers work.
8 modules · 5 lessons
Module 1: The Low-Level Foundation
Systems programming in Rust, CPU architecture, SIMD, memory hierarchy, and Linux performance profiling.
Module 2: GPU & Parallelism
CUDA programming, Triton kernels, parallel algorithms, kernel fusion, and FlashAttention.
Coming soon
Module 3: Distributed Systems
RDMA, InfiniBand, NCCL, distributed training with DDP/FSDP, and the 3D parallelism grid.
Coming soon
Module 4: ML Internals & Optimization
Quantization, inference optimization, and the Rust GPU frontier.
Coming soon
Module 5: Cluster Orchestration
Kubernetes for ML, Slurm, Volcano/Kueue, MPI Operator, KubeRay, topology-aware scheduling, multi-tenancy — running training jobs on shared GPU clusters.
Module 6: ML Platform Engineering
Experiment tracking, model registry, training observability, workflow orchestration, CI/CD for models, cost attribution — the infrastructure that turns one good training run into a reliable model factory.
Module 7: Inference from Scratch
A tutorial series on modern LLM inference — attention variants (GQA, MLA), positional encodings, KV cache, Mixture of Experts, multi-token prediction, and serving internals.
Module 8: Agents from Scratch
A tutorial series on building production-grade LLM agents — the agent loop, tool use, memory, RAG, planning, context engineering, multi-agent systems, and distributed agent infrastructure.