Scaling Distributed Training to 10,000 GPUs
A deep dive into the challenges and solutions for training large models at extreme scale, including communication optimizations and fault tolerance strategies.
Building scalable ML platforms, distributed training systems, and AI infrastructure. Expert in MLOps, model optimization, and production-grade machine learning.
I am a Machine Learning Systems Engineer with deep expertise in building production-grade AI infrastructure. My work spans from distributed training systems to edge deployment, with a focus on scalability, efficiency, and reliability.
Large-scale model training across GPU clusters with optimized data parallelism and model parallelism strategies (a minimal data-parallel sketch follows this list).
Building robust ML platforms with Kubernetes, Kubeflow, and custom orchestration for production workloads.
High-throughput data processing pipelines with Apache Spark, Ray, and modern data lake architectures.
Multi-cloud deployments on AWS, GCP, and Azure with focus on cost optimization and scalability.
Quantization, pruning, and distillation techniques for deploying efficient models at the edge and in the cloud.
End-to-end ML lifecycle management with CI/CD, monitoring, and enterprise-grade security practices.
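As a concrete illustration of the distributed training item above, here is a minimal data-parallel training sketch using PyTorch DistributedDataParallel. The toy model, tensor shapes, and hyperparameters are placeholders rather than details of any specific project; the script assumes a launch via torchrun (for example, torchrun --nproc_per_node=8 train.py).

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to("cuda")   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")     # stand-in for a real data loader
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # DDP overlaps gradient all-reduce with backward compute
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()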
Selected projects showcasing expertise in ML systems, distributed computing, and production AI infrastructure.
A fault-tolerant distributed training system for large language models with automatic checkpointing and elastic scaling (a minimal checkpointing sketch follows this list).
High-performance inference serving with dynamic batching, model quantization, and multi-model GPU sharing.
Low-latency feature computation and serving for online ML models with point-in-time correctness guarantees.
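To make the checkpointing idea in the distributed training project above concrete, here is a minimal sketch of atomic checkpoint save and resume in PyTorch. It illustrates the general pattern only, not the project's actual implementation; the file layout and state-dict keys are assumptions.

import os
import torch

def save_checkpoint(path, model, optimizer, step):
    # Write to a temporary file first, then atomically rename, so a crash
    # mid-write never corrupts the most recent checkpoint.
    tmp = path + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, path)

def load_checkpoint(path, model, optimizer):
    # Returns the step to resume from; 0 means start a fresh run.
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]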
Technical blog posts, paper reviews, and tutorials on ML systems, infrastructure, and distributed computing.
Analysis of ByteDance's MegaScale framework for training LLMs on over 10,000 GPUs, covering their design principles and lessons learned.
Step-by-step tutorial on setting up end-to-end ML workflows with Kubeflow Pipelines, including best practices for CI/CD integration.
Techniques for reducing latency and increasing throughput in production transformer serving, including KV-cache optimization and speculative decoding.
Review of recent advances in efficient training methods including mixture of experts, activation checkpointing, and 8-bit optimizers.
Hands-on tutorial for writing custom GPU kernels using OpenAI's Triton, from basics to advanced optimizations.
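For a taste of what the Triton tutorial covers, here is the classic element-wise add kernel, roughly following the introductory example in Triton's documentation; the block size and grid setup are illustrative choices, and x and y are assumed to be CUDA tensors of the same shape.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out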
Stay updated with the latest news, research papers, and essential tools in the ML systems and infrastructure space.
New technical report reveals innovations in mixture of experts scaling and training efficiency improvements.
Major release brings built-in support for async checkpointing and automatic fault recovery for large-scale training.
New architecture delivers 4x performance for transformer inference with dedicated FP4 compute units.
Inference: High-throughput LLM inference with PagedAttention (a minimal serving sketch follows this list).
Training: 2-5x faster LLM fine-tuning with 80% less memory.
Optimization: Port of LLaMA models in C/C++ for edge deployment.
Local AI: Run LLMs locally with a simple CLI and API.
Get curated news, paper summaries, and tool recommendations delivered weekly.
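As a quick illustration of the high-throughput inference tool listed first above, here is a minimal offline-generation sketch using vLLM's Python API; the model name, prompt, and sampling parameters are placeholders.

from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in fixed-size blocks, so many requests
# can share one GPU without fragmenting memory.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model name
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)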
Whether you are looking for a senior ML Systems engineer, need consulting on AI infrastructure, or want to collaborate on research, I would love to hear from you.
Open to senior ML Systems and Infrastructure roles (View Resume).
Expert consulting on ML infrastructure and MLOps (Book a Call).
Interested in research collaborations or open source (Let's Talk).
Questions, feedback, or just want to say hi (Send Email).