Scaling Distributed Training to 10,000 GPUs
A deep dive into the challenges and solutions for training large models at extreme scale, including communication optimizations and fault tolerance strategies.
Building scalable ML platforms, distributed training systems, and AI infrastructure. Expert in MLOps, model optimization, and production-grade machine learning.
I am a Machine Learning Systems Engineer with deep expertise in building production-grade AI infrastructure. My work spans from distributed training systems to edge deployment, with a focus on scalability, efficiency, and reliability.
Large-scale model training across GPU clusters with optimized data parallelism and model parallelism strategies (a minimal data-parallel sketch follows this list).
Building robust ML platforms with Kubernetes, Kubeflow, and custom orchestration for production workloads.
High-throughput data processing pipelines with Apache Spark, Ray, and modern data lake architectures.
Multi-cloud deployments on AWS, GCP, and Azure with focus on cost optimization and scalability.
Quantization, pruning, and distillation techniques for deploying efficient models at the edge and in the cloud.
End-to-end ML lifecycle management with CI/CD, monitoring, and enterprise-grade security practices.
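As a concrete illustration of the distributed training item above, here is a minimal data-parallel training sketch using PyTorch DistributedDataParallel. The toy model, tensor shapes, and hyperparameters are placeholders rather than details of any specific project; the script assumes a launch via torchrun (for example, torchrun --nproc_per_node=8 train.py).

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to("cuda")   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")     # stand-in for a real data loader
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # DDP overlaps gradient all-reduce with backward compute
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()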
Selected projects showcasing expertise in ML systems, distributed computing, and production AI infrastructure.
A fault-tolerant distributed training system for large language models with automatic checkpointing and elastic scaling (a minimal checkpointing sketch follows this list).
High-performance inference serving with dynamic batching, model quantization, and multi-model GPU sharing.
Low-latency feature computation and serving for online ML models with point-in-time correctness guarantees.
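To make the checkpointing idea in the distributed training project above concrete, here is a minimal sketch of atomic checkpoint save and resume in PyTorch. It illustrates the general pattern only, not the project's actual implementation; the file layout and state-dict keys are assumptions.

import os
import torch

def save_checkpoint(path, model, optimizer, step):
    # Write to a temporary file first, then atomically rename, so a crash
    # mid-write never corrupts the most recent checkpoint.
    tmp = path + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, path)

def load_checkpoint(path, model, optimizer):
    # Returns the step to resume from; 0 means start a fresh run.
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]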
Technical blog posts, paper reviews, and tutorials on ML systems, infrastructure, and distributed computing.
Analysis of ByteDance's MegaScale framework for training LLMs on over 10,000 GPUs, covering their design principles and lessons learned.
Step-by-step tutorial on setting up end-to-end ML workflows with Kubeflow Pipelines, including best practices for CI/CD integration.
Techniques for reducing latency and increasing throughput in production transformer serving, including KV-cache optimization and speculative decoding.
Review of recent advances in efficient training methods including mixture of experts, activation checkpointing, and 8-bit optimizers.
Hands-on tutorial for writing custom GPU kernels using OpenAI's Triton, from basics to advanced optimizations.
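For a taste of what the Triton tutorial covers, here is the classic element-wise add kernel, roughly following the introductory example in Triton's documentation; the block size and grid setup are illustrative choices, and x and y are assumed to be CUDA tensors of the same shape.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out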
Stay updated with the latest news, research papers, and essential tools in the ML systems and infrastructure space.
New technical report reveals innovations in mixture of experts scaling and training efficiency improvements.
Major release brings built-in support for async checkpointing and automatic fault recovery for large-scale training.
New architecture delivers 4x performance for transformer inference with dedicated FP4 compute units.
Inference: High-throughput LLM inference with PagedAttention (a minimal serving sketch follows this list).
Training: 2-5x faster LLM fine-tuning with 80% less memory.
Optimization: Port of LLaMA models in C/C++ for edge deployment.
Local AI: Run LLMs locally with a simple CLI and API.
Get curated news, paper summaries, and tool recommendations delivered weekly.
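As a quick illustration of the high-throughput inference tool listed first above, here is a minimal offline-generation sketch using vLLM's Python API; the model name, prompt, and sampling parameters are placeholders.

from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in fixed-size blocks, so many requests
# can share one GPU without fragmenting memory.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model name
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)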
Whether you are looking for a senior ML Systems engineer, need consulting on AI infrastructure, or want to collaborate on research, I would love to hear from you.
Open to senior ML Systems and Infrastructure roles (View Resume).
Expert consulting on ML infrastructure and MLOps (Book a Call).
Interested in research collaborations or open source (Let's Talk).
Questions, feedback, or just want to say hi (Send Email).