Atandra Bharati atandra2000

Deep Learning Research Engineer — building frontier AI architectures from scratch in raw PyTorch.

LLMs · Latent Diffusion · Multimodal · Video Understanding · Agentic ML · State-Space Models · Long-Context Attention

14 from-scratch projects · 78% peak-memory reduction in LLM pretraining · 878-test agentic platform · 860M-param UNet trained from random init

🎯 Open To

Deep Learning Research Engineer · LLM Engineer · GenAI / Diffusion Engineer · Agentic ML Engineer

Remote-friendly · Available worldwide

🧭 Now

Shipping the Autonomous ML Research Engineer platform (15 phases, 23 agents). Just released: two long-context / state-space reproductions, GPT-OSS-Lite (sliding/full attention alternation + learned sinks, 2× KV-cache cut at 128K) and Mamba-3-Lite (complex-valued SSD + MIMO mixing, zero causal conv).

🛠️ Stack

Languages & ML core

Architectures
Transformers · GQA · MLA · Sliding/Full Attention Alternation · Learned Attention Sinks · YaRN RoPE · SwiGLU · RMSNorm · MoE · Gated Delta Net · MTP · SSD (real & complex64) · MIMO head mixing · Diffusion UNet · VAE · GAN · CycleGAN · AdaIN · ST-GCN · HRNet · SigLIP

Optimization & numerics
BF16 · FP16 · FP8 · Flash Attention 2 · SDPA · torch.compile · channels_last · Gradient checkpointing · μP scaling · WSD LR · NorMuon · CautiousAdamW · Chunked cross-entropy · Disk-backed token caching · Fused optimizers · Chinchilla-optimal scaling

Hardware validated
A100 80GB · RTX 5090 (Blackwell) · RTX 6000 Ada · RTX 3090 · P100 · 2× T4

Tooling
HuggingFace · diffusers · tiktoken · W&B · Comet · safetensors · ONNX · TensorRT · FastAPI · pydantic v2 · ChromaDB · Ollama Cloud
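
For readers unfamiliar with the WSD LR entry under Optimization & numerics, here is a minimal sketch of a warmup-stable-decay schedule. The function name `wsd_lr`, its parameters, and the linear decay shape are illustrative assumptions; the projects may use a different decay curve or step counts.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay learning rate for a 0-indexed optimizer step.

    Assumption: linear warmup and linear decay. Other WSD variants
    (cosine or 1-sqrt decay) only change the final branch.
    """
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long constant plateau at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to min_lr, then hold at min_lr.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical 1000-step run: 100 warmup, 800 stable, 100 decay.
schedule = [wsd_lr(s, 3e-4, 100, 800, 100) for s in range(1001)]
```

The plateau is what makes this schedule convenient for resumable runs: a checkpoint taken anywhere in the stable phase can be decayed later without committing to a total step count up front.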

🏆 Highlights

78% peak memory reduction (92 GB → 20 GB) for LLM pretraining via gradient checkpointing, chunked cross-entropy, and disk-backed token caching — enabling 2× batch-size headroom on a single A100 80GB.
Training loss 0.0947 at epoch 16 on Stable Diffusion 1.x (860M UNet) trained from random init across a 7-phase curriculum on 2× RTX 5090; epoch-42 checkpoint released on HuggingFace.
2× KV-cache reduction at 128K context in GPT-OSS-Lite via sliding-window(128) / full-attention alternation with learned attention-sink bias and YaRN RoPE — verified at 1.13 GB vs 2.25 GB pure GQA (BF16).
Complex64 SSD with a 50% smaller state (N=64 vs Mamba-2's N=128) reaches loss parity on the same 8.0B-token Chinchilla run, with MIMO inter-head mixing and zero causal convolution; pure PyTorch, no custom CUDA.
~30 FPS inference on RTX 3090 for skeleton-based action recognition, served via ONNX + TensorRT + FastAPI.
878 passing tests · 15 cooperating phases · 23 agents · 61 tools · 186 models in the Autonomous ML Research Engineer platform — full paper-to-conclusions loop with self-repair and provider-agnostic LLM routing.
415.6M active / 868.6M stored params in FusionLLM — a novel hybrid of MLA + Gated Delta Net + MoE + MTP in a 24-layer decoder.
643-line technical deep-dive on MLA (Multi-Head Latent Attention) covering KV-cache math, low-rank compression, the absorption-trick derivation, and decoupled RoPE mechanics.
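
The 2× KV-cache figure above follows from simple arithmetic, sketched below. The layer count, KV-head count, and head dimension are hypothetical values picked so the full-attention total lands on 2.25 GiB; the actual GPT-OSS-Lite configuration is not stated in this README.

```python
def kv_cache_bytes(context_len, layer_windows, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size in bytes for one sequence.

    layer_windows holds one entry per layer: a sliding-window length,
    or None for a full-attention layer that caches every token.
    """
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    cached_tokens = sum(
        context_len if w is None else min(w, context_len) for w in layer_windows
    )
    return per_token * cached_tokens


CONTEXT = 128 * 1024                          # 128K tokens
N_LAYERS, N_KV_HEADS, HEAD_DIM = 18, 4, 64    # hypothetical config (assumption)

full = kv_cache_bytes(CONTEXT, [None] * N_LAYERS, N_KV_HEADS, HEAD_DIM)
alternating = kv_cache_bytes(
    CONTEXT,
    [128 if i % 2 == 0 else None for i in range(N_LAYERS)],  # sliding(128) / full
    N_KV_HEADS,
    HEAD_DIM,
)

GIB = 1024 ** 3
print(f"pure GQA:    {full / GIB:.2f} GiB")
print(f"alternating: {alternating / GIB:.2f} GiB")
print(f"ratio:       {full / alternating:.3f}x")
```

With half the layers capped at a 128-token window, the ratio is 2T / (T + 128), which is just under 2 at T = 128K regardless of the head configuration.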

📂 Projects

Domain	Project	Highlight	Hardware	Repo
LLM	GPT-OSS-Lite (502M / 247M active)	Sliding(128)/Full attention alt · learned sink bias · YaRN 128K · top-2-of-8 MoE · 2× KV-cache cut at 128K · 130 tests	A100 80GB	→
LLM	Mamba-3-Lite (404M)	Complex64 SSD (N=64) · MIMO head mixing · zero causal conv · pure PyTorch (no `mamba-ssm`, no custom CUDA)	A100 80GB	→
LLM	DeepSeek-v3-Lite (422M)	MLA + AuxLossFreeGate MoE + MTP, end-to-end with absorption-trick inference	A100 80GB	→
LLM	LLaMA-3-Lite (515M)	GQA · RoPE θ=500K · SwiGLU · RMSNorm · FA2 · chunked CE · 78% memory cut	A100 80GB	→
LLM	FusionLLM (415.6M / 868.6M)	Novel MLA + Gated Delta Net + MoE + MTP hybrid · NorMuon + CautiousAdamW · WSD	A100 80GB	→
LLM	GPT-From-Scratch (~6M)	Educational GPT-style decoder · 4 layers · char-level tokenizer · loss 8.69 → 0.83 · HF weight loading	P100 / CUDA / MPS	→
LLM	TranslationLM (EN→IT)	Encoder–decoder Transformer · loss 6.17 → 2.28 · BLEU/CER/WER	P100	→
Vision	Stable Diffusion 1.x (860M UNet)	Custom UNet trained from random init · 7 phases · 1.3M+ images · best loss 0.0947 · epoch-42 checkpoint on HF	2× RTX 5090	→
Vision	ActionRecognition (120 cls)	HRNet pose + Two-Stream CTR-GCN · ~30 FPS · ONNX + TensorRT	RTX 3090	→
Vision	FaceAgingCycleGAN (256²)	Per-layer AdaIN conditioning · 3-scale PatchGAN · LSGAN + R1 GP	RTX 6000 Ada	→
Vision	FaceGenerationVAE (β-VAE)	50 epochs · recon MSE 0.0152 · linear KL annealing · bilinear-upsample decoder	P100	→
Vision	DCGAN-Face-Generation	50 epochs · 202K CelebA · D loss → ln 2 ≈ 0.693 equilibrium	2× T4	→
Multimodal	VisionLangModel (PaliGemma-style)	SigLIP ViT + Gemma decoder + linear projector · zero pretrained weights	P100	→
Agentic	Autonomous ML Research Engineer	15-phase multi-agent platform · paper → plan → patch → train → evaluate → report · provider-agnostic LLM routing	Local + Ollama Cloud	→
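
The learned sink bias listed for GPT-OSS-Lite can be illustrated with a small pure-Python sketch. It assumes the sink enters as one extra logit in the softmax denominator and is then dropped, and it reuses the [-10, 15] clamp range mentioned in the Writing section; `softmax_with_sink` is a hypothetical name, not a function from the repo.

```python
import math


def softmax_with_sink(scores, sink_logit, clamp=(-10.0, 15.0)):
    """Softmax over attention scores with one extra sink logit.

    The sink absorbs probability mass but contributes no value vector,
    so the returned weights sum to less than 1.
    """
    sink = max(clamp[0], min(clamp[1], sink_logit))   # keep the sink in a safe range
    m = max(max(scores), sink)                        # subtract the max for stability
    exp_scores = [math.exp(s - m) for s in scores]
    denom = sum(exp_scores) + math.exp(sink - m)
    return [e / denom for e in exp_scores]


weights = softmax_with_sink([1.0, 2.0, 3.0], sink_logit=0.0)
leftover = 1.0 - sum(weights)   # mass absorbed by the sink
```

A head with nothing useful to attend to can push its mass into the sink instead of spreading it over real tokens, which is the behavior the sliding-window layers rely on.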

✍️ Writing

Multi-Head Latent Attention — A Technical Deep-Dive — 643-line reference covering KV-cache math, low-rank compression algebra, the absorption-trick derivation, decoupled RoPE mechanics, and SDPA vs manual attention trade-offs in DeepSeek-V2/V3.
Attention Sinks — StreamingLLM for GPT-OSS — 600-line reference on the learned per-head sink bias, its BF16 numerical-stability story (clamped to [-10, 15]), and its interaction with sliding/full attention alternation.
State-Space Duality — The Mamba-3 Chunkwise SSD — full derivation of the chunkwise SSD algorithm and its equivalence to the naive O(T) recurrence reference.
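
The chunkwise/recurrent equivalence described in the last entry can be checked on a toy case. The sketch below is a scalar, complex-valued simplification written for illustration only: real SSD layers use matrix-valued states and batched tensor ops, and the function names and test inputs here are assumptions.

```python
import cmath
import math


def ssm_recurrent(a, b, c, x):
    """Reference: one sequential state update per time step."""
    h, ys = 0j, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def ssm_chunked(a, b, c, x, chunk=4):
    """Same outputs, computed chunk by chunk with one carried state per chunk."""
    h_in, ys = 0j, []
    for start in range(0, len(x), chunk):
        end = min(start + chunk, len(x))
        for t in range(start, end):
            # Inter-chunk term: carried state decayed by a[start..t].
            h_t = math.prod(a[start:t + 1], start=1 + 0j) * h_in
            # Intra-chunk term: each input s <= t, decayed by a[s+1..t].
            for s in range(start, t + 1):
                h_t += math.prod(a[s + 1:t + 1], start=1 + 0j) * b[s] * x[s]
            ys.append(c[t] * h_t)
        h_in = h_t  # state at the chunk's last step feeds the next chunk
    return ys


T = 10
a = [cmath.rect(0.9, 0.3 * t) for t in range(T)]   # complex decay, |a| < 1
b = [complex(1.0, 0.1 * t) for t in range(T)]
c = [complex(0.5, -0.2 * t) for t in range(T)]
x = [float(t % 3) - 1.0 for t in range(T)]

max_err = max(
    abs(u - v)
    for u, v in zip(ssm_recurrent(a, b, c, x), ssm_chunked(a, b, c, x, chunk=4))
)
```

The intra-chunk double loop is the part that becomes a masked matrix product in a tensor implementation, while the carried state `h_in` is the only sequential dependency between chunks.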

🔬 Engineering Themes

From-scratch PyTorch — no Trainer, no Lightning, no accelerate; every layer written by hand
Single-GPU feasibility — BF16, gradient checkpointing, FA2, channels_last, fused optimizers
Faithful reproductions — DeepSeek-V3, LLaMA-3, GPT-OSS, Mamba-3, PaliGemma, DCGAN implemented to the paper
Novel hybrids — FusionLLM (MLA + GDN + MoE + MTP), FaceAgingCycleGAN (AdaIN-conditioned CycleGAN), GPT-OSS-Lite (sliding/full alt + learned sinks + YaRN)
Production hygiene — atomic checkpoints (.tmp.pt → os.rename), full RNG-state reproducibility, W&B / Comet tracking, CI lint + tests
Data pipelines — resumable download → filter → tokenize → shard → streaming loader, with dedup and document packing; universal 8.0B-token shared pipeline across all LLM projects
Post-training & inference — speculative decoding (MTP-as-draft), Min-SNR loss weighting, EMA, classifier-free guidance
Hardware breadth — MPS / CPU → Kaggle T4 / P100 → A100 80GB → 2× RTX 5090 → RTX 6000 Ada
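
The atomic-checkpoint pattern mentioned under Production hygiene can be sketched as follows. This is a stdlib-only illustration: it uses `pickle` in place of `torch.save`, and `os.replace` in place of `os.rename` (the two behave the same on POSIX; `os.replace` also overwrites an existing file on Windows). The helper name `save_checkpoint_atomic` is hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomic(state, path):
    """Write state to a sibling temp file, then swap it into place.

    A crash mid-write leaves only the temp file behind, so the
    previous checkpoint at `path` is never left half-written.
    """
    root, ext = os.path.splitext(path)
    tmp_path = f"{root}.tmp{ext}"          # e.g. ckpt.pt -> ckpt.tmp.pt
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())               # make sure the bytes are on disk
    os.replace(tmp_path, path)             # atomic rename on the same filesystem
    return tmp_path


# Usage with a throwaway directory and a hypothetical state dict.
with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.pt")
    tmp = save_checkpoint_atomic({"step": 1200, "loss": 0.0947}, ckpt)
    with open(ckpt, "rb") as f:
        restored = pickle.load(f)
    tmp_left_behind = os.path.exists(tmp)
```

The temp file must live on the same filesystem as the target, which is why it is created next to the final path rather than in a system temp directory.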

🎓 Background

B.Tech, 2024 · Heritage Institute of Technology, Kolkata. Self-taught in deep learning through two years of from-scratch implementation. Engineering discipline from infrastructure and constraint work carries over directly to memory budgets, distributed training, and reproducible ML systems.

📫 Connect

Last updated 2026-06-29 · 14 projects · Open to remote and on-site roles
