atandra2000/README.md

Atandra Bharati

Deep Learning Research Engineer — building frontier AI architectures from scratch in raw PyTorch.

LLMs · Latent Diffusion · Multimodal · Video Understanding · Agentic ML · State-Space Models · Long-Context Attention

14 from-scratch projects · 78% memory optimization · 878-test agentic platform · 860M-param UNet trained from random init


🎯 Open To

Deep Learning Research Engineer · LLM Engineer · GenAI / Diffusion Engineer · Agentic ML Engineer

Remote-friendly · Available worldwide


🧭 Now

Shipping the Autonomous ML Research Engineer platform (15 phases, 23 agents) and just released two new long-context / state-space reproductions: GPT-OSS-Lite (sliding/full attention alternation + learned sinks, 2× KV-cache cut at 128K) and Mamba-3-Lite (complex-valued SSD + MIMO mixing, zero causal conv).


🛠️ Stack

Languages & ML core
Python PyTorch CUDA

Architectures
Transformers · GQA · MLA · Sliding/Full Attention Alternation · Learned Attention Sinks · YaRN RoPE · SwiGLU · RMSNorm · MoE · Gated Delta Net · MTP · SSD (real & complex64) · MIMO head mixing · Diffusion UNet · VAE · GAN · CycleGAN · AdaIN · ST-GCN · HRNet · SigLIP

Optimization & numerics
BF16 · FP16 · FP8 · Flash Attention 2 · SDPA · torch.compile · channels_last · Gradient checkpointing · μP scaling · WSD LR · NorMuon · CautiousAdamW · Chunked cross-entropy · Disk-backed token caching · Fused optimizers · Chinchilla-optimal scaling

Hardware validated
A100 80GB · RTX 5090 (Blackwell) · RTX 6000 Ada · RTX 3090 · P100 · 2× T4

Tooling
HuggingFace · diffusers · tiktoken · W&B · Comet · safetensors · ONNX · TensorRT · FastAPI · pydantic v2 · ChromaDB · Ollama Cloud
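
"WSD LR" under Optimization & numerics refers to a warmup-stable-decay learning-rate schedule. The sketch below is one common form of it, assuming linear warmup and linear decay. The function name `wsd_lr`, its parameters, and the decay shape are illustrative assumptions, not taken from any of the repositories.

```python
def wsd_lr(step, *, peak_lr, total_steps, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule.

    Illustrative only: linear warmup and linear decay are assumptions;
    other decay shapes (cosine, 1-sqrt) are also used in practice.
    """
    if decay_steps < 1 or warmup_steps < 0:
        raise ValueError("need decay_steps >= 1 and warmup_steps >= 0")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases overlap")
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: go linearly from peak_lr down to min_lr at total_steps.
    progress = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 9, 50, 90, 100):
        lr = wsd_lr(s, peak_lr=1.0, total_steps=100, warmup_steps=10, decay_steps=20)
        print(s, lr)
```

Because the stable phase has no fixed end baked into its shape, a run can be extended by pushing `total_steps` out and replaying only the decay phase from a stable-phase checkpoint.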


🏆 Highlights

  • 78% peak memory reduction (92 GB → 20 GB) for LLM pretraining via gradient checkpointing, chunked cross-entropy, and disk-backed token caching — enabling 2× batch-size headroom on a single A100 80GB.
  • Training loss 0.0947 at epoch 16 on Stable Diffusion 1.x (860M UNet) trained from random init across a 7-phase curriculum on 2× RTX 5090; epoch-42 checkpoint released on HuggingFace.
  • 2× KV-cache reduction at 128K context in GPT-OSS-Lite via sliding-window(128) / full-attention alternation with learned attention-sink bias and YaRN RoPE — verified at 1.13 GB vs 2.25 GB pure GQA (BF16).
  • Complex64 SSD with 50% smaller state (N=64 vs Mamba-2's N=128) achieves parity loss on the same 8.0B-token Chinchilla run, plus MIMO inter-head mixing and zero causal convolution — pure PyTorch, no custom CUDA.
  • ~30 FPS inference on RTX 3090 for skeleton-based action recognition, served via ONNX + TensorRT + FastAPI.
  • 878 passing tests · 15 cooperating phases · 23 agents · 61 tools · 186 models in the Autonomous ML Research Engineer platform — full paper-to-conclusions loop with self-repair and provider-agnostic LLM routing.
  • 415.6M active / 868.6M stored params in FusionLLM — a novel hybrid of MLA + Gated Delta Net + MoE + MTP in a 24-layer decoder.
  • 643-line technical deep-dive on MLA (Multi-Head Latent Attention) covering KV-cache math, low-rank compression, the absorption-trick derivation, and decoupled RoPE mechanics.
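
The 2× KV-cache figure for GPT-OSS-Lite follows from simple arithmetic: a sliding-window layer caches at most 128 tokens, so when half the layers use it, roughly half the cache disappears at long context. The sketch below works this through. The model shape (18 layers, 4 KV heads, head dimension 64) is a hypothetical choice that happens to land near the quoted totals, not the project's stated configuration. It also assumes strict one-to-one layer alternation, "GB" meaning 2^30 bytes, and that learned sinks add no cached tokens.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """KV-cache size in bytes for one sequence.

    window=None  -> every layer uses full attention.
    window=int   -> half the layers (rounded down) use a sliding window of
                    that size; the rest use full attention (assumed 1:1
                    alternation).
    """
    # Keys and values: 2 tensors of n_kv_heads * head_dim elements per token.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    if window is None:
        cached_tokens = n_layers * seq_len
    else:
        n_sliding = n_layers // 2
        n_full = n_layers - n_sliding
        # A sliding layer never keeps more than `window` tokens.
        cached_tokens = n_full * seq_len + n_sliding * min(seq_len, window)
    return cached_tokens * per_token_per_layer


if __name__ == "__main__":
    # Hypothetical shape, BF16 (2 bytes per element), 128K context.
    shape = dict(seq_len=128 * 1024, n_layers=18, n_kv_heads=4, head_dim=64)
    full = kv_cache_bytes(**shape)
    alt = kv_cache_bytes(**shape, window=128)
    gib = 2 ** 30
    print(f"full attention: {full / gib:.2f} GiB")   # 2.25 GiB
    print(f"alternating:    {alt / gib:.2f} GiB")    # 1.13 GiB
    print(f"reduction:      {full / alt:.3f}x")      # 1.998x
```

The ratio approaches 2× only when the context is much longer than the window; at short context both layer types cache the same number of tokens and the saving vanishes.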

📂 Projects

| Domain | Project | Highlight | Hardware |
|---|---|---|---|
| LLM | GPT-OSS-Lite (502M / 247M active) | Sliding(128)/Full attention alt · learned sink bias · YaRN 128K · top-2-of-8 MoE · 2× KV-cache cut at 128K · 130 tests | A100 80GB |
| LLM | Mamba-3-Lite (404M) | Complex64 SSD (N=64) · MIMO head mixing · zero causal conv · pure PyTorch (no mamba-ssm, no custom CUDA) | A100 80GB |
| LLM | DeepSeek-v3-Lite (422M) | MLA + AuxLossFreeGate MoE + MTP, end-to-end with absorption-trick inference | A100 80GB |
| LLM | LLaMA-3-Lite (515M) | GQA · RoPE θ=500K · SwiGLU · RMSNorm · FA2 · chunked CE · 78% memory cut | A100 80GB |
| LLM | FusionLLM (415.6M active / 868.6M stored) | Novel MLA + Gated Delta Net + MoE + MTP hybrid · NorMuon + CautiousAdamW · WSD | A100 80GB |
| LLM | GPT-From-Scratch (~6M) | Educational GPT-style decoder · 4 layers · char-level tokenizer · loss 8.69 → 0.83 · HF weight loading | P100 / CUDA / MPS |
| LLM | TranslationLM (EN→IT) | Encoder–decoder Transformer · loss 6.17 → 2.28 · BLEU/CER/WER | P100 |
| Vision | Stable Diffusion 1.x (860M UNet) | Custom UNet trained from random init · 7 phases · 1.3M+ images · best loss 0.0947 · epoch-42 checkpoint on HF | 2× RTX 5090 |
| Vision | ActionRecognition (120 cls) | HRNet pose + Two-Stream CTR-GCN · ~30 FPS · ONNX + TensorRT | RTX 3090 |
| Vision | FaceAgingCycleGAN (256²) | Per-layer AdaIN conditioning · 3-scale PatchGAN · LSGAN + R1 GP | RTX 6000 Ada |
| Vision | FaceGenerationVAE (β-VAE) | 50 epochs · recon MSE 0.0152 · linear KL annealing · bilinear-upsample decoder | P100 |
| Vision | DCGAN-Face-Generation | 50 epochs · 202K CelebA · D loss → ln 2 ≈ 0.693 equilibrium | 2× T4 |
| Multimodal | VisionLangModel (PaliGemma-style) | SigLIP ViT + Gemma decoder + linear projector · zero pretrained weights | P100 |
| Agentic | Autonomous ML Research Engineer | 15-phase multi-agent platform · paper → plan → patch → train → evaluate → report · provider-agnostic LLM routing | Local + Ollama Cloud |

✍️ Writing


🔬 Engineering Themes

  • From-scratch PyTorch — no Trainer, no Lightning, no accelerate; every layer written by hand
  • Single-GPU feasibility — BF16, gradient checkpointing, FA2, channels_last, fused optimizers
  • Faithful reproductions — DeepSeek-V3, LLaMA-3, GPT-OSS, Mamba-3, PaliGemma, DCGAN implemented to the paper
  • Novel hybrids — FusionLLM (MLA + GDN + MoE + MTP), FaceAgingCycleGAN (AdaIN-conditioned CycleGAN), GPT-OSS-Lite (sliding/full alt + learned sinks + YaRN)
  • Production hygiene — atomic checkpoints (write to .tmp, then os.rename to .pt), full RNG-state reproducibility, W&B / Comet tracking, CI lint + tests
  • Data pipelines — resumable download → filter → tokenize → shard → streaming loader, with dedup and document packing; universal 8.0B-token shared pipeline across all LLM projects
  • Post-training & inference — speculative decoding (MTP-as-draft), Min-SNR loss weighting, EMA, classifier-free guidance
  • Hardware breadth — MPS / CPU → Kaggle T4 / P100 → A100 80GB → 2× RTX 5090 → RTX 6000 Ada
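
The atomic-checkpoint item above can be sketched as a write-then-rename helper. This is a minimal illustration, not the code from the repositories: `atomic_save` and its `save_fn` parameter are hypothetical names, `pickle` stands in for the serializer so the example runs without PyTorch, and `os.replace` is used in place of `os.rename` because it also overwrites an existing target on Windows.

```python
import os
import pickle


def atomic_save(obj, path, save_fn=None):
    """Write `obj` to `path` so a reader never observes a half-written file.

    save_fn(obj, file) does the serialization; it defaults to pickle.dump
    (an assumption for this sketch). With PyTorch, torch.save could be
    passed instead, since it accepts an open binary file object.
    """
    if save_fn is None:
        save_fn = pickle.dump
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "wb") as f:
        save_fn(obj, f)
        # Push the bytes to disk before the rename makes them visible.
        f.flush()
        os.fsync(f.fileno())
    # Atomic when tmp_path and path are on the same filesystem: `path`
    # holds either the old checkpoint or the complete new one.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "step_1000.pt")
        atomic_save({"step": 1000, "loss": 0.5}, ckpt)
        with open(ckpt, "rb") as f:
            print(pickle.load(f))
```

If the process dies during the write, only the `.tmp` file is damaged and the previous checkpoint at `path` stays loadable.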

🎓 Background

B.Tech, 2024 · Heritage Institute of Technology, Kolkata. Self-taught in deep learning through two years of from-scratch implementation. Engineering discipline from infrastructure and constraint work carries over directly to memory budgets, distributed training, and reproducible ML systems.


📫 Connect

Portfolio LinkedIn GitHub W&B Kaggle Comet Email


Last updated 2026-06-29 · 14 projects · Open to remote and on-site roles

Pinned

  1. StableDiffusion (Public)

    A Stable Diffusion 1.x-class latent diffusion model trained from scratch on 2× RTX 5090 (Blackwell) GPUs. Full UNet (~860M params), DDPM/DDIM, LAION pipeline, DDP+BF16.

    Python

  2. DeepSeek-v3-Lite (Public)

    Faithful from-scratch reimplementation of DeepSeek-V3 (MLA + MoE + MTP), scaled for Chinchilla-optimal 422M training on a single A100 80GB

    Python 1