Demo: train an MNIST MLP classifier in SQL #196
Open
alxmrs wants to merge 5 commits into
Base automatically changed from claude/xarray-sql-era5-demo to claude/xarray-sql-autograd-73ovqq (June 30, 2026 13:31)
Stacked demo branch (on the autograd feature) holding the runnable benchmark scripts, kept out of the core branch so it stays reviewable.

* grad_era5.py: symbolic grad over real ARCO-ERA5 data (wind-speed sensitivity checked exactly; saturation vapour pressure checked against the closed-form Clausius-Clapeyron slope). The queries ORDER BY latitude DESC, longitude to match ERA5's native order, so results line up with the xarray reference with no sorting on either side (single partition, so the order survives to_dataset).
* grad_descent.py: gradient descent as ONE declarative recursive-CTE query. differentiate_sql compiles the per-row update rule to SQL once; a recursive CTE then iterates it. No Python loop. Fit matches numpy least-squares.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6
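To make the recursive-CTE idea concrete, here is a minimal sketch of gradient descent as a single recursive query. It is not grad_descent.py: it assumes Python's stdlib sqlite3 in place of DataFusion, a toy one-parameter model y = w·x, and a gradient written out by hand rather than compiled by differentiate_sql. The table and column names (data, stats, gd) are illustrative.

```python
import sqlite3

# Toy data on the line y = 3x (an assumption for illustration only).
rows = [(0.0, 0.0), (0.5, 1.5), (1.0, 3.0), (1.5, 4.5), (2.0, 6.0)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (x REAL, y REAL)")
con.executemany("INSERT INTO data VALUES (?, ?)", rows)

# Loss: AVG((w*x - y)^2). Its derivative in w is 2*(w*AVG(x*x) - AVG(x*y)),
# derived by hand here; the demo obtains its update rule from differentiate_sql.
QUERY = """
WITH RECURSIVE
  stats(sxx, sxy) AS (SELECT AVG(x * x), AVG(x * y) FROM data),
  gd(step, w) AS (
    SELECT 0, 0.0                                  -- initial guess
    UNION ALL
    SELECT gd.step + 1,
           gd.w - :lr * 2.0 * (gd.w * stats.sxx - stats.sxy)
    FROM gd, stats                                 -- gd is referenced once
    WHERE gd.step < :steps
  )
SELECT w FROM gd ORDER BY step DESC LIMIT 1
"""

(w_fit,) = con.execute(QUERY, {"lr": 0.1, "steps": 100}).fetchone()
print(w_fit)
```

The aggregates are computed once in a non-recursive CTE so that the recursive member only joins against them; this keeps the recursive relation referenced a single time per iteration.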
A one-hidden-layer MLP (196->32 tanh->10 softmax, on 2x2-pooled 14x14 MNIST) trained by gradient descent with every gradient computed in SQL. The images are registered as xarray (the library's core); the model weights and per-step intermediates are DataFusion in-memory tables (register_record_batches), so a matmul is a join over them and there's no xarray pivot per step.

Reverse-mode autodiff as relational algebra: matmul = join + GROUP BY SUM; the hidden activation's local Jacobian = grad(tanh(z), z); cotangent propagation = join; parameter gradients = join + GROUP BY AVG. The only hand-written gradient is softmax + cross-entropy's delta = softmax - onehot.

~83% test accuracy in ~20s. Adds a benchmarks README entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6
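The "matmul = join + GROUP BY SUM" identity can be sketched on a tiny example. This assumes stdlib sqlite3 rather than DataFusion, and the table layout x(sample, inp, val) and w(inp, unit, val) is an illustrative choice, not necessarily the one mnist_mlp.py uses.

```python
import sqlite3

X = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]      # 2 samples x 3 inputs
W = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]    # 3 inputs x 2 units

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x (sample INTEGER, inp INTEGER, val REAL)")
con.execute("CREATE TABLE w (inp INTEGER, unit INTEGER, val REAL)")
# Long form: one row per matrix entry, keyed by its indices.
con.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(s, i, v) for s, row in enumerate(X) for i, v in enumerate(row)],
)
con.executemany(
    "INSERT INTO w VALUES (?, ?, ?)",
    [(i, u, v) for i, row in enumerate(W) for u, v in enumerate(row)],
)

# The join pairs entries that share the contracted index (inp);
# GROUP BY + SUM then reduces over it, leaving (sample, unit).
z = con.execute("""
    SELECT x.sample, w.unit, SUM(x.val * w.val) AS val
    FROM x JOIN w ON x.inp = w.inp
    GROUP BY x.sample, w.unit
    ORDER BY x.sample, w.unit
""").fetchall()

for sample, unit, val in z:
    print(sample, unit, val)
```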
Rewrite mnist_mlp.py so the whole model and its entire training history live in a single append-only table model(step, layer, i, j, val): every parameter is a row tagged by generation, and a training step appends the next generation's rows rather than mutating anything.

Each step is a single SQL statement (forward, grad(tanh(z),z) backprop, parameter update); evaluation is SQL too (a forward pass with ROW_NUMBER() for the argmax). Python no longer holds the weights or computes any gradients; it only sequences the steps.

A 2-layer net can't be one recursive CTE (the recursive relation may be referenced only once, but W1/W2 are used several times per step), and unrolling the steps as non-recursive CTEs blows up exponentially (DataFusion inlines CTEs; no MATERIALIZED). Materialising between steps is therefore host-driven; the thin loop does exactly that.

Reaches ~83% test accuracy over 60 steps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
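The append-only generations and the ROW_NUMBER() argmax can be sketched as follows. This is an illustration under assumptions, not the benchmark's code: it uses stdlib sqlite3, a constant placeholder gradient table g (the demo recomputes gradients in SQL every step), and a hypothetical scores(sample, label, score) table standing in for the forward pass output.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE model (step INTEGER, layer INTEGER, i INTEGER, j INTEGER, val REAL)"
)
con.executemany("INSERT INTO model VALUES (0, 1, ?, ?, ?)", [(0, 0, 1.0), (0, 1, -2.0)])

# Placeholder gradient, held constant purely for illustration.
con.execute("CREATE TABLE g (layer INTEGER, i INTEGER, j INTEGER, val REAL)")
con.executemany("INSERT INTO g VALUES (1, ?, ?, ?)", [(0, 0, 0.2), (0, 1, -0.4)])

# One statement per step: read the latest generation, append the next one.
APPEND_STEP = """
INSERT INTO model
SELECT m.step + 1, m.layer, m.i, m.j, m.val - :lr * g.val
FROM model AS m
JOIN g ON g.layer = m.layer AND g.i = m.i AND g.j = m.j
WHERE m.step = (SELECT MAX(step) FROM model)
"""
for _ in range(3):  # the host loop only sequences the steps
    con.execute(APPEND_STEP, {"lr": 0.5})

history = con.execute(
    "SELECT step, i, j, val FROM model ORDER BY step, i, j"
).fetchall()

# Argmax per sample: rank classes by score and keep rank 1.
con.execute("CREATE TABLE scores (sample INTEGER, label INTEGER, score REAL)")
con.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [(0, 0, 0.1), (0, 1, 0.7), (0, 2, 0.2), (1, 0, 0.6), (1, 1, 0.3), (1, 2, 0.1)],
)
predictions = con.execute("""
    SELECT sample, label FROM (
        SELECT sample, label,
               ROW_NUMBER() OVER (PARTITION BY sample ORDER BY score DESC) AS rk
        FROM scores
    ) WHERE rk = 1
    ORDER BY sample
""").fetchall()
print(len(history), predictions)
```

Because nothing is updated in place, every earlier generation stays queryable by filtering on step.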
Make the architecture itself data. The whole model is one xr.Dataset: each
layer's weight is a data_var w{L} over its boundary dims (u{L}, u{L+1}), sharing
the dims that connect adjacent layers (the join keys). The dim sizes are the
layer widths and the number of weights is the depth, so differing neuron counts
are just differing dim sizes — no padding, because the relational long form is
naturally ragged. from_dataset splits the one Dataset into a table per weight;
changing WIDTHS trains a different network with the same code.
One generic contract()-based loop trains a net of any depth: forward contracts
each layer, backward is the same contraction transposed (VJP of a contraction is
a contraction) with grad(tanh(z), z) for the local derivative. Validated exact
against numpy at depth 3.
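The claim that the VJP of a contraction is itself a contraction can be sketched with one helper used three times. The helper below, contract_sql, is a hypothetical stand-in written for this illustration (it is not the branch's contract()), and it runs on stdlib sqlite3 rather than DataFusion. The forward pass contracts over inp; the two backward passes reuse the same query shape, contracting over unit (for the input cotangent) and over sample (for the weight gradient).

```python
import sqlite3

def load(con, name, row_key, col_key, matrix):
    """Register a dense matrix as a long-form table (row_key, col_key, val)."""
    con.execute(f"CREATE TABLE {name} ({row_key} INTEGER, {col_key} INTEGER, val REAL)")
    con.executemany(
        f"INSERT INTO {name} VALUES (?, ?, ?)",
        [(r, c, v) for r, row in enumerate(matrix) for c, v in enumerate(row)],
    )

def contract_sql(con, out, a, b, on, keep_a, keep_b):
    """Hypothetical helper: sum a.val * b.val over the shared key `on`."""
    con.execute(f"""
        CREATE TABLE {out} AS
        SELECT a.{keep_a} AS {keep_a}, b.{keep_b} AS {keep_b},
               SUM(a.val * b.val) AS val
        FROM {a} AS a JOIN {b} AS b ON a.{on} = b.{on}
        GROUP BY a.{keep_a}, b.{keep_b}
    """)

def dense(con, name, row_key, col_key, n_rows, n_cols):
    """Read a long-form table back into a nested list."""
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, v in con.execute(f"SELECT {row_key}, {col_key}, val FROM {name}"):
        out[r][c] = v
    return out

X = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]      # x(sample, inp)
W = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]    # w(inp, unit)
DZ = [[1.0, -2.0], [0.5, 3.0]]               # upstream cotangent dz(sample, unit)

con = sqlite3.connect(":memory:")
load(con, "x", "sample", "inp", X)
load(con, "w", "inp", "unit", W)
load(con, "dz", "sample", "unit", DZ)

# Forward: z = x . w, contracting over inp.
contract_sql(con, "z", "x", "w", on="inp", keep_a="sample", keep_b="unit")
# Backward w.r.t. the input: dx = dz . w^T, contracting over unit.
contract_sql(con, "dx", "dz", "w", on="unit", keep_a="sample", keep_b="inp")
# Backward w.r.t. the weight: dw = x^T . dz, contracting over sample.
contract_sql(con, "dw", "x", "dz", on="sample", keep_a="inp", keep_b="unit")

z = dense(con, "z", "sample", "unit", 2, 2)
dx = dense(con, "dx", "sample", "inp", 2, 3)
dw = dense(con, "dw", "inp", "unit", 3, 2)
print(dx)
print(dw)
```

The weight gradient here is a plain SUM over samples; a mean-reduced loss would use AVG in the same position.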
Training metrics are a relation too: each logged step appends a
(step, loss, train_acc, test_acc) row to a metrics table rather than a Python
list. The trained model, predictions, and metrics all come back out as xarray
via to_dataset. ~83% test accuracy in ~13s.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two simplifications collapse the model to a single relation:

- Bias folded into the weights (an nn.Linear): each layer's bias is the weight of a constant-1 input, kept as the row inp=width of the same weight array, so a layer is one matrix.
- A layer dimension: every layer's weight lives in one weight(layer, inp, out) array, so forward/backward filter on the layer COLUMN instead of referencing a table per layer.

The model is one xr.Dataset with a layer dim (NaN-padded for the ragged pyramid, dropped on seed); from_dataset registers it; the update is one query over the whole weight relation. A single contract() and a generic loop train a net of any depth (validated exact against numpy at depth 3). Tensors.put now unifies batch nullability so UNION results register cleanly.

Faster too (~6s vs ~13s) at the same ~83% test accuracy; model and metrics still round-trip to xarray.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
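A small sketch of the bias-folding trick, under assumptions: stdlib sqlite3 instead of DataFusion, a single layer tagged layer = 0, and the constant-1 input appended with a UNION ALL at inp = width. How the branch actually injects the constant input is not shown in this message, so treat that part as illustrative.

```python
import sqlite3

X = [[1.0, 2.0], [3.0, -1.0]]                   # 2 samples, width 2
W = [[0.5, -0.5, 1.0], [0.25, 0.75, -1.0]]      # 2 inputs x 3 outputs
B = [0.1, 0.2, 0.3]                             # bias per output
WIDTH = 2

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x (sample INTEGER, inp INTEGER, val REAL)")
con.execute('CREATE TABLE weight (layer INTEGER, inp INTEGER, "out" INTEGER, val REAL)')
con.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(s, i, v) for s, row in enumerate(X) for i, v in enumerate(row)],
)
# Ordinary weights occupy inp = 0..WIDTH-1; the bias is the extra row inp = WIDTH.
con.executemany(
    "INSERT INTO weight VALUES (0, ?, ?, ?)",
    [(i, o, v) for i, row in enumerate(W) for o, v in enumerate(row)]
    + [(WIDTH, o, b) for o, b in enumerate(B)],
)

z = con.execute("""
    WITH x1 AS (
        SELECT sample, inp, val FROM x
        UNION ALL
        SELECT DISTINCT sample, :width, 1.0 FROM x   -- constant-1 input
    )
    SELECT x1.sample, w."out", SUM(x1.val * w.val) AS val
    FROM x1 JOIN weight AS w ON w.inp = x1.inp
    WHERE w.layer = :layer                           -- filter on the layer column
    GROUP BY x1.sample, w."out"
    ORDER BY x1.sample, w."out"
""", {"width": WIDTH, "layer": 0}).fetchall()

for sample, out, val in z:
    print(sample, out, val)
```

With the bias stored as one more weight row, the affine layer is a single join and the same query shape serves every layer.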
Stacked on the ERA5/gradient-descent demo branch (#195). Adds benchmarks/mnist_mlp.py.

What this is

A one-hidden-layer MLP (196 → 32 tanh → 10 softmax, on 2×2-pooled 14×14 MNIST) trained by gradient descent where every gradient is computed in SQL over data registered as xarray. The optimisation loop is plain Python; all the math is relational.

Reverse-mode autodiff expressed as relational algebra:

- Matmul = join + GROUP BY SUM: a layer's pre-activation is SUM(input · weight) grouped by (sample, unit).
- Local derivative = grad(): the hidden activation's Jacobian is grad(tanh(z), z), the autograd feature doing the calculus per (sample, unit).
- Parameter gradients = join + GROUP BY AVG.

The only hand-written gradient is softmax + cross-entropy's delta = softmax - onehot (softmax couples classes through a per-sample normaliser, an aggregate that grad does not cross, which stays faithful to SQL).

Reaches ~83% test accuracy in ~45s; downloads MNIST on first run. PEP 723 inline deps; run with uv run benchmarks/mnist_mlp.py.

🤖 Generated with Claude Code
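As a sanity check on the two derivative facts used in the description, the sketch below compares them against central finite differences in plain Python. It is independent of the SQL pipeline and of grad(); the helper names (softmax, cross_entropy, numeric_grad) are made up for this illustration, and it relies only on the standard identities that the derivative of tanh(z) is 1 - tanh(z)^2 and that the logit gradient of softmax cross-entropy is softmax - onehot.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def cross_entropy(z, label):
    """Negative log-probability of the true class."""
    return -math.log(softmax(z)[label])

def numeric_grad(f, z, eps=1e-6):
    """Central finite differences of a scalar function of a list."""
    grads = []
    for k in range(len(z)):
        hi = z[:k] + [z[k] + eps] + z[k + 1:]
        lo = z[:k] + [z[k] - eps] + z[k + 1:]
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads

logits, label = [0.3, -1.2, 2.0], 2

# Hand-written delta: softmax minus the one-hot target.
probs = softmax(logits)
delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
delta_fd = numeric_grad(lambda z: cross_entropy(z, label), logits)

# Local derivative of the hidden activation at one pre-activation value.
z0 = 0.7
dtanh = 1.0 - math.tanh(z0) ** 2
dtanh_fd = numeric_grad(lambda z: math.tanh(z[0]), [z0])[0]

print(delta, delta_fd)
print(dtanh, dtanh_fd)
```

The delta sums to zero across classes, which reflects the per-sample normaliser coupling the classes together.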