Dictionary-encode coordinate columns by alxmrs · Pull Request #217 · xqlsystems/xarray-sql

alxmrs · 2026-07-02T16:27:06Z

Encode wide coordinate columns as Arrow dictionaries so the query engine moves fewer bytes for them and groups/joins on integer keys instead of rehashing repeated values.

Prior art

This follows @jayendra13's dictionary-encoding idea in zarr-datafusion.

Why

A dense grid repeats every coordinate value across the whole partition — a chunk of shape (time, lat, lon) carries each latitude time × lon times. The reader materialized coordinate columns as dense, fully-repeated Arrow arrays (iter_record_batches), so every GROUP BY / JOIN on a coordinate re-hashed a hugely redundant column and the pivot moved more bytes for coordinates than for the data itself.

This is the pivot/round-trip tax that the recent exact-statistics work (#201) couldn't touch — stats fixed plan choice; this attacks the actual data volume.

What

df.py — _parse_schema dictionary-encodes a dimension coordinate when it is safe and worthwhile:
- The key type is int32 (int64 backstop for astronomically large axes). A narrower key (int8/int16) can overflow: DataFusion concatenates the per-batch coordinate dictionaries across a streaming aggregate and does not always unify them, so an unchunked coordinate repeated across N partitions can reach card × N and blow past a narrow key.
- A coordinate is encoded only when the int32 key is strictly narrower than the value type — 8-byte float64/int64/timestamp (and variable-width strings). 4-byte float32/int32 coordinates stay dense (a dictionary there would be pure overhead, and a narrower key is the unsafe one). Value type and cftime metadata are preserved.
- iter_record_batches / dataset_to_record_batch emit DictionaryArrays for the encoded columns; the strided per-row indices they already compute are the dictionary indices, so no dense array of repeated values is built.
src/lib.rs — keep partition pruning and statistics correct on the encoding. DataFusion coerces a coordinate filter to either a Dictionary literal (timestamps) or Cast(col AS value_type) (numerics), so compare_to_scalar unwraps ScalarValue::Dictionary, the pruning matchers strip_cast the lossless decode cast, and bound_to_scalar unwraps DataType::Dictionary. total_byte_size is reported Inexact when any column is dictionary-encoded (it sizes those by the value type — a safe upper bound — not the narrower index).
tests — test_dict_coords.py pins which coordinates get encoded, a group-by round-trip, and the overflow regression (float32 GROUP BY lat, lon over 100 partitions). cftime schema assertions, memory bounds, and the stats byte-size expectation updated.

Measured

Coordinate-column bytes on an ERA5-shaped chunk (time=6, lat=721, lon=1440, ~6.2M rows):

grid	encoded columns	coord bytes vs dense
realistic (float32 lat/lon, timestamp time)	time only	1.33×
float64 coordinates	time, lat, lon	2.0×

The win is on the 8-byte coordinate columns (int32 key = half the width). float32 grids benefit only on their timestamp column; that's the safe ceiling — the larger multiples from narrow (int8/int16) keys were the part that overflowed under streaming aggregation.

Correctness / risk

The make-or-break risks were verified: partition pruning (a time filter still prunes to a single partition), the round-trip back to xarray (values decode via to_numpy), and the streaming-aggregate overflow reported by @ghostiee-11 (float32 GROUP BY lat, lon) — now regression-tested and network-free so CI covers it. Full suite green (186) plus cargo fmt/clippy/test and ruff/mypy.

Possible follow-up: recover a larger win for chunked dimensions safely — there the combined dictionary is bounded by the full cardinality, so a narrow key can't overflow.

🤖 Generated with Claude Code

https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

alxmrs · 2026-07-02T16:54:05Z

This PR follows @jayendra13's idea in zarr-datafusion.

alxmrs · 2026-07-02T16:54:55Z

Thanks — added a "Prior art" note to the description crediting @jayendra13's dictionary-encoding idea in zarr-datafusion. Happy to point at a specific file/commit there if you'd like a more precise reference.

Generated by Claude Code

alxmrs

LGTM so far. Will have an agent review this too.

jayendra13 · 2026-07-02T17:04:26Z

https://github.com/jayendra13/zarr-datafusion/blob/main/docs/design-decisions-code-map.md#4-adaptive-dictionary-key-width-int16--int32--int64

Additional optimisation if you need it.

alxmrs · 2026-07-02T17:07:32Z

Thanks @jayendra13 — good call. The PR already picked the index width adaptively per coordinate, but it capped at int32, which (as your doc notes) would silently overflow the index for a >2.1B-cardinality axis. Added the int64 fallback in 051ffac and tightened the thresholds to the exact MAX + 1 boundaries (indices run 0..n-1, so a signed max M holds n = M+1): int8 → int16 → int32 → int64. One small difference from your rule — we also drop to int8 for tiny axes (e.g. a 6-step time chunk → 1 byte/row). Covered by a boundary unit test.

Generated by Claude Code

alxmrs · 2026-07-02T17:16:28Z

Hey @ghostiee-11 (and Jayendra if available) may I have your review on this PR?

ghostiee-11

Pulled the branch and ran it against datafusion 54. Mostly good, but test_streaming_aggregation_does_not_explode fails for me every time:

DataFusion error: Arrow error: Dictionary key bigger than the key type

Boiled it down to float32 coordinates. This crashes, flip the coords to float64 and it passes:

import numpy as np, xarray as xr
from xarray_sql import XarrayContext

ds = xr.Dataset(
    {"air": (("time", "lat", "lon"), np.random.rand(2920, 25, 53).astype("float32"))},
    coords={
        "time": np.arange(2920),
        "lat": np.linspace(-90, 90, 25).astype("float32"),
        "lon": np.linspace(0, 359, 53).astype("float32"),
    },
)
ctx = XarrayContext()
ctx.from_dataset("air", ds, chunks={"time": 24})
ctx.sql("SELECT lat, lon, AVG(air) AS m FROM air GROUP BY lat, lon").to_pandas()

One partition is fine, ~120 blows up, so it's float32 keys plus enough batches. _coord_index_type picks the key width from one dimension's size (int8 for lat=25), which is right for a single read, but the grouped aggregate builds a Dictionary(_, Float32) across batches and the index runs past int8 instead of unifying. jayendra13's doc only covers single-read width too, so the prior art doesn't help here.

Real grids are almost always float32 (air_temperature, ERA5) and GROUP BY lat, lon is the main thing people do, so this one bites. The test that catches it pulls air_temperature over the network, so I'm guessing CI skipped it and that's why it looked green. What datafusion/arrow versions did you run against?

Smaller thing: with show_statistics on, the scan reports Bytes=Exact(40000) but the encoded coords are int8 indices, so the real size is ~8x smaller. Fine as an upper bound, just not "exact" anymore after #201.

Rest holds up: pruning still cuts WHERE time = 5 from 2000 rows to 200, min/max decode through the dictionary, round-trip is exact, and the int8/16/32/64 boundaries are right.

alxmrs · 2026-07-02T20:43:38Z

Thanks @ghostiee-11 — this was a real bug and a great catch. Fixed in 6f57863.

Root cause (exactly your diagnosis): a narrow key can't survive DataFusion concatenating the per-batch coordinate dictionaries across a streaming aggregate. arrow only sometimes unifies (it's a size heuristic in should_merge_dictionary_values, not a guarantee), so an unchunked coordinate repeated across N partitions can reach card × N and overflow the key. It doesn't reproduce on the pinned arrow-58 here (the heuristic happens to merge), which is why CI looked green — but a heuristic isn't a contract, so I treated it as real regardless. (For the record: datafusion==54.0.0, arrow 58.)

Fix — make it overflow-proof and only encode where it actually pays:

_coord_index_type floors the key at int32 (covers ~2.1B combined entries — any realistic grid; int64 backstops the rest). int8/int16 are gone.
_as_dictionary_field encodes a coordinate only when the int32 key is strictly narrower than the value: 8-byte float64/int64/timestamp (and variable-width strings) get encoded; 4-byte float32/int32 stay dense. That's your crashing case — float32 lat/lon are no longer dictionary-encoded, so there's no coordinate dictionary to concatenate.
Regression test added: GROUP BY lat, lon on float32 over 100 partitions matches xarray (network-free, so CI actually covers it now).

Honest consequence on the headline: the big multiples came from the narrow keys, which were the unsafe part. Safe encoding is 2× on 8-byte coordinate columns and ~1.33× on a realistic float32 ERA5 grid (only the timestamp column is 8-byte; float32 lat/lon stay dense). Smaller, but correct. A follow-up could recover more safely for chunked dimensions (where the combined dictionary is bounded by the full cardinality, so a narrow key is safe) — happy to explore that separately.

On the stats point — good call, fixed too: total_byte_size now reports Bytes=Inexact(...) whenever a column is dictionary-encoded, since it sizes those by the value type (a safe upper bound) rather than the narrower index. And thanks for confirming pruning / min-max decode / round-trip / boundaries still hold.

Generated by Claude Code

A dense grid repeats every coordinate value across the whole partition (a chunk of shape (time, lat, lon) carries each latitude time×lon times). The reader materialized coordinate columns as dense, fully-repeated Arrow arrays, so every GROUP BY / JOIN on a coordinate re-hashed a hugely redundant column and the pivot moved far more bytes than the data itself. Encode coordinate columns as Arrow dictionaries: the distinct coordinate values are the dictionary and the strided per-row indices we already compute are the dictionary indices — no broadcast of repeated values. The index type is sized to the dimension length (int8/int16/int32), so a 6-step time chunk uses 1 byte/row and a 721/1440-point lat/lon uses 2. On an ERA5-shaped chunk the coordinate columns shrink ~4.8x (and equality GROUP BY / JOIN keys become small integers). - df.py: _parse_schema declares dimension coordinates as dictionary(index, value) with the value type/metadata preserved; iter_record_batches and dataset_to_record_batch emit DictionaryArrays. - src/lib.rs: keep partition pruning and exact statistics working on the new encoding — DataFusion coerces a coordinate filter to either a Dictionary literal (timestamp) or Cast(col AS value_type) (float), so compare_to_scalar unwraps Dictionary scalars, the pruning matchers strip decode casts, and bound_to_scalar / total_byte_size unwrap the dictionary value type. - tests: pin the encoding contract and the group-by round-trip; update schema and memory-characterization expectations to the (smaller) encoded columns. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

Cap coordinate dictionary indices at the correct width for any cardinality: the selection now tiers int8 → int16 → int32 → int64 at exact `MAX + 1` boundaries (indices run 0..n-1, so a signed max M holds n = M+1). The int64 fallback keeps astronomically large coordinate axes representable instead of silently overflowing a 32-bit index. Follows the adaptive-key-width note in jayendra13's zarr-datafusion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

@ghostiee-11

A narrow dictionary key (int8/int16) can overflow under DataFusion streaming aggregation: it concatenates the per-batch coordinate dictionaries across the aggregate and does not always unify them (arrow merges dictionary values on a size heuristic, not a guarantee), so the combined index for an unchunked coordinate repeated across N partitions can reach card × N and blow past the key type — "Dictionary key bigger than the key type" (reported by @ghostiee-11 on a float32 GROUP BY lat, lon). Make the encoding overflow-proof and only apply it where it pays off: - `_coord_index_type` floors the key at int32 (~2.1B combined entries covers any realistic grid; int64 backstops the rest). int8/int16 are gone. - `_as_dictionary_field` encodes a coordinate only when the int32 key is strictly narrower than the value type: 8-byte float64/int64/timestamp coordinates (a safe 2x) and variable-width strings, leaving 4-byte float32/int32 coordinates dense (where a dictionary is pure overhead and the only way to win would be an overflow-prone narrow key). - `total_byte_size` reports Inexact when any column is dictionary-encoded, since it sizes those by the value type (a safe upper bound) not the narrower index — honest now that the index is smaller (addresses the "not exact" note). Regression test: float32 GROUP BY lat, lon over 100 partitions matches xarray and no longer overflows. Stats byte-size expectation updated to Inexact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

Coordinate columns are dictionary-encoded internally; DataFusion surfaces those as pandas Categorical, which sorts by category order and trips dtype checks. Decode them back to their value dtype at the to_pandas boundary so callers see the same plain columns as before the encoding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

alxmrs commented Jul 2, 2026

View reviewed changes

ghostiee-11 reviewed Jul 2, 2026

View reviewed changes

claude added 4 commits July 2, 2026 20:49

alxmrs force-pushed the claude/dict-encode-coords-fs1bqv branch 3 times, most recently from 6f57863 to 44087ed Compare July 2, 2026 20:58

alxmrs marked this pull request as draft July 2, 2026 21:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Dictionary-encode coordinate columns#217

Dictionary-encode coordinate columns#217
alxmrs wants to merge 4 commits into
mainfrom
claude/dict-encode-coords-fs1bqv

alxmrs commented Jul 2, 2026 •

edited

Loading

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

alxmrs left a comment

Uh oh!

jayendra13 commented Jul 2, 2026

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

ghostiee-11 left a comment

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

alxmrs commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Prior art

Why

What

Measured

Correctness / risk

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

alxmrs left a comment

Choose a reason for hiding this comment

Uh oh!

jayendra13 commented Jul 2, 2026

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

ghostiee-11 left a comment

Choose a reason for hiding this comment

Uh oh!

alxmrs commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alxmrs commented Jul 2, 2026 •

edited

Loading