Integer grid-position columns for exact grid joins by alxmrs · Pull Request #219 · xqlsystems/xarray-sql

alxmrs · 2026-07-03T09:20:47Z

Add opt-in int32 <dim>_idx columns so grid joins (regridding, forecast alignment) can key on exact integer positions instead of floating-point coordinate values.

Why

Regridding is a sparse matmul expressed as a JOIN of the source grid to a weight table on the source coordinate. Keying that join on the float coordinate value is fragile: any drift in the value, however small (for example, a reproject/interp UDF that computes in float32, as in cases 07/09), makes the equality join silently drop rows and return a wrong answer with no error.

A prototype at realistic scale (900×900 source, 3.24M weight rows) makes this concrete:

=== correctness when weight coords pass through float32 (a reproject UDF) ===
  float-key join :    1,077 / 810,000 dst cells matched
  int-key  join  :  810,000 / 810,000 dst cells matched
  -> float join silently DROPPED 99.87% of cells; int join is exact

The speedup is secondary, about 1.1× (the GROUP BY/SUM dominates), and the keys are half the size (int32 vs float64).
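To make the failure mode concrete, here is a minimal stdlib-only sketch (not the prototype itself). The 900-point latitude axis and the `through_float32` helper are illustrative assumptions; the point is only that a float64 value that has passed through float32 storage usually no longer compares equal to the original, while an integer position is unaffected.

```python
import struct


def through_float32(x: float) -> float:
    """Round-trip a Python float (float64) through float32 storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical 1-D coordinate axis: 900 latitudes on a 0.1-degree grid.
n = 900
lat = [-45.0 + 0.1 * i for i in range(n)]

# Weight-table keys as a float32-computing UDF might return them,
# next to the integer positions of the same cells.
w_lat = [through_float32(v) for v in lat]
w_idx = list(range(n))

# Equality join on the float value vs. on the integer position.
src_by_value = set(lat)
src_by_index = set(range(n))
float_matches = sum(1 for v in w_lat if v in src_by_value)
int_matches = sum(1 for i in w_idx if i in src_by_index)

print(f"float-key matches: {float_matches} / {n}")
print(f"int-key   matches: {int_matches} / {n}")
```

Only coordinates that happen to be exactly representable in float32 survive the float-key join; every integer key matches.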

What

from_dataset(..., index_columns=True) (and read_xarray_table(..., index_columns=True)) emit, for every dimension, an int32 <dim>_idx column carrying each row's absolute integer position on that axis:

ctx.from_dataset("src", src, chunks={"time": 24}, index_columns=True)
# join the weight table on integer grid position, not float coords:
ctx.sql('''
  SELECT w.dst_id, SUM(s.value * w.weight) AS out
  FROM weights w JOIN src s
    ON s.lat_idx = w.src_lat_idx AND s.lon_idx = w.src_lon_idx
  GROUP BY w.dst_id
''')

- Plain Int32 columns, not dictionary-encoded, so none of DataFusion's join/aggregate/scalar-function paths are stressed. This is the safe alternative to the shelved dictionary-encoding approach in #217 (Dictionary-encode coordinate columns).
- Global, not per-partition: the reader adds each block's start offset, so the index lines up across chunks. A local index would restart at 0 in every partition and mis-join. This is the key correctness property, and it's tested directly.
- Coordinate columns stay dense and available for value predicates (WHERE lat > 45, date_part) and display. Index columns are opt-in and off by default, so nothing changes for existing users.
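The global-versus-local distinction can be pictured with a small sketch. The chunk sizes here are hypothetical and this is not the reader's code; it only shows that adding each block's start offset to the in-block position yields one unique key per position on the axis, whereas a local index repeats keys across blocks.

```python
from itertools import accumulate

# Hypothetical chunking of a 10-element "time" axis into blocks of 4, 4, 2.
chunk_sizes = [4, 4, 2]

# Start offset of each block = running sum of the preceding block lengths.
offsets = [0, *accumulate(chunk_sizes)][:-1]  # [0, 4, 8]

# Local index: position inside the block only.
local_idx = [list(range(size)) for size in chunk_sizes]

# Global index: block start offset plus position inside the block.
global_idx = [
    [start + i for i in range(size)]
    for start, size in zip(offsets, chunk_sizes)
]

flat_local = [i for block in local_idx for i in block]
flat_global = [i for block in global_idx for i in block]

print(flat_local)   # restarts at 0 in every block, so join keys collide
print(flat_global)  # one unique key per position on the axis
```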

Implementation

- df.py: _parse_schema(index_columns=) appends the <dim>_idx fields (with a collision guard); iter_record_batches / dataset_to_record_batch emit them from the strided position plus a block offset.
- reader.py / sql.py: thread index_columns through read_xarray_table and from_dataset, computing per-block offsets.
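As a rough illustration of "strided position plus a block offset", the sketch below derives per-dimension index columns for one block. The function name `block_index_columns` is hypothetical, and it assumes rows are laid out in C order (last dimension varying fastest); the actual code in df.py may be organized differently.

```python
def block_index_columns(shape, starts):
    """Per-dimension absolute indices for one C-ordered block.

    shape  -- block extent per dimension, e.g. (2, 3)
    starts -- the block's start offset per dimension in the full array
    Returns one list per dimension, each with prod(shape) entries.
    """
    n_rows = 1
    for extent in shape:
        n_rows *= extent

    columns = [[] for _ in shape]
    for row in range(n_rows):
        rem = row
        local = []
        # Peel off the fastest-varying (last) dimension first.
        for extent in reversed(shape):
            rem, pos = divmod(rem, extent)
            local.append(pos)
        local.reverse()
        # Shift the in-block position by the block's start offset.
        for d, pos in enumerate(local):
            columns[d].append(starts[d] + pos)
    return columns


# A 2x3 block whose corner sits at (lat=4, lon=6) in the full grid.
lat_idx, lon_idx = block_index_columns(shape=(2, 3), starts=(4, 6))
print(lat_idx)  # [4, 4, 4, 5, 5, 5]
print(lon_idx)  # [6, 7, 8, 6, 7, 8]
```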

Tests

tests/test_grid_index.py covers schema/dtype, indices being global across chunks, an exact index-keyed regrid matching a numpy gather, and the float32-drift case where the float-equality join drops cells but the index join stays exact. The full suite is green (190), as are ruff and mypy. No Rust changes.
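For readers who want the shape of the index-keyed regrid check without opening the test file, here is a toy stand-in in plain Python. The grid values, weight rows, and variable names are made up for illustration and are not taken from tests/test_grid_index.py; the loop mirrors the JOIN on integer position followed by GROUP BY dst_id / SUM(value * weight).

```python
from collections import defaultdict

# Hypothetical 2x3 source grid keyed by (lat_idx, lon_idx).
src = {(i, j): float(10 * i + j) for i in range(2) for j in range(3)}

# Hypothetical weight rows: (dst_id, src_lat_idx, src_lon_idx, weight).
weights = [
    (0, 0, 0, 0.5),
    (0, 0, 1, 0.5),
    (1, 1, 1, 0.25),
    (1, 1, 2, 0.75),
]

# JOIN on integer position, then GROUP BY dst_id with SUM(value * weight).
out = defaultdict(float)
for dst_id, i, j, w in weights:
    out[dst_id] += src[(i, j)] * w

# Reference result: gather the source cells directly and weight them.
expected = {
    0: 0.5 * src[(0, 0)] + 0.5 * src[(0, 1)],
    1: 0.25 * src[(1, 1)] + 0.75 * src[(1, 2)],
}

print(dict(out))
```

Because the keys are integers, the lookup either hits the intended cell or raises a KeyError; there is no near-miss that silently drops a row.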

🤖 Generated with Claude Code

https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N


Regridding and forecast alignment join a source grid to a table on the source coordinate. Joining on the floating-point coordinate value is fragile: any sub-ULP drift (e.g. a reproject/interp UDF that computes in float32) makes the equality join silently drop rows, producing a wrong answer with no error. At realistic scale a float32 round-trip of the weight-table coordinates drops ~99.9% of destination cells.

Add `from_dataset(..., index_columns=True)`: for every dimension, emit an `int32` `<dim>_idx` column carrying each row's *absolute* integer position on that axis. Joining grids on these integer keys is exact (no float-equality mismatch), a bit faster (integer hashing), and half the key bytes. Unlike dictionary-encoded coordinates, they are plain Int32 columns, so nothing in DataFusion's join/aggregate/scalar-function paths is stressed.

The indices are global, not per-partition: the reader adds each block's start offset so keys line up across chunks (a local index would restart at 0 in every partition and mis-join). Coordinate columns stay dense and available for value predicates (`WHERE lat > 45`, `date_part`) and display; the index columns are opt-in and off by default.

- df.py: `_parse_schema(index_columns=)` appends the `<dim>_idx` fields (with a collision guard); `iter_record_batches` / `dataset_to_record_batch` emit them from the strided position plus a block offset.
- reader.py / sql.py: thread `index_columns` through `read_xarray_table` and `from_dataset`, computing per-block offsets so indices are global.
- tests: global-across-chunks, exact index-keyed regrid, and the float32-drift case where the float join drops cells but the index join stays exact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

alxmrs wants to merge 1 commit into main from claude/grid-index-columns-fs1bqv
