Integer grid-position columns for exact grid joins #219

Open: alxmrs wants to merge 1 commit into main from claude/grid-index-columns-fs1bqv

@alxmrs (Collaborator) commented Jul 3, 2026

Add opt-in int32 <dim>_idx columns so grid joins (regridding, forecast alignment) can key on exact integer positions instead of floating-point coordinate values.

Why

Regridding is a sparse matmul expressed as a join: JOIN the source grid TO the weight table ON the source coordinate. Keying that join on the float coordinate value is fragile: any drift in the low-order bits (e.g. a reproject/interp UDF that computes in float32, as in cases 07/09) makes the equality join silently drop rows and return a wrong answer with no error.

A prototype at realistic scale (900×900 source, 3.24M weight rows) makes this concrete:

=== correctness when weight coords pass through float32 (a reproject UDF) ===
  float-key join :    1,077 / 810,000 dst cells matched
  int-key  join  :  810,000 / 810,000 dst cells matched
  -> float join silently DROPPED 99.87% of cells; int join is exact

Speed is a secondary benefit: the int-key join is only ~1.1× faster (the GROUP BY/SUM dominates the runtime), and the keys are half the size (int32 vs float64).
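The failure mode can be reproduced in miniature without any of the project's code. The sketch below is only an illustration under stated assumptions: the 900-point latitude axis and the `through_float32` helper are hypothetical stand-ins for a UDF that computes in float32, not the prototype that produced the numbers above.

```python
import struct


def through_float32(x: float) -> float:
    """Round-trip a Python float (float64) through float32, as a float32 UDF would."""
    return struct.unpack("f", struct.pack("f", x))[0]


n = 900
# Hypothetical evenly spaced latitude axis from -90 to 90 degrees.
lats = [-90.0 + 180.0 * i / (n - 1) for i in range(n)]
drifted = [through_float32(v) for v in lats]

# Float-equality "join": a coordinate matches only if every bit survived.
float_matches = sum(a == b for a, b in zip(lats, drifted))

# Integer-position "join": positions are unaffected by the float32 round trip.
index_matches = sum(i == j for i, j in zip(range(n), range(n)))

print(f"float-key matches: {float_matches} / {n}")
print(f"int-key matches:   {index_matches} / {n}")
```

Only coordinates that happen to be exactly representable in float32 survive the equality comparison, while every integer position still matches.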

What

from_dataset(..., index_columns=True) (and read_xarray_table(..., index_columns=True)) emit, for every dimension, an int32 <dim>_idx column carrying each row's absolute integer position on that axis:

ctx.from_dataset("src", src, chunks={"time": 24}, index_columns=True)
# join the weight table on integer grid position, not float coords:
ctx.sql('''
  SELECT w.dst_id, SUM(s.value * w.weight) AS out
  FROM weights w JOIN src s
    ON s.lat_idx = w.src_lat_idx AND s.lon_idx = w.src_lon_idx
  GROUP BY w.dst_id
''')
  • Plain Int32 columns, not dictionary-encoded, so none of DataFusion's join/aggregate/scalar-function paths are stressed. This is the safe alternative to the shelved dictionary-encoding approach in #217 (Dictionary-encode coordinate columns).
  • Global, not per-partition: the reader adds each block's start offset, so the index lines up across chunks. A local index would restart at 0 in every partition and mis-join — this is the key correctness property, and it's tested directly.
  • Coordinate columns stay dense and available for value predicates (WHERE lat > 45, date_part) and display. Index columns are opt-in, off by default, so nothing changes for existing users.
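The global-versus-local distinction above can be sketched in a few lines. This is a minimal illustration, not the reader's actual code: `block_offsets` and `global_indices` are hypothetical names, and the chunk sizes are made up.

```python
from itertools import accumulate


def block_offsets(chunk_sizes):
    """Start offset of each block along one axis: 0, then running totals."""
    return [0, *accumulate(chunk_sizes)][:-1]


def global_indices(chunk_sizes):
    """Yield one list of absolute positions per block (hypothetical helper)."""
    for start, size in zip(block_offsets(chunk_sizes), chunk_sizes):
        yield [start + local for local in range(size)]


chunks = [24, 24, 12]  # e.g. a 60-step time axis chunked by 24
blocks = list(global_indices(chunks))

# A local index would restart at 0 in every block; the global one does not.
local_first = [0 for _ in chunks]
global_first = [block[0] for block in blocks]
print(local_first, global_first)

# Concatenating the blocks recovers one contiguous 0..N-1 index.
flat = [i for block in blocks for i in block]
```

With a local index, row 0 of every partition would carry the same key, so a join against a weight table keyed on absolute positions would match the wrong rows.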

Implementation

  • df.py: _parse_schema(index_columns=) appends the <dim>_idx fields (with a collision guard); iter_record_batches / dataset_to_record_batch emit them from the strided position plus a block offset.
  • reader.py / sql.py: thread index_columns through read_xarray_table and from_dataset, computing per-block offsets.
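One way to read "strided position plus a block offset" is sketched below. It assumes each block is flattened to rows in C (row-major) order, which the description does not state explicitly; `dim_indices` is a hypothetical helper, not a function in df.py.

```python
def dim_indices(shape, offsets):
    """Absolute per-dimension indices for each row of a C-order flattened block.

    shape   -- size of the block along each dimension
    offsets -- start offset of the block along each dimension
    """
    # Row-major strides: the last dimension varies fastest.
    strides = []
    stride = 1
    for size in reversed(shape):
        strides.append(stride)
        stride *= size
    strides.reverse()
    n_rows = stride  # total number of rows in the flattened block

    return [
        [off + (row // st) % size for row in range(n_rows)]
        for size, st, off in zip(shape, strides, offsets)
    ]


# A 2x3 (lat, lon) block that starts at lat position 10 and lon position 0.
lat_idx, lon_idx = dim_indices(shape=(2, 3), offsets=(10, 0))
print(lat_idx)
print(lon_idx)
```

The offset is added once per dimension, so the same arithmetic applies whether a block is chunked along one axis or several.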

Tests

tests/test_grid_index.py: schema/dtype, indices global across chunks, an exact index-keyed regrid matching a numpy gather, and the float32-drift case where the float-equality join drops cells but the index join stays exact. Full suite green (190), plus ruff and mypy. No Rust changes.
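The index-keyed regrid check can be mimicked at toy scale. The sketch below is not the test in tests/test_grid_index.py: it uses Python's built-in sqlite3 as a stand-in for the DataFusion context, a made-up 2×2 source grid, and a plain Python gather in place of the numpy one.

```python
import sqlite3

# Hypothetical 2x2 source grid keyed by (lat_idx, lon_idx).
src = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
# Hypothetical weight rows: (dst_id, src_lat_idx, src_lon_idx, weight).
weights = [
    (0, 0, 0, 0.5), (0, 0, 1, 0.5),
    (1, 1, 0, 0.25), (1, 1, 1, 0.75),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (lat_idx INTEGER, lon_idx INTEGER, value REAL)")
con.execute(
    "CREATE TABLE weights "
    "(dst_id INTEGER, src_lat_idx INTEGER, src_lon_idx INTEGER, weight REAL)"
)
con.executemany(
    "INSERT INTO src VALUES (?, ?, ?)",
    [(i, j, v) for (i, j), v in src.items()],
)
con.executemany("INSERT INTO weights VALUES (?, ?, ?, ?)", weights)

# Same join shape as the PR's example query, keyed on integer positions.
sql_out = dict(con.execute("""
    SELECT w.dst_id, SUM(s.value * w.weight) AS out
    FROM weights w JOIN src s
      ON s.lat_idx = w.src_lat_idx AND s.lon_idx = w.src_lon_idx
    GROUP BY w.dst_id
"""))

# Reference answer: gather each source value by position and accumulate.
expected = {}
for dst_id, i, j, w in weights:
    expected[dst_id] = expected.get(dst_id, 0.0) + src[(i, j)] * w

print(sql_out, expected)
```

Because the join keys are integers, the SQL result and the direct gather agree exactly, with no tolerance needed on the key comparison.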

🤖 Generated with Claude Code

https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N



Regridding and forecast alignment join a source grid to a table on the source
coordinate. Joining on the floating-point coordinate value is fragile: any
sub-ULP drift — e.g. a reproject/interp UDF that computes in float32 — makes
the equality join silently drop rows, producing a wrong answer with no error.
At realistic scale a float32 round-trip of the weight-table coordinates drops
~99.9% of destination cells.

Add `from_dataset(..., index_columns=True)`: for every dimension, emit an
`int32` `<dim>_idx` column carrying each row's *absolute* integer position on
that axis. Joining grids on these integer keys is exact (no float-equality
mismatch), a bit faster (integer hashing), and half the key bytes — and,
unlike dictionary-encoded coordinates, they are plain Int32 columns, so nothing
in DataFusion's join/aggregate/scalar-function paths is stressed.

The indices are global, not per-partition: the reader adds each block's start
offset so keys line up across chunks (a local index would restart at 0 in every
partition and mis-join). Coordinate columns stay dense and available for value
predicates (`WHERE lat > 45`, `date_part`) and display; the index columns are
opt-in and off by default.

- df.py: `_parse_schema(index_columns=)` appends the `<dim>_idx` fields (with a
  collision guard); `iter_record_batches` / `dataset_to_record_batch` emit them
  from the strided position plus a block offset.
- reader.py / sql.py: thread `index_columns` through `read_xarray_table` and
  `from_dataset`, computing per-block offsets so indices are global.
- tests: global-across-chunks, exact index-keyed regrid, and the float32-drift
  case where the float join drops cells but the index join stays exact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N