Integer grid-position columns for exact grid joins #219
Open
alxmrs wants to merge 1 commit into
Regridding and forecast alignment join a source grid to a table on the source coordinate. Joining on the floating-point coordinate value is fragile: any sub-ULP drift (e.g. a reproject/interp UDF that computes in float32) makes the equality join silently drop rows, producing a wrong answer with no error. At realistic scale a float32 round-trip of the weight-table coordinates drops ~99.9% of destination cells.

Add `from_dataset(..., index_columns=True)`: for every dimension, emit an `int32` `<dim>_idx` column carrying each row's *absolute* integer position on that axis. Joining grids on these integer keys is exact (no float-equality mismatch), a bit faster (integer hashing), and half the key bytes. Unlike dictionary-encoded coordinates, they are plain Int32 columns, so nothing in DataFusion's join/aggregate/scalar-function paths is stressed. The indices are global, not per-partition: the reader adds each block's start offset so keys line up across chunks (a local index would restart at 0 in every partition and mis-join).

Coordinate columns stay dense and available for value predicates (`WHERE lat > 45`, `date_part`) and display; the index columns are opt-in and off by default.

- df.py: `_parse_schema(index_columns=)` appends the `<dim>_idx` fields (with a collision guard); `iter_record_batches` / `dataset_to_record_batch` emit them from the strided position plus a block offset.
- reader.py / sql.py: thread `index_columns` through `read_xarray_table` and `from_dataset`, computing per-block offsets so indices are global.
- tests: global-across-chunks, exact index-keyed regrid, and the float32-drift case where the float join drops cells but the index join stays exact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N
Add opt-in `int32` `<dim>_idx` columns so grid joins (regridding, forecast alignment) can key on exact integer positions instead of floating-point coordinate values.

Why
Regridding is a sparse matmul expressed as a JOIN from the source grid to the weight table ON the source coordinate. Keying that join on the float coordinate value is fragile: any sub-ULP drift, e.g. a reproject/interp UDF that computes in float32 (cases 07/09), makes the equality join silently drop rows and return a wrong answer with no error.

A prototype at realistic scale (900×900 source, 3.24M weight rows) makes this concrete: a float32 round-trip of the weight-table coordinates drops ~99.9% of destination cells.

Speed is a secondary benefit at ~1.1× (the GROUP BY/SUM dominates), and the keys are 2× smaller (int32 vs float64).
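The failure mode is easy to reproduce outside DataFusion. The sketch below is a minimal NumPy illustration, not the prototype behind the numbers above: the 900-point latitude axis and the variable names are assumptions, and `np.isin` stands in for an equality join.

```python
import numpy as np

# Hypothetical 1-D axis: 900 float64 latitudes, as a source grid might store them.
lat = np.linspace(-90.0, 90.0, 900)
lat_idx = np.arange(lat.size, dtype=np.int32)  # absolute position on the axis

# Simulate a weight table whose coordinate key passed through a UDF that
# computes in float32 and hands the result back as float64.
weights_lat = lat.astype(np.float32).astype(np.float64)
weights_idx = lat_idx.copy()  # integer keys are carried alongside, untouched

# Float-equality "join": a key matches only if it is bit-identical.
float_matches = int(np.isin(weights_lat, lat).sum())

# Integer-position "join": every key matches.
index_matches = int(np.isin(weights_idx, lat_idx).sum())

print(f"float-keyed matches: {float_matches} of {lat.size}")
print(f"index-keyed matches: {index_matches} of {lat.size}")
```

On this axis, only keys whose values happen to be exactly representable in float32 can survive the float-keyed match, while every integer position matches.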
What
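`from_dataset(..., index_columns=True)` (and `read_xarray_table(..., index_columns=True)`) emit, for every dimension, an `int32` `<dim>_idx` column carrying each row's absolute integer position on that axis:

- The indices are global, not per-partition: the reader adds each block's start offset so keys line up across chunks (a local index would restart at 0 in every partition and mis-join).
- They are plain `Int32` columns, not dictionary-encoded, so none of DataFusion's join/aggregate/scalar-function paths are stressed (this is the safe alternative to the shelved dictionary-encoding approach in "Dictionary-encode coordinate columns", #217).
- Coordinate columns stay dense and available for value predicates (`WHERE lat > 45`, `date_part`) and display. Index columns are opt-in, off by default, so nothing changes for existing users.

Implementation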
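The block-offset arithmetic behind the global indices can be sketched in a few lines of NumPy. This is an illustration only, not the code in this PR: `block_offsets`, `global_index`, and the chunk sizes are hypothetical names and values.

```python
import numpy as np


def block_offsets(chunk_sizes):
    """Start position of each block along one axis (exclusive cumulative sum)."""
    sizes = np.asarray(chunk_sizes, dtype=np.int64)
    return np.concatenate(([0], np.cumsum(sizes)[:-1]))


def global_index(chunk_sizes):
    """Per-block int32 index arrays: local position plus the block's start offset."""
    return [
        (start + np.arange(size)).astype(np.int32)
        for start, size in zip(block_offsets(chunk_sizes), chunk_sizes)
    ]


chunks = (4, 4, 2)  # hypothetical chunking of a 10-element axis

# A purely local index restarts at 0 in every block, so keys collide across blocks.
local = [np.arange(n, dtype=np.int32) for n in chunks]

# Adding each block's start offset yields one consistent key space for the axis.
glob = global_index(chunks)

print(np.concatenate(local).tolist())
print(np.concatenate(glob).tolist())
```

With the offsets applied, concatenating the blocks gives each position on the axis exactly once, which is what lets keys from different chunks line up in a join.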
- `df.py`: `_parse_schema(index_columns=)` appends the `<dim>_idx` fields (with a collision guard); `iter_record_batches` / `dataset_to_record_batch` emit them from the strided position plus a block offset.
- `reader.py` / `sql.py`: thread `index_columns` through `read_xarray_table` and `from_dataset`, computing per-block offsets.

Tests
`tests/test_grid_index.py`: schema/dtype, indices global across chunks, an exact index-keyed regrid matching a numpy gather, and the float32-drift case where the float-equality join drops cells but the index join stays exact. Full suite green (190), plus `ruff` and `mypy`. No Rust changes.

🤖 Generated with Claude Code
https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N