fix: infer surviving dims in to_dataset for aggregations (closes #189)#218
fix: infer surviving dims in to_dataset for aggregations (closes #189)#218ghostiee-11 wants to merge 2 commits into
Conversation
…ystems#189) to_dataset() inferred dims only when a registered Dataset's entire dim set appeared in the result, so a GROUP BY that aggregates dims away (e.g. SELECT "time", AVG("air") ... GROUP BY "time" over a time/lat/lon grid) raised "dims cannot be inferred" instead of using the surviving group-by dim. _infer_dimension_columns now uses the registered dims that survive into the result columns, in the data-variable axis order, so such aggregations round-trip on the remaining dim(s). Several registered Datasets that imply different surviving dims still raise; a global aggregation with no surviving dim raises a clearer message. Repurposes the now-inferable test to a global-aggregation case and adds an auto-inference regression test.
alxmrs
left a comment
There was a problem hiding this comment.
Some notes, overall it's looking great. Happy to have this feature.
|
|
||
|
|
||
| def test_aggregation_infers_dims(air_dataset_small): | ||
| """#189: to_dataset() infers the surviving GROUP BY dims without dims=.""" |
There was a problem hiding this comment.
Please add a Claude.md or Agents.md file to the project in this PR, and add a summary of common points of feedback I've given to you in our reviews.
For example, we shouldn't include GH issue numbers in docstrings (documentation). These should be self contained and not rely on access to the issue tracker.
| # GROUP BY lat, lon: time is aggregated away, lat/lon survive. | ||
| out = ctx.sql( | ||
| "SELECT lat, lon, AVG(air) AS air_avg FROM air GROUP BY lat, lon" | ||
| ).to_dataset() |
|
|
||
| # GROUP BY lat, lon: time is aggregated away, lat/lon survive. | ||
| out = ctx.sql( | ||
| "SELECT lat, lon, AVG(air) AS air_avg FROM air GROUP BY lat, lon" |
There was a problem hiding this comment.
Let's order by lat desc so we can get rid of a sort in the assert.
| out.sortby(["lat", "lon"])["air_avg"].values, expected | ||
| ) | ||
|
|
||
| # The reporter's exact case: GROUP BY the time coordinate. |
There was a problem hiding this comment.
Rm this comment or rephrase it to not refer to the chat conversation
| ctx.sql( | ||
| "SELECT lat, lon, AVG(air) AS air_avg FROM air GROUP BY lat, lon" | ||
| ).to_dataset() | ||
| ctx.sql("SELECT AVG(air) AS air_avg FROM air").to_dataset() |
There was a problem hiding this comment.
Will you also fix all the other tests that have specified dims when they now no longer need them?
Drop now-redundant explicit dims= from single-registration aggregation tests (inference resolves them), make the inference test self-contained and deterministic via ORDER BY, and remove issue-number and conversation references from test docstrings and comments. Add AGENTS.md summarizing the maintainer's review conventions: self-contained docs, private helpers, public-contract tests, ORDER BY in tests, and conventional commits.
to_dataset()inferred dims only when a registered Dataset's entire dim set was present in the result, so aGROUP BYthat aggregates dims away (the reporter'sSELECT "time", AVG("air") ... GROUP BY "time") raiseddims cannot be inferredinstead of using the surviving group-by dim. Inference now uses the registered dims that survive into the result columns (in data-variable axis order), so such aggregations round-trip on the remaining dim(s). Closes #189.Also: when several registered Datasets imply the same surviving dims, inference returns them rather than raising "ambiguous" (only genuinely differing dims still raise).
Before / after
Before (main),
.to_dataset()on the group-by result:After (this PR):
Full non-integration test suite passes locally.