Skip to content

DataFusion: aggregating a dictionary column across partitions overflows the key type ("Dictionary key bigger than the key type") #220

Description

@alxmrs

Tracking issue for an upstream DataFusion bug (to file against apache/datafusion). One of two blockers for dictionary-encoding coordinate columns (#217).

Summary

GROUP BY over a dictionary-typed column fails when the combined dictionary across partitions exceeds the per-batch key width:

DataFusion error: Arrow error: Dictionary key bigger than the key type

Each input partition is a valid Dictionary(Int8, Int64) — its dictionary has ≤128 distinct values, so an Int8 key is legal per batch. But the partitions carry disjoint values, so when the aggregate combines them the combined dictionary exceeds 128 entries and the Int8 key overflows.

Minimal repro

Pure datafusion + pyarrow, no xarray-sql, no network: repros/datafusion/dict_key_overflow.py (branch claude/datafusion-upstream-repros-fs1bqv; a paste-in Rust test is in the README).

# 50 partitions, each Dictionary(Int8, Int64) of 100 DISJOINT values
#   -> combined cardinality 5000 >> Int8 max (127)
ctx.sql("SELECT k, SUM(v) FROM t GROUP BY k")  # -> "Dictionary key bigger than the key type"

The disjoint values matter: when every partition shares the same dictionary values, arrow unifies them and there is no overflow — which is why this is intermittent and version-sensitive in the wild (it does not reproduce on arrow-rs 58.3 when the values coincide).

Expected

Combining dictionary-typed columns across partitions should not fail on a key type that was valid for each input batch. It should widen the key (Int8 → Int16 → …) or decode, since a producer cannot know per batch how large the combined dictionary will become downstream.

Environment

datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).

Impact on xarray-sql

Blocks dictionary-encoding coordinate columns (#217): an unchunked coordinate repeated across partitions, or a chunked coordinate whose partitions hold disjoint slices, can exceed a narrow key under streaming aggregation. Reported downstream in #217 by @ghostiee-11.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions