Batch lazy IRI resolution (prefetch / identity map / DataLoader) to avoid N+1 resolves #92

Description

@simontaurus

Summary

Lazy IRI resolution currently happens synchronously, per field, per object. Touching a relation
(entity.some_relation) triggers its own _resolve(...) call. When code walks many relations across many
entities, this becomes the classic N+1 problem: O(nodes) backend round trips, executed sequentially.

This issue proposes a feature to automatically bundle lazy resolutions into a small number of batched
backend calls.

Background

In oold/model/v1/__init__.py, __getattribute__ resolves a relation field on access when the stored
value is still unresolved:

if name in self.__iris__ and len(self.__iris__[name]) > 0:
    if self.__dict__[name] is None or (isinstance(..., list) and len(...) == 0):
        node_dict = self._resolve(iris)   # one call, for one field, of one object
        ...

The primitives needed for batching already exist:

  • _resolve(iris: list) is already batch capable (takes a list, returns {iri: node}).
  • get_iri_ref(field) / get_raw(field) read links without resolving.

So the gap is purely about driving _resolve with all pending IRIs at once instead of one field at a time.
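To make the gap concrete, here is a minimal, self-contained sketch that counts backend calls for per-field resolution versus one batched call. CountingResolver and the dict-based entities are stand-ins invented for illustration; they only mimic the {iri: node} contract described above and are not oold's actual classes.

```python
class CountingResolver:
    """Hypothetical stand-in for a batch-capable resolver: list of IRIs in, {iri: node} out."""

    def __init__(self, store):
        self.store = store
        self.calls = 0  # number of backend round trips

    def resolve(self, iris):
        self.calls += 1
        return {iri: self.store[iri] for iri in iris if iri in self.store}


# Five toy entities, each with two relation fields holding unresolved IRIs.
entities = [
    {"input": [f"Item:S{n}"], "output": [f"Item:D{n}"]} for n in range(5)
]
store = {iri: {"iri": iri} for e in entities for iris in e.values() for iri in iris}

# Per-field resolution: one call per field per entity (the N+1 shape).
lazy = CountingResolver(store)
for entity in entities:
    for field_iris in entity.values():
        lazy.resolve(field_iris)

# Batched resolution: gather every pending IRI first, then resolve once.
batched = CountingResolver(store)
all_iris = sorted({iri for e in entities for iris in e.values() for iri in iris})
nodes = batched.resolve(all_iris)

print(lazy.calls, batched.calls)
```

The batched variant returns the same nodes; only the number of round trips differs.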

The core constraint

__getattribute__ is synchronous: when you touch entity.relation, the value must exist before the
expression returns. You cannot transparently defer a sync attribute access without a proxy object. So the
design is about where to place the batch boundary. Three viable strategies follow.

Option A: explicit batched prefetch (recommended primary)

Equivalent to SQLAlchemy's selectinload, Django's prefetch_related, or GraphQL lookahead. The caller
declares relation paths; oold does a level-order (BFS) traversal and batches one _resolve per depth level,
so backend calls become O(depth) instead of O(nodes).

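def prefetch(roots, paths):
    # paths: ["input", "output", "tool", "output.sample"]
    tree = _paths_to_tree(paths)   # nested dict, e.g. {"output": {"sample": {}}, "tool": {}, ...}
    frontier = [(e, tree) for e in roots]
    while frontier:
        # collect every link on this depth level; carry the subtree along for the next level
        pending = [(e, f, iri, sub[f]) for e, sub in frontier
                                       for f in sub
                                       for iri in _iri_refs(e, f)]
        if not pending:
            break   # nothing left to resolve, avoid an empty backend call
        nodes = backend.resolve(sorted({iri for _, _, iri, _ in pending}))   # one call per level
        # (populate e.<f> from nodes here), then descend only where a sub-path remains
        frontier = [(nodes[iri], child) for (e, f, iri, child) in pending
                    if nodes.get(iri) is not None and child]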

Suggested public API:

OSW.load_entity(LoadEntityParam(titles=..., prefetch=["output", "tool", "input"]))
entity.resolve(["output.sample"])
resolve_all(list_of_entities, ["output", "tool", "input"])

This approach is deterministic, needs no proxy magic, fits synchronous pydantic models, and is the natural
place to make resolution tolerant of partial failures.
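The level-order idea can be exercised end to end with a toy model. This is a sketch under stated assumptions: Node, FakeBackend, and paths_to_tree are hypothetical helpers written for this example, and a real implementation would read links via get_iri_ref and write resolved values into the pydantic fields instead of a resolved dict.

```python
class FakeBackend:
    """Hypothetical batch resolver: list of IRIs in, {iri: node} out."""

    def __init__(self, store):
        self.store = store
        self.calls = 0

    def resolve(self, iris):
        self.calls += 1
        return {iri: self.store[iri] for iri in iris if iri in self.store}


class Node:
    """Toy entity: refs holds unresolved IRIs per field, resolved the loaded nodes."""

    def __init__(self, iri, **refs):
        self.iri = iri
        self.refs = refs
        self.resolved = {}


def paths_to_tree(paths):
    """["output", "output.sample"] -> {"output": {"sample": {}}}"""
    tree = {}
    for path in paths:
        level = tree
        for part in path.split("."):
            level = level.setdefault(part, {})
    return tree


def prefetch(roots, paths, backend):
    tree = paths_to_tree(paths)
    frontier = [(e, tree) for e in roots]
    while frontier:
        pending = [(e, f, iri, sub[f])
                   for e, sub in frontier
                   for f in sub
                   for iri in e.refs.get(f, [])]
        if not pending:
            break
        nodes = backend.resolve(sorted({iri for _, _, iri, _ in pending}))
        frontier, queued = [], set()
        for e, f, iri, child in pending:
            node = nodes.get(iri)
            if node is None:
                continue  # missing target: leave the slot unresolved
            e.resolved.setdefault(f, []).append(node)
            # Queue each shared node only once per sub-path.
            if child and (iri, id(child)) not in queued:
                queued.add((iri, id(child)))
                frontier.append((node, child))


store = {
    "Item:D1": Node("Item:D1", sample=["Item:S1"]),
    "Item:D2": Node("Item:D2", sample=["Item:S1"]),
    "Item:T1": Node("Item:T1"),
    "Item:S1": Node("Item:S1"),
}
procs = [
    Node("Item:P1", output=["Item:D1"], tool=["Item:T1"]),
    Node("Item:P2", output=["Item:D2"], tool=["Item:T1"]),
]
backend = FakeBackend(store)
prefetch(procs, ["output", "tool", "output.sample"], backend)
print(backend.calls)
```

Two processes with three relation levels between them cost two backend calls here, one per depth level, regardless of how many entities sit on each level.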

Option B: DataLoader plus identity map session (auto-batching)

The DataLoader pattern: a per-session loader coalesces load(iri) calls within a window, dispatches one
batched fetch, and caches by IRI (unit of work / identity map, so the same IRI is fetched once and resolves
to one object).

class IriLoader:
    def __init__(self, backend):
        self._backend, self._cache, self._queue = backend, {}, []
    def load(self, iri): ...      # enqueues iri, returns a Future
    def flush(self):
        # dedupe while preserving order; skip IRIs the identity map already holds
        todo = list(dict.fromkeys(i for i in self._queue if i not in self._cache))
        if todo:
            self._cache.update(self._backend.resolve(todo))   # one call
        # resolve queued futures from cache (None or Error allowed per key), then reset the queue
        self._queue.clear()

Transparent batching needs a deferral point:

  • Async resolvers (await entity.relation): the event loop tick is the batch window. This is where
    DataLoader shines, and is probably the cleanest long-term direction if oold goes async.
  • A sync collect context, where accesses inside the block are recorded and resolved together on exit:

    with oold.batch_resolution():
        for proc in procs:
            proc.input
            proc.output
    # on exit: one batched resolve, then values are populated

The identity map is valuable on its own (dedupe plus stable object identity), independent of batching.
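A synchronous version of the loader plus collect context can be sketched in a few lines. This is an illustration under assumptions, not oold's API: the resolve callable, the calls counter, and the batch_resolution(loader) signature are invented here, and concurrent.futures.Future is used only as a convenient result holder.

```python
from concurrent.futures import Future
from contextlib import contextmanager


class IriLoader:
    """Sketch of a sync DataLoader with an identity map keyed by IRI."""

    def __init__(self, resolve):
        self._resolve = resolve  # callable: list of IRIs -> {iri: node}
        self._cache = {}         # identity map: iri -> node (None if not found)
        self._queue = []         # (iri, Future) pairs waiting for the next flush
        self.calls = 0           # backend round trips, for demonstration

    def load(self, iri):
        fut = Future()
        if iri in self._cache:
            fut.set_result(self._cache[iri])  # already known: no backend call
        else:
            self._queue.append((iri, fut))
        return fut

    def flush(self):
        queue, self._queue = self._queue, []
        # Dedupe while preserving first-seen order.
        todo = list(dict.fromkeys(iri for iri, _ in queue if iri not in self._cache))
        if todo:
            self.calls += 1
            found = self._resolve(todo)
            for iri in todo:
                self._cache[iri] = found.get(iri)
        for iri, fut in queue:
            fut.set_result(self._cache[iri])


@contextmanager
def batch_resolution(loader):
    yield loader
    loader.flush()  # not reached if the body raised


store = {"Item:A": {"name": "A"}, "Item:B": {"name": "B"}}
loader = IriLoader(lambda iris: {i: store[i] for i in iris if i in store})

with batch_resolution(loader) as ld:
    f1 = ld.load("Item:A")
    f2 = ld.load("Item:B")
    f3 = ld.load("Item:A")        # duplicate: same object after flush
    f4 = ld.load("Item:missing")  # resolves to None instead of raising
    pending_inside = not f1.done()

f5 = loader.load("Item:A")        # served from the identity map
print(loader.calls, f1.result() is f3.result(), f4.result())
```

Inside the block nothing is resolved yet, which is exactly the limitation of a sync collect context: code in the block cannot use the values it touches until the block exits.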

Option C: lazy reference proxies (transparent, sharp edges)

__getattribute__ returns a LazyRef(iri, batch) instead of resolving. The first real use of any proxy
flushes the shared batch, then forwards. This makes plain loops auto-batch, but proxies leak into
isinstance checks, equality, pydantic validation, and serialization. Probably not worth the footguns
unless full transparency is a hard requirement.
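A small sketch shows both the appeal and the leak. Batch, LazyRef, and Sample below are hypothetical classes written for this example; a production proxy would need far more forwarding (dunder methods, pydantic hooks) than this.

```python
class Batch:
    """Shared batch: collects proxies and resolves them all in one call."""

    def __init__(self, resolve):
        self._resolve = resolve  # callable: list of IRIs -> {iri: node}
        self.pending = []
        self.calls = 0

    def flush(self):
        refs, self.pending = self.pending, []
        if not refs:
            return
        self.calls += 1
        found = self._resolve(sorted({ref._iri for ref in refs}))
        for ref in refs:
            ref._target = found.get(ref._iri)


class LazyRef:
    """Proxy that resolves on first real attribute use."""

    def __init__(self, iri, batch):
        self._iri, self._batch, self._target = iri, batch, None
        batch.pending.append(self)

    def __getattr__(self, name):
        # Only called when normal lookup fails; never forward private names.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._target is None:
            self._batch.flush()  # first use flushes every pending proxy
        return getattr(self._target, name)


class Sample:
    def __init__(self, name):
        self.name = name


store = {"Item:S1": Sample("s1"), "Item:S2": Sample("s2")}
batch = Batch(lambda iris: {i: store[i] for i in iris if i in store})
refs = [LazyRef("Item:S1", batch), LazyRef("Item:S2", batch)]

names = [ref.name for ref in refs]             # plain loop, one backend call
is_sample = isinstance(refs[0], Sample)        # the proxy is not a Sample
equals_target = refs[0] == store["Item:S1"]    # default equality is identity
print(names, batch.calls, is_sample, equals_target)
```

The loop batches without any caller cooperation, but the last two checks show the proxy failing type and equality tests that the real object would pass.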

Recommendation

  1. Identity map resolution session (with oold.session():): caches iri -> node, dedupes, stabilizes
    identity. Foundation for the rest.
  2. prefetch / selectinload-style API (Option A) as the ergonomic front door. Covers most cases,
    deterministic, sync-friendly.
  3. Partial failure semantics in the batch resolver: a per-key result of node | Error, with a policy flag
    (skip / raise / collect_errors). On real wikis full of half-valid datasets this turns "one bad
    linked entity kills the traversal" into "you get the good ones plus a list of errors". This is the part
    that matters most in practice.
  4. DataLoader (Option B) later, if and when resolvers become async.
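The partial failure policy in item 3 could look roughly like the following. The policy names come from the list above; resolve_batch, BatchResult, fetch_raw, and validate are hypothetical names for this sketch, with validate standing in for pydantic model construction.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    nodes: dict = field(default_factory=dict)   # iri -> validated node
    errors: dict = field(default_factory=dict)  # iri -> exception (collect_errors only)


def resolve_batch(iris, fetch_raw, validate, errors="raise"):
    """Fetch all IRIs in one call, then validate each key independently."""
    if errors not in ("skip", "raise", "collect_errors"):
        raise ValueError(f"unknown errors policy: {errors!r}")
    unique = list(dict.fromkeys(iris))
    raw = fetch_raw(unique)  # one backend call for the whole batch
    result = BatchResult()
    for iri in unique:
        try:
            if iri not in raw:
                raise KeyError(iri)
            result.nodes[iri] = validate(raw[iri])
        except Exception as exc:
            if errors == "raise":
                raise
            if errors == "collect_errors":
                result.errors[iri] = exc
            # errors == "skip": drop the entry silently
    return result


ALLOWED_UNITS = {"mm", "kg"}


def validate(raw):
    if raw["unit"] not in ALLOWED_UNITS:
        raise ValueError(f"unit {raw['unit']!r} is not in the enum")
    return raw


raw_store = {"Item:D1": {"unit": "mm"}, "Item:D2": {"unit": "furlong"}}


def fetch(iris):
    return {i: raw_store[i] for i in iris if i in raw_store}


wanted = ["Item:D1", "Item:D2", "Item:D3"]
collected = resolve_batch(wanted, fetch, validate, errors="collect_errors")
skipped = resolve_batch(wanted, fetch, validate, errors="skip")
print(sorted(collected.nodes), sorted(collected.errors))
```

With collect_errors the caller gets the valid dataset plus one error per bad or missing IRI; with raise the first failure aborts the batch, which matches the current behaviour described in the motivation.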

Design notes

  • Keep the dict-shaped resolver ({iri: node}): it is order-independent and dedupes. Let the prefetch
    layer own re-association back to (entity, field, position) rather than returning aligned lists that
    break on dedupe.
  • A good litmus test: a hand-rolled get_iri_ref plus batched load workaround should collapse into
    resolve_all(entities, ["output", "tool", "input"], errors="skip").
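The re-association step from the first note is small enough to sketch directly. The links mapping and the reassociate helper are assumptions made for this example; the point is only that a deduped {iri: node} dict still fills every (entity, field, position) slot.

```python
def reassociate(links, nodes):
    """Map a deduped {iri: node} result back onto every (entity, field, position).

    links: {(entity_id, field): [iri, ...]} as read without resolving
    nodes: {iri: node} as returned by the batch resolver
    """
    return {
        key: [nodes.get(iri) for iri in iris]  # None marks an unresolved position
        for key, iris in links.items()
    }


links = {
    ("Item:P1", "input"): ["Item:S1", "Item:S1", "Item:S2"],  # repeated IRI
    ("Item:P2", "input"): ["Item:S1"],
    ("Item:P2", "tool"): ["Item:T9"],                         # not in the backend
}
unique_iris = sorted({iri for iris in links.values() for iri in iris})
nodes = {iri: {"iri": iri} for iri in unique_iris if iri != "Item:T9"}

placed = reassociate(links, nodes)
print(len(unique_iris), [n and n["iri"] for n in placed[("Item:P1", "input")]])
```

Three unique IRIs cover five slots here, and the repeated IRI maps to the same object in each position, which an aligned-list return value could not express after deduplication.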

Motivation / real-world case

Building a dashboard over measurement data, we walk ProcessDocumentation -> input (Sample),
-> output (Dataset), -> tool (MeasurementUnit) across many process docs. Lazy per-field resolution
caused:

  1. Sequential round trips (one linked page at a time), very slow even for a handful of processes.
  2. A single invalid linked dataset (for example a Dataset with an out-of-enum unit) raising
    mid-traversal and aborting the whole walk.

The workaround was to bypass resolution entirely via get_iri_ref and batch-load the linked entities in one
parallel call, tolerating per-entity validation failures. A first-class prefetch plus identity map plus
partial failure feature would make that the default rather than a manual pattern.
