Summary
Lazy IRI resolution currently happens synchronously, per field, per object. Touching a relation
(entity.some_relation) triggers its own _resolve(...) call. When code walks many relations across many
entities, this becomes the classic N+1 problem: O(nodes) backend round trips, executed sequentially.
This issue proposes a feature to automatically bundle lazy resolutions into a small number of batched
backend calls.
Background
In oold/model/v1/__init__.py, __getattribute__ resolves a relation field on access when the stored
value is still unresolved:
    if name in self.__iris__ and len(self.__iris__[name]) > 0:
        if self.__dict__[name] is None or (isinstance(..., list) and len(...) == 0):
            node_dict = self._resolve(iris)  # one call, for one field, of one object
            ...
The primitives needed for batching already exist:
- _resolve(iris: list) is already batch capable (takes a list, returns {iri: node}).
- get_iri_ref(field) / get_raw(field) read links without resolving.
So the gap is purely about driving _resolve with all pending IRIs at once instead of one field at a time.
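To make the gap concrete, here is a minimal sketch that counts backend round trips for per field resolution versus a single batched call. CountingBackend and the dict shaped entities are hypothetical stand-ins, not oold classes; only the idea of passing a list of IRIs and getting back {iri: node} comes from the description above.

```python
class CountingBackend:
    """Hypothetical stand-in for the real backend; counts round trips."""

    def __init__(self, store):
        self.store = store
        self.calls = 0

    def resolve(self, iris):
        self.calls += 1
        return {iri: self.store.get(iri) for iri in iris}


# Unresolved links per entity: field name -> list of IRIs (illustrative shape only).
entities = [
    {"input": ["ex:s1"], "output": ["ex:d1"], "tool": ["ex:t1"]},
    {"input": ["ex:s2"], "output": ["ex:d2"], "tool": ["ex:t1"]},
]
store = {iri: {"iri": iri} for e in entities for iris in e.values() for iri in iris}

# Per field resolution: one backend call per (entity, field).
per_field = CountingBackend(store)
for entity in entities:
    for field_name, iris in entity.items():
        per_field.resolve(iris)

# Batched: gather every pending IRI first, then resolve once.
batched = CountingBackend(store)
pending = sorted({iri for e in entities for iris in e.values() for iri in iris})
nodes = batched.resolve(pending)

print(per_field.calls, batched.calls)  # 6 1
```

The shared tool ex:t1 is fetched twice in the per field variant and once in the batched one, which is the dedupe benefit of collecting IRIs into a set before the call.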
The core constraint
__getattribute__ is synchronous: when you touch entity.relation, the value must exist before the
expression returns. You cannot transparently defer a sync attribute access without a proxy object. So the
design is about where to place the batch boundary. Three viable strategies follow.
Option A: explicit batched prefetch (recommended primary)
Equivalent to ORM selectinload / Django prefetch_related / GraphQL look ahead. The caller declares
relation paths; oold does a level order (BFS) traversal and batches one _resolve per depth level, so
backend calls become O(depth) instead of O(nodes).
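    def prefetch(roots, paths):
        # paths: ["input", "output", "tool", "output.sample"]
        tree = _paths_to_tree(paths)  # nested dict, e.g. {"output": {"sample": {}}, ...}
        frontier = [(e, tree) for e in roots]
        while frontier:
            # Collect every pending link at this depth; carry the subtree for the next level.
            pending = [(e, f, iri, sub[f]) for e, sub in frontier
                                           for f in sub
                                           for iri in _iri_refs(e, f)]
            if not pending:
                break
            nodes = backend.resolve(sorted({iri for *_, iri, _ in pending}))  # one call per level
            # (assigning nodes[iri] back onto e.f is omitted in this sketch)
            frontier = [(nodes[iri], child) for (_, _, iri, child) in pending
                        if nodes.get(iri) is not None and child]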
Suggested public API:
    OSW.load_entity(LoadEntityParam(titles=..., prefetch=["output", "tool", "input"]))
    entity.resolve(["output.sample"])
    resolve_all(list_of_entities, ["output", "tool", "input"])
This approach is deterministic, needs no proxy magic, fits synchronous pydantic models, and is the natural
place to make resolution tolerant of partial failures.
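The level order traversal described in this option can be sketched end to end against fake data. Everything here is an assumption made for illustration: Node stands in for an oold entity, CountingBackend for the backend, and paths_to_tree is one possible shape for the path helper (dotted paths folded into a nested dict). The sketch writes resolved nodes into a separate resolved dict rather than onto model fields.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an entity: raw links plus resolved values."""
    iri: str
    links: dict = field(default_factory=dict)     # field name -> list of IRIs
    resolved: dict = field(default_factory=dict)  # field name -> list of Node


class CountingBackend:
    def __init__(self, nodes):
        self.nodes = {n.iri: n for n in nodes}
        self.calls = 0

    def resolve(self, iris):
        self.calls += 1
        return {iri: self.nodes.get(iri) for iri in iris}


def paths_to_tree(paths):
    """Fold dotted paths into a nested dict: 'output.sample' -> {'output': {'sample': {}}}."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.split("."):
            node = node.setdefault(part, {})
    return tree


def prefetch(roots, paths, backend):
    tree = paths_to_tree(paths)
    frontier = [(e, tree) for e in roots]
    seen = set()  # (iri, subtree id) pairs already expanded; also guards against cycles
    while frontier:
        iris = sorted({iri for e, sub in frontier for f in sub for iri in e.links.get(f, [])})
        if not iris:
            break
        nodes = backend.resolve(iris)  # one call per depth level
        next_frontier = []
        for e, sub in frontier:
            for f, child in sub.items():
                found = [nodes[i] for i in e.links.get(f, []) if nodes.get(i) is not None]
                e.resolved[f] = found
                for node in found:
                    key = (node.iri, id(child))
                    if child and key not in seen:
                        seen.add(key)
                        next_frontier.append((node, child))
        frontier = next_frontier


s1, s2, s3, t1 = Node("ex:s1"), Node("ex:s2"), Node("ex:s3"), Node("ex:t1")
d1 = Node("ex:d1", links={"sample": ["ex:s3"]})
p1 = Node("ex:p1", links={"input": ["ex:s1"], "output": ["ex:d1"], "tool": ["ex:t1"]})
p2 = Node("ex:p2", links={"input": ["ex:s2"], "output": ["ex:d1"], "tool": ["ex:t1"]})
backend = CountingBackend([s1, s2, s3, d1, t1])

prefetch([p1, p2], ["input", "output", "tool", "output.sample"], backend)
print(backend.calls)  # 2
```

Two processes with three relations each, plus one nested relation, cost two backend calls: one for depth 1 and one for depth 2.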
Option B: DataLoader plus identity map session (auto batching)
The DataLoader pattern: a per session loader coalesces load(iri) calls within a window, dispatches one
batched fetch, and caches by IRI (unit of work / identity map, so the same IRI is fetched once and resolves
to one object).
    class IriLoader:
        def __init__(self, backend):
            self._backend, self._cache, self._queue = backend, {}, []

        def load(self, iri): ...  # returns a Future

        def flush(self):
            todo = [i for i in self._queue if i not in self._cache]
            self._cache.update(self._backend.resolve(sorted(set(todo))))  # one call
            # resolve queued futures from cache (None or Error allowed per key)
Transparent batching needs a deferral point:
- Async resolvers (await entity.relation): the event loop tick is the batch window. This is where
  DataLoader shines, and is probably the cleanest long term direction if oold goes async.
- A sync collect context:
      with oold.batch_resolution():
          for proc in procs:
              proc.input
              proc.output
      # on exit: one batched resolve, then values are populated
The identity map is valuable on its own (dedupe plus stable object identity), independent of batching.
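A runnable sketch of the sync collect variant, under stated assumptions: the loader takes a plain resolve callable instead of a backend object, load returns a concurrent.futures.Future, and batch_resolution here takes that callable as an argument (the proposed oold.batch_resolution() takes none). None of these names exist in oold today.

```python
from concurrent.futures import Future
from contextlib import contextmanager


class IriLoader:
    def __init__(self, resolve):
        self._resolve = resolve  # callable: list of IRIs -> {iri: node}
        self._cache = {}         # identity map: iri -> node
        self._queue = []         # (iri, Future) pairs awaiting the next flush

    def load(self, iri):
        fut = Future()
        if iri in self._cache:
            fut.set_result(self._cache[iri])  # already known: no backend call
        else:
            self._queue.append((iri, fut))
        return fut

    def flush(self):
        todo = sorted({iri for iri, _ in self._queue if iri not in self._cache})
        if todo:
            self._cache.update(self._resolve(todo))  # one call for the whole window
        for iri, fut in self._queue:
            fut.set_result(self._cache.get(iri))  # None if the backend had no such IRI
        self._queue.clear()


@contextmanager
def batch_resolution(resolve):
    loader = IriLoader(resolve)
    yield loader
    loader.flush()  # only reached on a clean exit from the with block


calls = []

def fake_resolve(iris):
    calls.append(list(iris))
    return {iri: {"iri": iri} for iri in iris}


with batch_resolution(fake_resolve) as loader:
    a = loader.load("ex:d1")
    b = loader.load("ex:t1")
    c = loader.load("ex:d1")  # duplicate request for the same IRI

print(calls)                     # [['ex:d1', 'ex:t1']]
print(a.result() is c.result())  # True
```

The duplicate load of ex:d1 produces no extra fetch, and both futures resolve to the same object, which is the identity map property described above.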
Option C: lazy reference proxies (transparent, sharp edges)
__getattribute__ returns a LazyRef(iri, batch) instead of resolving. The first real use of any proxy
flushes the shared batch, then forwards. This makes plain loops auto batch, but proxies leak into
isinstance, equality, pydantic validation and serialization. Probably not worth the footguns unless full
transparency is a hard requirement.
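A small sketch of the proxy idea and of one of its sharp edges. LazyRef and Batch are hypothetical; the Dataset class is a stand-in, not the oold model. The proxy forwards unknown attributes to the resolved node after flushing the shared batch.

```python
class Batch:
    """Hypothetical shared batch: collects IRIs, resolves them in one call on flush."""

    def __init__(self, resolve):
        self._resolve = resolve
        self._pending = set()
        self.nodes = {}

    def add(self, iri):
        self._pending.add(iri)

    def flush(self):
        if self._pending:
            self.nodes.update(self._resolve(sorted(self._pending)))
            self._pending.clear()


class LazyRef:
    def __init__(self, iri, batch):
        self._iri = iri
        self._batch = batch
        batch.add(iri)

    def __getattr__(self, name):
        # Only invoked for attributes the proxy itself lacks: flush, then forward.
        self._batch.flush()
        return getattr(self._batch.nodes[self._iri], name)


class Dataset:
    def __init__(self, iri):
        self.iri = iri
        self.unit = "mm"


calls = []

def fake_resolve(iris):
    calls.append(iris)
    return {iri: Dataset(iri) for iri in iris}


batch = Batch(fake_resolve)
refs = [LazyRef(f"ex:d{i}", batch) for i in range(3)]
units = [r.unit for r in refs]  # first access flushes all three at once

print(len(calls))                    # 1
print(isinstance(refs[0], Dataset))  # False
```

The plain loop batches into a single call, but the isinstance check fails because the caller holds a proxy rather than a Dataset, which is the kind of leak this option warns about.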
Recommendation
- Identity map resolution session (with oold.session():): caches iri -> node, dedupes, stabilizes
  identity. Foundation for the rest.
- prefetch / selectinload style API (Option A) as the ergonomic front door. Covers most cases,
  deterministic, sync friendly.
- Partial failure semantics in the batch resolver: per key node | Error with a policy flag
  (skip / raise / collect_errors). On real wikis full of half valid datasets this turns "one bad
  linked entity kills the traversal" into "you get the good ones plus a list of errors". This is the part
  that matters most in practice.
- DataLoader (Option B) later, naturally, if and when resolvers become async.
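The partial failure policy from the list above could look roughly like this. It is a sketch only: resolve_batch, validate, fake_fetch and the unit enum are all hypothetical names, and the real resolver would validate into pydantic models rather than return dicts.

```python
ALLOWED_UNITS = {"mm", "kg"}  # illustrative enum


def validate(payload):
    """Hypothetical validator: rejects an out of enum unit."""
    if payload["unit"] not in ALLOWED_UNITS:
        raise ValueError(f"unit {payload['unit']!r} not in enum")
    return payload


def resolve_batch(iris, fetch_raw, errors="collect_errors"):
    if errors not in ("skip", "raise", "collect_errors"):
        raise ValueError(f"unknown errors policy: {errors!r}")
    raw = fetch_raw(sorted(set(iris)))  # one backend call for the whole batch
    nodes, failures = {}, {}
    for iri, payload in raw.items():
        try:
            nodes[iri] = validate(payload)
        except ValueError as exc:
            if errors == "raise":
                raise
            if errors == "collect_errors":
                failures[iri] = exc
            # errors == "skip": drop the bad entity silently
    return nodes, failures


def fake_fetch(iris):
    units = {"ex:d1": "mm", "ex:d2": "furlong", "ex:d3": "kg"}
    return {iri: {"iri": iri, "unit": units[iri]} for iri in iris}


nodes, failures = resolve_batch(["ex:d1", "ex:d2", "ex:d3"], fake_fetch)
print(sorted(nodes))     # ['ex:d1', 'ex:d3']
print(sorted(failures))  # ['ex:d2']
```

With collect_errors the caller gets the two valid datasets plus the error for the invalid one; with raise the first failure propagates as before.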
Design notes
- Keep the dict shaped resolver ({iri: node}): it is order independent and dedupes. Let the prefetch
  layer own reassociation back to (entity, field, position) rather than returning aligned lists that
  break on dedupe.
- A good litmus test: a hand rolled get_iri_ref plus batched load workaround should collapse into
  resolve_all(entities, ["output", "tool", "input"], errors="skip").
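The first note can be illustrated with a few lines. The pending tuples and fake_resolve are made up for the example; the point is only that placement is keyed by IRI, so deduplicating the request does not disturb it.

```python
# (entity id, field, position, iri) for every pending link; two entries share a tool.
pending = [
    ("p1", "output", 0, "ex:d1"),
    ("p1", "tool", 0, "ex:t1"),
    ("p2", "tool", 0, "ex:t1"),
]


def fake_resolve(iris):
    return {iri: {"iri": iri} for iri in iris}


unique_iris = sorted({iri for *_, iri in pending})
nodes = fake_resolve(unique_iris)  # dict keyed by IRI, independent of request order

# Reassociate by key, not by position in the request list.
placed = {(eid, f, pos): nodes[iri] for eid, f, pos, iri in pending if iri in nodes}

print(len(unique_iris), len(placed))                           # 2 3
print(placed[("p1", "tool", 0)] is placed[("p2", "tool", 0)])  # True
```

Three links map onto two fetched nodes. An aligned list of two results could not be zipped back onto three pending slots without extra bookkeeping, whereas the dict lookup handles it directly.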
Motivation / real world case
Building a dashboard over measurement data, we walk ProcessDocumentation -> input (Sample),
-> output (Dataset), -> tool (MeasurementUnit) across many process docs. Lazy per field resolution
caused:
- Sequential round trips (one linked page at a time), very slow even for a handful of processes.
- A single invalid linked dataset (for example a Dataset with an out of enum unit) raising mid
  traversal and aborting the whole walk.
The workaround was to bypass resolution entirely via get_iri_ref and batch load the linked entities in one
parallel call, tolerating per entity validation failures. A first class prefetch plus identity map plus
partial failure feature would make that the default rather than a manual pattern.