Skip to content

DataSet.iterate_items(processor=...) possibly confusing and has two unrelated concerns #595

@dale-wahl

Description

@dale-wahl

The processor parameter on DataSet.iterate_items / DataSet._iterate_items is used for two unrelated things. Some callers don't pass it and silently lose features; one caller passes a class instead of an instance which does nothing silently.

What processor= actually does (because I was confused at first)

  1. Interrupt source — checked via hasattr(processor, "interrupted") and processor.interrupted so iteration aborts cleanly when the calling processor is interrupted. (We often do this redundantly in process calls, but this is a nice catch.)
  2. Log-routing target — used downstream by warn_unmappable_item (where I discovered this issue) to route per-item map-failure entries to processor.dataset (the processing dataset's log) rather than the source dataset. I think that's still nice to have (running a processor on a DataSet but it has to skip unmapped items should be in that new DataSet's log).

The name also collides with own_processor, which iterate_items uses internally for map_item resolution (what I originally thought it was for). Two unrelated "processors" in one method.

Concrete failure modes

  • Caller passes nothing → silent degradation: no interrupt support, per-item warnings land in the source dataset's log instead of the caller's output log. Quite a few examples: tf_idf.py, vectorise.py, topic_modeling.py, generate_embeddings.py and others. Some of these don't have the self.interrupted check in the loop either.
  • Caller passes a class → silent no-op: webtool/views/views_dataset.py pass dataset.get_own_processor(), which returns a class, not an instance. Probably me or someone like me thought it controlled map_item resolution.
  • Most processors pass self and work correctly (both attributes happen to satisfy both uses). Yay.

What do we do?

Cleanup I'll do a mini audit and fix callers: pass self from every processor; omit from views_dataset callers (even though it does nothing, it is... wrong? or just confusing). And I could add a runtime check that warns if a class is passed.

But what's better?

1. Split into two parameters. Maybe a interrupt_source= and log_target=. It's clearer, though we lose the easy self.source_dataset.iterate_items(self) convenience.

2. Create an iteration wrapper in BasicProcessor. Add BasicProcessor.iterate(self, dataset, **kwargs); processors call self.iterate(self.source_dataset). DataSet.iterate_items drops the processor parameter entirely — pure generator for non-processor callers. Clean, but rewrites every processor call site. For running processors, passing processor should not be options (but it has to be since the DataSet iterate_items is used elsewhere directly).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions