The processor parameter on DataSet.iterate_items / DataSet._iterate_items is used for two unrelated things. Some callers don't pass it and silently lose features; one caller passes a class instead of an instance which does nothing silently.
What processor= actually does (because I was confused at first)
- Interrupt source — checked via
hasattr(processor, "interrupted") and processor.interrupted so iteration aborts cleanly when the calling processor is interrupted. (We often do this redundantly in process calls, but this is a nice catch.)
- Log-routing target — used downstream by
warn_unmappable_item (where I discovered this issue) to route per-item map-failure entries to processor.dataset (the processing dataset's log) rather than the source dataset. I think that's still nice to have (running a processor on a DataSet but it has to skip unmapped items should be in that new DataSet's log).
The name also collides with own_processor, which iterate_items uses internally for map_item resolution (what I originally thought it was for). Two unrelated "processors" in one method.
Concrete failure modes
- Caller passes nothing → silent degradation: no interrupt support, per-item warnings land in the source dataset's log instead of the caller's output log. Quite a few examples:
tf_idf.py, vectorise.py, topic_modeling.py, generate_embeddings.py and others. Some of these don't have the self.interrupted check in the loop either.
- Caller passes a class → silent no-op:
webtool/views/views_dataset.py pass dataset.get_own_processor(), which returns a class, not an instance. Probably me or someone like me thought it controlled map_item resolution.
- Most processors pass
self and work correctly (both attributes happen to satisfy both uses). Yay.
What do we do?
Cleanup I'll do a mini audit and fix callers: pass self from every processor; omit from views_dataset callers (even though it does nothing, it is... wrong? or just confusing). And I could add a runtime check that warns if a class is passed.
But what's better?
1. Split into two parameters. Maybe a interrupt_source= and log_target=. It's clearer, though we lose the easy self.source_dataset.iterate_items(self) convenience.
2. Create an iteration wrapper in BasicProcessor. Add BasicProcessor.iterate(self, dataset, **kwargs); processors call self.iterate(self.source_dataset). DataSet.iterate_items drops the processor parameter entirely — pure generator for non-processor callers. Clean, but rewrites every processor call site. For running processors, passing processor should not be options (but it has to be since the DataSet iterate_items is used elsewhere directly).
The
processorparameter onDataSet.iterate_items/DataSet._iterate_itemsis used for two unrelated things. Some callers don't pass it and silently lose features; one caller passes a class instead of an instance which does nothing silently.What
processor=actually does (because I was confused at first)hasattr(processor, "interrupted") and processor.interruptedso iteration aborts cleanly when the calling processor is interrupted. (We often do this redundantly inprocesscalls, but this is a nice catch.)warn_unmappable_item(where I discovered this issue) to route per-item map-failure entries toprocessor.dataset(the processing dataset's log) rather than the source dataset. I think that's still nice to have (running a processor on a DataSet but it has to skip unmapped items should be in that new DataSet's log).The name also collides with
own_processor, whichiterate_itemsuses internally for map_item resolution (what I originally thought it was for). Two unrelated "processors" in one method.Concrete failure modes
tf_idf.py,vectorise.py,topic_modeling.py,generate_embeddings.pyand others. Some of these don't have theself.interruptedcheck in the loop either.webtool/views/views_dataset.pypassdataset.get_own_processor(), which returns a class, not an instance. Probably me or someone like me thought it controlled map_item resolution.selfand work correctly (both attributes happen to satisfy both uses). Yay.What do we do?
Cleanup I'll do a mini audit and fix callers: pass
selffrom every processor; omit fromviews_datasetcallers (even though it does nothing, it is... wrong? or just confusing). And I could add a runtime check that warns if a class is passed.But what's better?
1. Split into two parameters. Maybe a
interrupt_source=andlog_target=. It's clearer, though we lose the easyself.source_dataset.iterate_items(self)convenience.2. Create an iteration wrapper in BasicProcessor. Add
BasicProcessor.iterate(self, dataset, **kwargs); processors callself.iterate(self.source_dataset).DataSet.iterate_itemsdrops theprocessorparameter entirely — pure generator for non-processor callers. Clean, but rewrites every processor call site. For running processors, passingprocessorshould not be options (but it has to be since the DataSet iterate_items is used elsewhere directly).