fix: Perform cleanup of intermediary data safely by kfaraz · Pull Request #19650 · apache/druid

kfaraz · 2026-07-02T11:40:03Z

Description

#19187 introduced cleanup of intermediary data in ParallelIndexSupervisorTask.cleanup() step.
It invokes the DataSegmentKiller (bound on peons to OmniDataSegmentKiller) and cleans up all files under the directory specified by the supervisor task id. The OmniDataSegmentKiller tries to do a recursive kill on all possible implementations of DataSegmentKiller.

Bug

This can lead to task failures in cases where an extension (say Azure or GCS) has been loaded on the cluster but is not configured to work as a deep store.

com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) druid.azure.container - must not be null
  at JsonConfigProvider.bind(JsonConfigProvider.java:151)
      \_ installed by: Modules$OverrideModule -> AzureStorageDruidModule
  at AzureDataSegmentKiller.<init>(AzureDataSegmentKiller.java:61)
      \_ for 1st parameter
  while locating AzureDataSegmentKiller
  at AzureStorageDruidModule.configure(AzureStorageDruidModule.java:103)
      \_ installed by: Modules$OverrideModule -> AzureStorageDruidModule
  while locating DataSegmentKiller annotated with @Element(setName=,uniqueId=447, type=MAPBINDER, keyType=String)

1 error

======================
Full classname legend:
======================
AzureDataSegmentKiller:  "org.apache.druid.storage.azure.AzureDataSegmentKiller"
AzureStorageDruidModule: "org.apache.druid.storage.azure.AzureStorageDruidModule"
DataSegmentKiller:       "org.apache.druid.segment.loading.DataSegmentKiller"
Element:                 "com.google.inject.internal.Element"
JsonConfigProvider:      "org.apache.druid.guice.JsonConfigProvider"
Modules$OverrideModule:  "com.google.inject.util.Modules$OverrideModule"
========================
End of classname legend:
========================

	at com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:251)
	at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1151)
	at com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:181)
	at org.apache.druid.segment.loading.OmniDataSegmentKiller.killRecursively(OmniDataSegmentKiller.java:109)
	at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.cleanUp(ParallelIndexSupervisorTask.java:1840)

#19647 was meant to mitigate this issue by ignoring such exceptions and allowing the task to succeed.
The exception would still show up as a warning in the task logs.

There is also an inherent risk with this approach since it is possible (however unlikely) that the same path prefix may be under use in other deep storage environments for completely different purposes.
This is especially a concern in cases where an extension (say Azure) is loaded only to be used as an input source.
For the same reason, OmniDataSegmentKiller.killAll() has not been implemented yet either.

Changes

Fix OmniDataSegmentKiller.killRecursively() to only invoke the currently bound deep store
Use IntermediaryDataManager in ParallelIndexSupervisorTask instead of invoking DataSegmentKiller directly.
This allows use of a cleaner interface at the task level and also provides symmetry with DataSegmentPusher within DeepStorageIntermediaryDataManager.
Implement deletePartitions and invoke DataSegmentKiller.killRecursively()
Deprecate DataSegmentKiller.killAll() since it is not used anymore and is a potentially risky operation
Add javadocs and comments
Fix tests

This PR has:

FrankChen021

Severity	Findings
P0	0
P1	1
P2	1
P3	0
Total	2

Found 2 issues.

Reviewed 8 of 8 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-07-02T12:11:50Z


  @Inject
-  public DeepStorageIntermediaryDataManager(DataSegmentPusher dataSegmentPusher)
+  public DeepStorageIntermediaryDataManager(DataSegmentPusher dataSegmentPusher, DataSegmentKiller dataSegmentKiller)


[P1] S3 deepstore mode can no longer provision this manager

This constructor now requires a DataSegmentKiller anywhere DeepStorageIntermediaryDataManager is selected. On middleManager-style injectors that killer is PolyBind-selected by druid.storage.type, but S3 sets that to s3 while S3StorageDruidModule registers its killer under s3_zip. With druid.processing.intermediaryData.storage.type=deepstore on S3, Guice fails to provision this manager instead of starting the service.

Hmm, this seems to be true. I was under the assumption that DataSegmentKiller is always bound to OmniDataSegmentKiller at the top level. But that is applicable only for peons (via CliPeon) and Overlord. MMs do not bind to OmniDataSegmentKiller and use the extension provided DataSegmentKiller itself.
So, when using S3, MMs would not have any initialized binding for DataSegmentKiller.

However, MMs do use IntermediaryDataManager.

I will revert this change and add javadocs to avoid further confusion.

Trying out an optional dependency first.

Thanks for checking. I don't think @Nullable makes this dependency optional in this path: DataSegmentKiller is still bound through LocalDataStorageDruidModule's PolyBind.createChoice, and PolyBind.ConfiggedProvider#get() throws a ProvisionException when druid.storage.type=s3 cannot be found in the killer map, which still only has s3_zip. That happens while Guice resolves the constructor argument, before this constructor can receive null, so the MiddleManager deepstore provisioning failure should still reproduce.

Reviewed 17 of 17 changed files.

FrankChen021 · 2026-07-02T12:11:50Z

+    // Delegate kill to the currently bound deep storage implementation only.
+    // It is unnecessary and also potentially unsafe to kill all files under the
+    // same folder name on inactive deep storage bindings provided by other extensions.
+    final Supplier<DataSegmentKiller> killer = killers.get(deepStorageType);


[P2] Recursive cleanup looks up killers with the wrong key

The killers map is keyed by segment loadSpec killer type, not always by druid.storage.type. S3 and OSS use storage types s3 and oss but killer keys s3_zip and oss_zip, so this lookup throws during recursive cleanup for those configured stores instead of preserving the prior no-op behavior for unsupported recursive cleanup.

Ugh!

So this means that we are always forced to use the OmniDataSegmentKiller since S3DataSegmentKiller is never truly bound as a DataSegmentKiller.

The S3DataSegmentKiller is forced to use type s3_zip since the S3LoadSpec uses s3_zip. For backwards compat reading old segments, neither of these can change.

Workaround option would be to register S3DataSegmentKiller under the s3 key as well, which should be okay since the S3DataSegmentKiller does not keep any state, so it is fine to end up with 2 S3DataSegmentKiller, one bound to key s3, the other to s3_zip.

I think we can live with this for now since the killRecursively() method is not implemented for S3 or OSS anyway. Going to add some comments to that effect.

Perform cleanup of intermediary data safely

ea038d1

kfaraz requested a review from GWphua July 2, 2026 11:40

github-actions Bot added Area - Segment Format and Ser/De Area - Ingestion labels Jul 2, 2026

Minor fixes

d3f617d

FrankChen021 reviewed Jul 2, 2026

View reviewed changes

kfaraz added 3 commits July 2, 2026 19:51

Try optional dependency

29e972a

Fix tests

9be75a9

Fix test

9321736

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Perform cleanup of intermediary data safely#19650

fix: Perform cleanup of intermediary data safely#19650
kfaraz wants to merge 5 commits into
apache:masterfrom
kfaraz:safe_intermediary_cleanup

kfaraz commented Jul 2, 2026 •

edited

Loading

Uh oh!

FrankChen021 left a comment

Uh oh!

FrankChen021 Jul 2, 2026

Uh oh!

kfaraz Jul 2, 2026

Uh oh!

kfaraz Jul 2, 2026

Uh oh!

FrankChen021 Jul 3, 2026

Uh oh!

FrankChen021 Jul 2, 2026

Uh oh!

kfaraz Jul 2, 2026

Uh oh!

kfaraz Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kfaraz commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Bug

Changes

Uh oh!

FrankChen021 left a comment

Choose a reason for hiding this comment

Uh oh!

FrankChen021 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

kfaraz Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

kfaraz Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

FrankChen021 Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

FrankChen021 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

kfaraz Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

kfaraz Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kfaraz commented Jul 2, 2026 •

edited

Loading