FEAT: Add CATERPILLAR_FILE_PATH_WRITE context key (DATA-8693)#77

Open
snehalahire-pattern wants to merge 4 commits into main from snehal/DATA-8693-file-path-context

Conversation

@snehalahire-pattern

Contributor

Summary

  • Adds a new record context key, CATERPILLAR_FILE_PATH_WRITE, populated by the file task in read mode alongside the existing CATERPILLAR_FILE_NAME_WRITE (which is unchanged for backward compatibility).
  • Adds a textutil.SlugifyFilePath helper that slugifies each path segment individually so the / hierarchy is preserved. URL schemes such as s3://bucket/ are stripped, and the final segment keeps its extension, e.g. s3://my-bucket/ReportType=A/Folder 1/data.CSV → reporttype_a/folder_1/data.csv.
  • Lets destination paths encode the full source hierarchy (.../ds={{ ds }}/{{ context "CATERPILLAR_FILE_PATH_WRITE" }}), avoiding same-name collisions when reading nested directories with a recursive glob (e.g. reportType=X/**/**.tsv).
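
To make the slugification rules above concrete, here is a minimal sketch of the behavior described in this summary. It is not the code in this PR: the lowercase function names are hypothetical stand-ins for the textutil helpers, and the handling of edge cases (empty segments, one underscore per replaced character, a URL with no path) is an assumption.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// nonAlnum matches any single character that is not a lowercase ASCII
// letter or digit. Assumption: each such character becomes one underscore.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]`)

// slugifySegment lowercases one path segment and replaces every
// non-alphanumeric character with an underscore.
func slugifySegment(s string) string {
	return nonAlnum.ReplaceAllString(strings.ToLower(s), "_")
}

// slugifyFileName slugifies the stem and keeps the extension, lowercased.
func slugifyFileName(name string) string {
	ext := path.Ext(name)
	stem := strings.TrimSuffix(name, ext)
	return slugifySegment(stem) + strings.ToLower(ext)
}

// stripURLScheme drops a leading "scheme://host-or-bucket/" prefix, if any.
func stripURLScheme(p string) string {
	i := strings.Index(p, "://")
	if i < 0 {
		return p
	}
	rest := p[i+len("://"):]
	j := strings.Index(rest, "/")
	if j < 0 {
		return "" // only a scheme and bucket, no path left
	}
	return rest[j+1:]
}

// slugifyFilePath slugifies each segment separately and rejoins with "/",
// so the directory hierarchy survives. Empty segments are skipped.
func slugifyFilePath(p string) string {
	var segments []string
	for _, s := range strings.Split(stripURLScheme(p), "/") {
		if s != "" {
			segments = append(segments, s)
		}
	}
	for i, s := range segments {
		if i == len(segments)-1 {
			segments[i] = slugifyFileName(s) // final segment keeps its extension
		} else {
			segments[i] = slugifySegment(s)
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(slugifyFilePath("s3://my-bucket/ReportType=A/Folder 1/data.CSV"))
	// reporttype_a/folder_1/data.csv
}
```

Slugifying per segment, rather than the whole string at once, is what keeps / out of the replacement set and lets the result be used directly as a relative destination path.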

ClickUp

DATA-8693

Note on overlapping work

A separate branch feat/file-path-write-context (commit ac18420) takes a different approach to the same ticket: it slugifies the full path into a single flat segment (collapsing / to _) and also wires the new key into the archive/sftp tasks. This PR preserves the directory hierarchy literally, per the ticket's "Path information preserves folder hierarchy" criterion, and is scoped to the file task. Reviewers should pick one approach before merging.

Test plan

  • go build ./... clean (verified locally)
  • go test ./... green (verified locally)
  • Run a pipeline reading reportType=X/**/**.tsv and writing to .../{{ context "CATERPILLAR_FILE_PATH_WRITE" }}; confirm files from different subdirectories no longer overwrite each other in the destination.
  • Confirm existing pipelines using CATERPILLAR_FILE_NAME_WRITE continue to behave identically.
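
The collision scenario in the test plan can be illustrated with a small, self-contained sketch. It assumes destination strings behave like Go text/template templates with ds and context available as functions; renderDestination, the out/ prefix, and the sample date are hypothetical and are not taken from the caterpillar codebase.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderDestination renders a destination template for one record.
// Assumption: "ds" and "context" are exposed as template functions.
func renderDestination(tmpl, ds string, recordCtx map[string]string) (string, error) {
	funcs := template.FuncMap{
		"ds":      func() string { return ds },
		"context": func(key string) string { return recordCtx[key] },
	}
	t, err := template.New("dest").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Two source files with the same base name in different subdirectories.
	records := []map[string]string{
		{"CATERPILLAR_FILE_NAME_WRITE": "data.tsv", "CATERPILLAR_FILE_PATH_WRITE": "reporttype_x/a/data.tsv"},
		{"CATERPILLAR_FILE_NAME_WRITE": "data.tsv", "CATERPILLAR_FILE_PATH_WRITE": "reporttype_x/b/data.tsv"},
	}
	byName := `out/ds={{ ds }}/{{ context "CATERPILLAR_FILE_NAME_WRITE" }}`
	byPath := `out/ds={{ ds }}/{{ context "CATERPILLAR_FILE_PATH_WRITE" }}`
	for _, rc := range records {
		n, _ := renderDestination(byName, "2026-06-23", rc)
		p, _ := renderDestination(byPath, "2026-06-23", rc)
		fmt.Println(n, "|", p)
	}
}
```

With the name-only key both records render to the same destination, while the path key yields one destination per source subdirectory.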

🤖 Generated with Claude Code

Expose the sanitized full source path as a record context value on the
file task's read mode, alongside the existing CATERPILLAR_FILE_NAME_WRITE
(base name only). Path segments are slugified individually with "/"
preserved between them, so the directory hierarchy survives and can be
used in destination paths to avoid same-name collisions when reading
nested folders with a recursive glob (e.g. reportType=X/**/**.tsv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
snehalahire-pattern requested a review from a team as a code owner June 23, 2026 16:48
Comment thread internal/pkg/pipeline/task/file/file.go Outdated
@@ -146,6 +146,7 @@ func (f *file) readFile(output chan<- *record.Record) error {
// Create a default record with context
rc := &record.Record{Context: ctx}
rc.SetContextValue(string(task.CtxKeyFileNameWrite), textutil.SlugifyFileName(filepath.Base(path)))

Contributor

Can we create a single variable for the value generated by textutil.SlugifyFileName and reuse it wherever needed? This would help avoid code duplication and improve maintainability.

Contributor Author

Mahesh Kamble (@ma-gk) SlugifyFileName is called only once here; the other call is SlugifyFilePath.
Did you mean that, or something else?

Comment thread internal/pkg/textutil/slugify.go Outdated
- Extract URL scheme stripping in SlugifyFilePath into a stripURLScheme helper.
- Hoist the slugified file name into a local variable in the file task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In read mode, two values are stored in each record's context:

- `CATERPILLAR_FILE_NAME_WRITE` — the sanitized base filename. The stem is lowercased with non-alphanumeric characters replaced by underscores, while the extension is preserved and lowercased (e.g. `"Report 1.CSV"` → `"report_1.csv"`).
- `CATERPILLAR_FILE_PATH_WRITE` — the sanitized full source path with directory hierarchy preserved. Each segment is slugified the same way; the final segment keeps its extension; URL schemes such as `s3://bucket/` are stripped (e.g. `s3://my-bucket/ReportType=A/Folder 1/data.CSV` → `reporttype_a/folder_1/data.csv`). Reference it in the destination of a downstream write task to avoid collisions when reading nested directories with a recursive glob.

Contributor

Do we have a use case or example where we would use this slugified file path?

4 participants