FEAT: Add CATERPILLAR_FILE_PATH_WRITE context key #72
Closed
narayan-pattern wants to merge 1 commit into
Expose the full source path (slugified into a flat, slash-free filename segment) as a record context value, alongside the existing CATERPILLAR_FILE_NAME_WRITE (base name only).

Set wherever a read emits records: file (S3 + local), sftp download, and archive (tar/zip) unpack. Value is textutil.SlugifyFileName(<full path>), so separators collapse to underscores and the extension is preserved.

Lets a write destination encode the whole source path, which also avoids same-name collisions when reading nested folders with a recursive glob.

Docs updated for the file, sftp, and archive tasks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Pull request overview
Adds a new built-in record context key, CATERPILLAR_FILE_PATH_WRITE, intended to preserve more source provenance (and avoid same-leaf-name collisions) by exposing a slugified version of the full source path for file-like sources.
Changes:
- Introduces `CATERPILLAR_FILE_PATH_WRITE` as a new built-in context key.
- Populates the new key when downloading/reading via `file`, `sftp` (download), and `archive` (unpack) tasks using `textutil.SlugifyFileName`.
- Updates task READMEs to document the new context key and recommended usage to reduce overwrites.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/pkg/pipeline/task/task.go | Adds the new CtxKeyFilePathWrite constant for the record context key. |
| internal/pkg/pipeline/task/file/file.go | Sets CATERPILLAR_FILE_PATH_WRITE for each read record (slugified full path). |
| internal/pkg/pipeline/task/sftp/operations.go | Sets CATERPILLAR_FILE_PATH_WRITE for each downloaded record (slugified remote path). |
| internal/pkg/pipeline/task/archive/zip.go | Sets CATERPILLAR_FILE_PATH_WRITE for each unpacked ZIP entry (slugified entry name/path). |
| internal/pkg/pipeline/task/archive/tar.go | Sets CATERPILLAR_FILE_PATH_WRITE for each unpacked TAR entry (slugified entry name/path). |
| internal/pkg/pipeline/task/file/README.md | Documents the new key and how it can be used for destination naming. |
| internal/pkg/pipeline/task/sftp/README.md | Documents the new key for SFTP download and recommends it to reduce collision overwrites. |
| internal/pkg/pipeline/task/archive/README.md | Documents the new key for archive-unpack entry paths. |
In read mode, the sanitized base filename is stored in the record context under the key `CATERPILLAR_FILE_NAME_WRITE`. The stem is lowercased with non-alphanumeric characters replaced by underscores, while the extension is preserved and lowercased (e.g. `"Report 1.CSV"` → `"report_1.csv"`).

The full source path is also stored under `CATERPILLAR_FILE_PATH_WRITE`, sanitized with the same rules. Because slashes and other separators collapse to underscores, the value is a single flat filename segment with no directory structure (e.g. `s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv` → `s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv`). Use it when you need a unique destination name that encodes the whole source path. The extension is preserved, so don't re-append it in the destination `path`.
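The sanitization rules described above can be illustrated with a minimal Go sketch. `slugifySketch` is a hypothetical stand-in and not the actual `textutil.SlugifyFileName`. It assumes that runs of non-alphanumeric characters collapse to a single underscore and that leading and trailing underscores are trimmed from the stem. Both assumptions are consistent with the two documented examples but are not confirmed by the source.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters other than lowercase ASCII letters and digits.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugifySketch is a hypothetical approximation of textutil.SlugifyFileName.
// Assumption: separator runs collapse to one underscore and the stem is trimmed.
func slugifySketch(name string) string {
	rawExt := path.Ext(name) // extension of the final path element, e.g. ".CSV"
	stem := strings.ToLower(strings.TrimSuffix(name, rawExt))
	stem = strings.Trim(nonAlnum.ReplaceAllString(stem, "_"), "_")
	return stem + strings.ToLower(rawExt)
}

func main() {
	// Base name only, as used for CATERPILLAR_FILE_NAME_WRITE.
	fmt.Println(slugifySketch("Report 1.CSV")) // prints "report_1.csv"

	// Full source path, as used for CATERPILLAR_FILE_PATH_WRITE.
	full := "s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv"
	fmt.Println(slugifySketch(full))
	// prints "s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv"
}
```

Because the extension survives slugification, a destination built from this value already ends in `.tsv` or `.csv`.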
**`task_concurrency` opens multiple connections.** Setting `task_concurrency` above `1` makes the task open one SSH connection per worker and transfer files in parallel. This is faster, but it increases memory use (more files in memory at once), and some servers limit the number of connections per user.

**Nested directories are not preserved.** Each file is identified by its base name only. If you download files from nested folders with a recursive glob (for example `/data/**/*.csv`), they all land in the single destination directory, and files with the same name overwrite each other. To keep a folder structure, use a separate `file → sftp` pair for each folder, with the target path set for each one.

**Nested directories are not preserved.** If you template the destination with `CATERPILLAR_FILE_NAME_WRITE`, each file is identified by its base name only: download files from nested folders with a recursive glob (for example `/data/**/*.csv`) and they all land in the single destination directory, where files with the same name overwrite each other. To avoid collisions, template with `CATERPILLAR_FILE_PATH_WRITE` instead; it encodes the full source path into the (flat) filename, so same-named files from different folders stay distinct. To keep an actual folder structure on the target, use a separate `file → sftp` pair for each folder, with the target path set for each one.
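The collision described above can be sketched in Go. Everything here is illustrative: `slugifySketch` and `contextKeys` are hypothetical helpers that approximate the documented behavior, and the two source paths are made up. The sketch only shows why a base-name key collides for same-named files in different folders while a full-path key does not.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugifySketch is a hypothetical approximation of textutil.SlugifyFileName.
func slugifySketch(name string) string {
	rawExt := path.Ext(name)
	stem := strings.ToLower(strings.TrimSuffix(name, rawExt))
	stem = strings.Trim(nonAlnum.ReplaceAllString(stem, "_"), "_")
	return stem + strings.ToLower(rawExt)
}

// contextKeys returns the values a read might store for one source path:
// nameKey mirrors CATERPILLAR_FILE_NAME_WRITE (slug of the base name),
// pathKey mirrors CATERPILLAR_FILE_PATH_WRITE (slug of the full path).
func contextKeys(src string) (nameKey, pathKey string) {
	return slugifySketch(path.Base(src)), slugifySketch(src)
}

func main() {
	// Hypothetical files matched by a recursive glob such as /data/**/*.csv.
	sources := []string{"/data/2026/a/report.csv", "/data/2026/b/report.csv"}

	byName := map[string]string{} // destination name -> source, keyed by base name
	byPath := map[string]string{} // destination name -> source, keyed by full path
	for _, src := range sources {
		nameKey, pathKey := contextKeys(src)
		byName[nameKey] = src // the second file overwrites the first
		byPath[pathKey] = src // each file keeps its own entry
	}
	fmt.Println(len(byName), len(byPath)) // prints "1 2"
}
```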
Author
Closing this PR, as it will require more development effort.
Description

Adds a new built-in record context key, `CATERPILLAR_FILE_PATH_WRITE`, that exposes the full source path of a read file, not just its base name.

Why: Our ingestion pipelines need the destination filename to encode the entire source path, not just the leaf filename. Today the only built-in is `CATERPILLAR_FILE_NAME_WRITE`, which is `SlugifyFileName(filepath.Base(path))`, so it discards the directory structure. When we read nested folders with a recursive glob (e.g. `s3://.../reportType=X/**/**.tsv`), same-named files from different folders collide and overwrite each other on write, and we lose source provenance in the destination name.

What it does: Stores the full source path under `CATERPILLAR_FILE_PATH_WRITE`, run through the same `textutil.SlugifyFileName` as the existing key. Separators collapse to underscores (flat, single filename segment, no directories), the value is lowercased, and the extension is preserved.

Example: `s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv` → `s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv`

Usage:
Types of changes
Checklist