FEAT: Add CATERPILLAR_FILE_PATH_WRITE context key by narayan-pattern · Pull Request #72 · patterninc/caterpillar

narayan-pattern · 2026-06-10T10:20:32Z

Description

Adds a new built-in record context key, CATERPILLAR_FILE_PATH_WRITE, that exposes the full source path of a read file, not just its base name.

Why: Our ingestion pipelines need the destination filename to encode the entire source path, not just the leaf filename. Today the only built-in is CATERPILLAR_FILE_NAME_WRITE, which is SlugifyFileName(filepath.Base(path)) and therefore discards the directory structure. When we read nested folders with a recursive glob (e.g. s3://.../reportType=X/**/**.tsv), same-named files from different folders collide and overwrite each other on write, and we lose source provenance in the destination name.
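The collision described above can be reproduced in a few lines. This is a minimal sketch, not code from the PR: the bucket name, the paths, and the helpers `distinctNames`, `baseName`, and `fullPath` are all hypothetical. It only counts how many distinct destination names a set of source paths yields under each keying strategy.

```go
package main

import (
	"fmt"
	"path"
)

// baseName keys a file by its leaf name only, as a base-name-derived
// context value would (hypothetical helper for illustration).
func baseName(p string) string { return path.Base(p) }

// fullPath keys a file by its entire source path (hypothetical helper).
func fullPath(p string) string { return p }

// distinctNames returns how many distinct destination names result from
// applying key to every source path. Fewer names than paths means some
// files would overwrite each other on write.
func distinctNames(paths []string, key func(string) string) int {
	seen := make(map[string]struct{})
	for _, p := range paths {
		seen[key(p)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Hypothetical source paths matched by a recursive glob.
	paths := []string{
		"s3://example-bucket/reportType=X/2026-06-09/report.tsv",
		"s3://example-bucket/reportType=X/2026-06-10/report.tsv",
	}

	fmt.Println("distinct names by base name:", distinctNames(paths, baseName))
	fmt.Println("distinct names by full path:", distinctNames(paths, fullPath))
}
```

With these two paths the base-name key yields one name (so one file overwrites the other), while the full-path key yields two.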

What it does: Stores the full source path under CATERPILLAR_FILE_PATH_WRITE, run through the same textutil.SlugifyFileName as the existing key. Separators collapse to underscores, so the value is a single flat filename segment with no directories; the path is lowercased and the extension is preserved.

Example:
s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv
→ s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv
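For readers who want to see the transformation spelled out, here is a rough approximation of the rules described above. It is an assumption-laden sketch: `slugifySketch` is a hypothetical stand-in, not the actual `textutil.SlugifyFileName`, whose implementation is not shown in this PR. It assumes that runs of non-alphanumeric characters in the stem collapse to a single underscore, that leading and trailing underscores are trimmed, and that the extension is kept and lowercased.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters other than lowercase ASCII letters and digits.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugifySketch is a hypothetical approximation of the documented behaviour:
// lowercase the stem, replace each run of non-alphanumerics with "_",
// trim stray underscores, and keep the lowercased extension.
// It is NOT the real textutil.SlugifyFileName.
func slugifySketch(name string) string {
	ext := path.Ext(name) // extension of the last path element, e.g. ".tsv"
	stem := strings.ToLower(strings.TrimSuffix(name, ext))
	stem = strings.Trim(nonAlnum.ReplaceAllString(stem, "_"), "_")
	return stem + strings.ToLower(ext)
}

func main() {
	src := "s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv"

	// Base name only, analogous to CATERPILLAR_FILE_NAME_WRITE.
	fmt.Println(slugifySketch(path.Base(src)))

	// Full path, analogous to CATERPILLAR_FILE_PATH_WRITE.
	fmt.Println(slugifySketch(src))
}
```

Under these assumptions the sketch reproduces the example above for the full path and gives `report_1.tsv` for the base name alone.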

Usage:

- name: write_fixed_tsv
  type: file
  path: s3://{{ s3_path_env_based("pattern-dl-raw", "local") }}/ingestor-files/amazon/fba_inventory_planning/ds={{ ds }}/{{ context "CATERPILLAR_FILE_PATH_WRITE" }}
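To illustrate how a `{{ context "..." }}` lookup in a destination path could resolve, the sketch below renders a path with Go's standard `text/template` and a custom `context` function. This is an assumption for illustration only: the PR does not show how caterpillar evaluates templates, and `renderPath`, the `recordCtx` map, and the bucket name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderPath renders a destination path template, resolving
// {{ context "KEY" }} against the given record context.
// Hypothetical helper; caterpillar's real template handling may differ.
func renderPath(tmpl string, recordCtx map[string]string) (string, error) {
	funcs := template.FuncMap{
		// context returns the record context value for key ("" if absent).
		"context": func(key string) string { return recordCtx[key] },
	}
	t, err := template.New("path").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Record context as it might look after a read (values from the example above).
	recordCtx := map[string]string{
		"CATERPILLAR_FILE_NAME_WRITE": "report_1.tsv",
		"CATERPILLAR_FILE_PATH_WRITE": "s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv",
	}

	dest, err := renderPath(
		`s3://example-bucket/ingest/{{ context "CATERPILLAR_FILE_PATH_WRITE" }}`,
		recordCtx,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(dest)
}
```

Because the context value already ends in `.tsv`, the template does not append an extension of its own.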

Types of changes

Docs change / refactoring / dependency upgrade
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist

My code follows the code style of this project.
My change requires a change to the documentation and I have updated the documentation accordingly.
I have added tests to cover my changes.

Expose the full source path (slugified into a flat, slash-free filename segment) as a record context value, alongside the existing CATERPILLAR_FILE_NAME_WRITE (base name only).

Set wherever a read emits records: file (S3 + local), sftp download, and archive (tar/zip) unpack. Value is textutil.SlugifyFileName(<full path>), so separators collapse to underscores and the extension is preserved.

Lets a write destination encode the whole source path, which also avoids same-name collisions when reading nested folders with a recursive glob. Docs updated for the file, sftp, and archive tasks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new built-in record context key, CATERPILLAR_FILE_PATH_WRITE, intended to preserve more source provenance (and avoid same-leaf-name collisions) by exposing a slugified version of the full source path for file-like sources.

Changes:

Introduces CATERPILLAR_FILE_PATH_WRITE as a new built-in context key.
Populates the new key when downloading/reading via file, sftp (download), and archive (unpack) tasks using textutil.SlugifyFileName.
Updates task READMEs to document the new context key and recommended usage to reduce overwrites.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| internal/pkg/pipeline/task/task.go | Adds the new `CtxKeyFilePathWrite` constant for the record context key. |
| internal/pkg/pipeline/task/file/file.go | Sets `CATERPILLAR_FILE_PATH_WRITE` for each read record (slugified full path). |
| internal/pkg/pipeline/task/sftp/operations.go | Sets `CATERPILLAR_FILE_PATH_WRITE` for each downloaded record (slugified remote path). |
| internal/pkg/pipeline/task/archive/zip.go | Sets `CATERPILLAR_FILE_PATH_WRITE` for each unpacked ZIP entry (slugified entry name/path). |
| internal/pkg/pipeline/task/archive/tar.go | Sets `CATERPILLAR_FILE_PATH_WRITE` for each unpacked TAR entry (slugified entry name/path). |
| internal/pkg/pipeline/task/file/README.md | Documents the new key and how it can be used for destination naming. |
| internal/pkg/pipeline/task/sftp/README.md | Documents the new key for SFTP download and recommends it to reduce collision overwrites. |
| internal/pkg/pipeline/task/archive/README.md | Documents the new key for archive-unpack entry paths. |



 In read mode, the sanitized base filename is stored in the record context under the key `CATERPILLAR_FILE_NAME_WRITE`. The stem is lowercased with non-alphanumeric characters replaced by underscores, while the extension is preserved and lowercased (e.g. `"Report 1.CSV"` → `"report_1.csv"`).

+The full source path is also stored under `CATERPILLAR_FILE_PATH_WRITE`, sanitized with the same rules. Because slashes and other separators collapse to underscores, the value is a single flat filename segment with no directory structure (e.g. `s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv` → `s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv`). Use it when you need a unique destination name that encodes the whole source path. The extension is preserved, so don't re-append it in the destination `path`.


 **`task_concurrency` opens multiple connections.** Setting `task_concurrency` above `1` makes the task open one SSH connection per worker and transfer files in parallel. This is faster, but it increases memory use (more files in memory at once), and some servers limit the number of connections per user.

-**Nested directories are not preserved.** Each file is identified by its base name only. If you download files from nested folders with a recursive glob (for example `/data/**/*.csv`), they all land in the single destination directory, and files with the same name overwrite each other. To keep a folder structure, use a separate `file → sftp` pair for each folder, with the target path set for each one.
+**Nested directories are not preserved.** If you template the destination with `CATERPILLAR_FILE_NAME_WRITE`, each file is identified by its base name only: download files from nested folders with a recursive glob (for example `/data/**/*.csv`) and they all land in the single destination directory, where files with the same name overwrite each other. To avoid collisions, template with `CATERPILLAR_FILE_PATH_WRITE` instead — it encodes the full source path into the (flat) filename, so same-named files from different folders stay distinct. To keep an actual folder structure on the target, use a separate `file → sftp` pair for each folder, with the target path set for each one.


narayan-pattern · 2026-06-10T10:38:18Z

Closing this PR, as it will require more development effort.

Copilot AI review requested due to automatic review settings June 10, 2026 10:20

narayan-pattern requested a review from a team as a code owner June 10, 2026 10:20

Copilot started reviewing on behalf of narayan-pattern June 10, 2026 10:20

Copilot AI reviewed Jun 10, 2026

narayan-pattern closed this Jun 10, 2026

narayan-pattern wants to merge 1 commit into main from feat/file-path-write-context
