
FEAT: Add CATERPILLAR_FILE_PATH_WRITE context key#72

Closed
narayan-pattern wants to merge 1 commit into main from feat/file-path-write-context

Conversation

@narayan-pattern

Description

Adds a new built-in record context key, CATERPILLAR_FILE_PATH_WRITE, that exposes the full source path of a read file, not just its base name.

Why: Our ingestion pipelines need the destination filename to encode the entire source path, not just the leaf filename. Today the only built-in is CATERPILLAR_FILE_NAME_WRITE, which is SlugifyFileName(filepath.Base(path)) and therefore discards the directory structure. When we read nested folders with a recursive glob (e.g. s3://.../reportType=X/**/**.tsv), same-named files from different folders collide and overwrite each other on write, and the destination name loses its source provenance.
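To make the collision concrete, here is a minimal sketch in Go. It is not code from this PR: the helper names (`baseName`, `groupByKey`, `collisions`) are hypothetical, and the sample paths used with it are made up for illustration. It groups source paths by the destination name each would produce and reports names claimed by more than one source. It uses the standard `path` package rather than `filepath` so that forward-slash paths behave the same on every OS.

```go
package main

import (
	"path"
	"sort"
)

// baseName mimics keying a destination on the leaf filename only,
// which is what a base-name-only context key effectively does.
func baseName(src string) string {
	return path.Base(src)
}

// groupByKey maps each derived destination name to the source paths
// that would be written under that name.
func groupByKey(sources []string, key func(string) string) map[string][]string {
	groups := make(map[string][]string)
	for _, src := range sources {
		k := key(src)
		groups[k] = append(groups[k], src)
	}
	return groups
}

// collisions returns, in sorted order, every destination name that is
// claimed by more than one source path.
func collisions(groups map[string][]string) []string {
	var out []string
	for name, srcs := range groups {
		if len(srcs) > 1 {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}
```

With two hypothetical sources such as `reports/2026-06-09/report.tsv` and `reports/2026-06-10/report.tsv`, keying on `baseName` yields one colliding name, while keying on the full path yields none.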

What it does: Stores the full source path under CATERPILLAR_FILE_PATH_WRITE, run through the same textutil.SlugifyFileName as the existing key. Separators collapse to underscores (yielding a flat, single filename segment with no directories), the name is lowercased, and the extension is preserved.

Example:
Source: s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv
Result: s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv
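The transformation in the example above can be approximated with the following Go sketch. This is not the real `textutil.SlugifyFileName`, whose implementation is not shown in this PR. `slugify` is a hypothetical stand-in that assumes the behavior implied by the example: lowercase everything, replace each run of non-alphanumeric characters in the stem with a single underscore, trim leading and trailing underscores, and keep the extension.

```go
package main

import (
	"path"
	"strings"
	"unicode"
)

// slugify is a hypothetical approximation of textutil.SlugifyFileName,
// reconstructed only from the example in the PR description.
func slugify(name string) string {
	// Split off the extension of the final path element (e.g. ".tsv").
	rawExt := path.Ext(name)
	ext := strings.ToLower(rawExt)
	stem := strings.ToLower(strings.TrimSuffix(name, rawExt))

	var b strings.Builder
	prevUnderscore := false
	for _, r := range stem {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
			prevUnderscore = false
		} else if !prevUnderscore {
			// Collapse any run of separators or punctuation into one underscore.
			b.WriteByte('_')
			prevUnderscore = true
		}
	}
	// Drop underscores left at either end (e.g. from a trailing ")").
	return strings.Trim(b.String(), "_") + ext
}
```

Under these assumptions, `slugify` maps the source path in the example to the result shown, and `"Report 1.CSV"` to `"report_1.csv"`. Edge cases (Unicode letters, names without an extension, dotfiles) may differ in the real function.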

Usage:

- name: write_fixed_tsv
  type: file
  path: s3://{{ s3_path_env_based("pattern-dl-raw", "local") }}/ingestor-files/amazon/fba_inventory_planning/ds={{ ds }}/{{ context "CATERPILLAR_FILE_PATH_WRITE" }}
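As a rough illustration of how the `{{ context "..." }}` lookup in the path above might resolve, here is a sketch built on Go's standard `text/template`. This is an assumption, not the project's actual templating code: `renderPath` and the map-based record context are hypothetical, and the other placeholders in the usage snippet (`s3_path_env_based(...)`, `ds`) are presumed to be expanded by a separate step that is not modeled here.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderPath expands a destination path template against one record's
// context values. Hypothetical helper, for illustration only.
func renderPath(tmpl string, recordCtx map[string]string) (string, error) {
	funcs := template.FuncMap{
		// "context" looks up a key in the record context and fails
		// loudly if the key was never set by an upstream read.
		"context": func(key string) (string, error) {
			v, ok := recordCtx[key]
			if !ok {
				return "", fmt.Errorf("context key %q not set", key)
			}
			return v, nil
		},
	}
	t, err := template.New("path").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```

Because the stored value already carries the extension, a template like `dest/{{ context "CATERPILLAR_FILE_PATH_WRITE" }}` needs no suffix appended.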

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation and I have updated the documentation accordingly.
  • I have added tests to cover my changes.

Expose the full source path (slugified into a flat, slash-free filename
segment) as a record context value, alongside the existing
CATERPILLAR_FILE_NAME_WRITE (base name only).

Set wherever a read emits records: file (S3 + local), sftp download, and
archive (tar/zip) unpack. Value is textutil.SlugifyFileName(<full path>),
so separators collapse to underscores and the extension is preserved.

Lets a write destination encode the whole source path, which also avoids
same-name collisions when reading nested folders with a recursive glob.

Docs updated for the file, sftp, and archive tasks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 10, 2026 10:20
@narayan-pattern narayan-pattern requested a review from a team as a code owner June 10, 2026 10:20

Copilot AI left a comment

Contributor

Pull request overview

Adds a new built-in record context key, CATERPILLAR_FILE_PATH_WRITE, intended to preserve more source provenance (and avoid same-leaf-name collisions) by exposing a slugified version of the full source path for file-like sources.

Changes:

  • Introduces CATERPILLAR_FILE_PATH_WRITE as a new built-in context key.
  • Populates the new key when downloading/reading via file, sftp (download), and archive (unpack) tasks using textutil.SlugifyFileName.
  • Updates task READMEs to document the new context key and recommended usage to reduce overwrites.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| internal/pkg/pipeline/task/task.go | Adds the new CtxKeyFilePathWrite constant for the record context key. |
| internal/pkg/pipeline/task/file/file.go | Sets CATERPILLAR_FILE_PATH_WRITE for each read record (slugified full path). |
| internal/pkg/pipeline/task/sftp/operations.go | Sets CATERPILLAR_FILE_PATH_WRITE for each downloaded record (slugified remote path). |
| internal/pkg/pipeline/task/archive/zip.go | Sets CATERPILLAR_FILE_PATH_WRITE for each unpacked ZIP entry (slugified entry name/path). |
| internal/pkg/pipeline/task/archive/tar.go | Sets CATERPILLAR_FILE_PATH_WRITE for each unpacked TAR entry (slugified entry name/path). |
| internal/pkg/pipeline/task/file/README.md | Documents the new key and how it can be used for destination naming. |
| internal/pkg/pipeline/task/sftp/README.md | Documents the new key for SFTP download and recommends it to reduce collision overwrites. |
| internal/pkg/pipeline/task/archive/README.md | Documents the new key for archive-unpack entry paths. |
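The per-file changes summarized above all follow one pattern: at the point where a read emits a record, set the new key next to the existing one. A minimal sketch of that pattern follows. Only the key strings and the constant name `CtxKeyFilePathWrite` come from this PR; the name `CtxKeyFileNameWrite`, the map-based context, and `setFileContext` are assumptions, and the slug function is passed in as a parameter because the real `textutil.SlugifyFileName` is not reproduced here.

```go
package main

import "path"

const (
	// Constant name assumed; the key string is the existing built-in.
	CtxKeyFileNameWrite = "CATERPILLAR_FILE_NAME_WRITE"
	// Constant name taken from the review summary; its value is assumed
	// to be the key string itself.
	CtxKeyFilePathWrite = "CATERPILLAR_FILE_PATH_WRITE"
)

// setFileContext records both the slugified base name and the slugified
// full source path on a record's context (modeled here as a plain map).
func setFileContext(ctx map[string]string, sourcePath string, slug func(string) string) {
	ctx[CtxKeyFileNameWrite] = slug(path.Base(sourcePath))
	ctx[CtxKeyFilePathWrite] = slug(sourcePath)
}
```

Passing the slug function in keeps the sketch independent of the real implementation; in the project itself the call would presumably be to `textutil.SlugifyFileName` directly.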

In read mode, the sanitized base filename is stored in the record context under the key `CATERPILLAR_FILE_NAME_WRITE`. The stem is lowercased with non-alphanumeric characters replaced by underscores, while the extension is preserved and lowercased (e.g. `"Report 1.CSV"` → `"report_1.csv"`).

The full source path is also stored under `CATERPILLAR_FILE_PATH_WRITE`, sanitized with the same rules. Because slashes and other separators collapse to underscores, the value is a single flat filename segment with no directory structure (e.g. `s3://prod-bucket/reports/type=X/2026-06-10/sub/Report (1).tsv` → `s3_prod_bucket_reports_type_x_2026_06_10_sub_report_1.tsv`). Use it when you need a unique destination name that encodes the whole source path. The extension is preserved, so don't re-append it in the destination `path`.
**`task_concurrency` opens multiple connections.** Setting `task_concurrency` above `1` makes the task open one SSH connection per worker and transfer files in parallel. This is faster, but it increases memory use (more files in memory at once), and some servers limit the number of connections per user.

**Nested directories are not preserved.** Each file is identified by its base name only. If you download files from nested folders with a recursive glob (for example `/data/**/*.csv`), they all land in the single destination directory, and files with the same name overwrite each other. To keep a folder structure, use a separate `file → sftp` pair for each folder, with the target path set for each one.
**Nested directories are not preserved.** If you template the destination with `CATERPILLAR_FILE_NAME_WRITE`, each file is identified by its base name only: download files from nested folders with a recursive glob (for example `/data/**/*.csv`) and they all land in the single destination directory, where files with the same name overwrite each other. To avoid collisions, template with `CATERPILLAR_FILE_PATH_WRITE` instead — it encodes the full source path into the (flat) filename, so same-named files from different folders stay distinct. To keep an actual folder structure on the target, use a separate `file → sftp` pair for each folder, with the target path set for each one.
@narayan-pattern

Author

Closing this PR, as it will require more development effort.

2 participants