Skip to content

Harden telemetry delivery and refine recipe metadata#2443

Open
bmehta001 wants to merge 34 commits into
mainfrom
bhamehta/fix-telemetry-deadlock
Open

Harden telemetry delivery and refine recipe metadata#2443
bmehta001 wants to merge 34 commits into
mainfrom
bhamehta/fix-telemetry-deadlock

Conversation

@bmehta001

@bmehta001 bmehta001 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Same-repository replacement for #2441 so Azure CI runs with the normal microsoft/Olive PR secret context.

Summary

  • harden Olive telemetry delivery and cache behavior for non-CI runs, including Windows file locking, cache-path handling, and unreadable flush preservation
  • keep CI telemetry ephemeral: CI emits only OliveRecipe and does not register local cache callbacks
  • prevent Docker workflows from double-counting OliveRecipe by suppressing the inner container recipe event and forwarding CI detection into workflow containers
  • refine OliveRecipe metadata so target fields are emitted only for explicit targets, host fields are tracked separately, and recipe/config telemetry captures redacted explicit overrides
  • replace opaque package-config hashing with redacted package_config_overrides, while keeping package_config out of recipe_hash
  • make recipe source/hash metadata deterministic by avoiding filesystem-dependent model/resource classification
  • add or update focused telemetry, workflow, and Docker tests and refresh the privacy documentation

Why

We want telemetry to be reliable under cache contention and offline retry, while keeping CI runs temporal and recipe-only. We also want the recipe signal to be useful for analyzing explicit overrides without mixing host/target semantics, relying on opaque hashes, or depending on local filesystem state.

Validation

  • lintrunner f -a -m main
  • python -m compileall olive\\telemetry\\constants.py olive\\telemetry\\telemetry.py olive\\workflows\\run\\run.py olive\\systems\\docker\\docker_system.py test\\test_telemetry.py test\\workflows\\test_workflow_run.py test\\systems\\docker\\test_docker_system.py
  • pytest test\\test_telemetry.py test\\workflows\\test_workflow_run.py test\\systems\\docker\\test_docker_system.py -q
  • confirmed .azure_pipelines and .github do not set OLIVE_DISABLE_TELEMETRY, disable_telemetry, or --disable_telemetry

Copilot AI review requested due to automatic review settings May 5, 2026 02:51

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends Olive's telemetry pipeline so workflow runs emit a new OliveRecipe signal with redacted recipe metadata, while also tightening cache handling and documenting the updated telemetry behavior. It fits into the workflow, CLI, and telemetry layers that capture how Olive runs recipes across local and CI environments.

Changes:

  • Adds recipe-result telemetry from workflow execution and CLI entry points, including recipe hashes, model/source metadata, host/target metadata, and redacted config/package override snapshots.
  • Reworks telemetry delivery details, including cache-path overrides, JSON-line cache storage, Windows file locking, unreadable flush preservation, and CI recipe-only filtering.
  • Adds focused workflow and telemetry tests, plus privacy-documentation updates.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test/workflows/test_workflow_run.py Adds coverage for recipe-result metadata emitted by workflow runs.
test/test_telemetry.py Adds tests for cache-path overrides, CI filtering, unreadable flush handling, and Windows file locking.
olive/workflows/run/run.py Builds and emits recipe telemetry metadata during workflow execution.
olive/telemetry/utils.py Reworks cross-platform exclusive file locking for telemetry cache access.
olive/telemetry/telemetry.py Adds OliveRecipe, CI filtering, and cache/flush behavior changes.
olive/telemetry/telemetry_extensions.py Adds a helper for logging recipe results.
olive/telemetry/library/telemetry_logger.py Adds locking around singleton logger setup and configurable service name.
olive/telemetry/library/options.py Extends exporter options with service_name.
olive/telemetry/constants.py Updates the encoded telemetry connection string.
olive/cli/run.py Passes workflow-run recipe metadata into telemetry.
olive/cli/base.py Passes generated-CLI recipe metadata from higher-level commands.
docs/Privacy.md Updates the privacy text for the new recipe telemetry behavior.

Comment thread olive/workflows/run/run.py Outdated
Comment thread olive/telemetry/telemetry.py Outdated
Comment thread olive/telemetry/telemetry.py
Comment thread olive/workflows/run/run.py Outdated
Comment thread olive/workflows/run/run.py Outdated
Comment thread docs/Privacy.md Outdated
@bmehta001 bmehta001 self-assigned this May 5, 2026
@bmehta001

Copy link
Copy Markdown
Contributor Author

/AzurePipelines help

@azure-pipelines

Copy link
Copy Markdown
Supported commands
  • help:
    • Get descriptions, examples and documentation about supported commands
    • Example: help "command_name"
  • list:
    • List all pipelines for this repository using a comment.
    • Example: "list"
  • run:
    • Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
    • Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
  • where:
    • Report back the Azure DevOps orgs that are related to this repository and org
    • Example: "where"

See additional documentation.

@bmehta001

Copy link
Copy Markdown
Contributor Author

/AzurePipelines run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@bmehta001 bmehta001 force-pushed the bhamehta/fix-telemetry-deadlock branch from e47e0ce to befbb08 Compare May 9, 2026 03:47
@bmehta001 bmehta001 requested a review from shaahji May 11, 2026 18:15

@shaahji shaahji left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Have you considered using a local db solution like sqllite3 or tinydb for storing data? These might have a lot of added benefits compared to holding information in a file on disk.
  • Olive config paths can contain pathlib.Path values. Default json package doesn't like these when invoked with calls to json.dump or json.dumps. You might want to check for those.

Comment thread olive/cli/run.py Outdated
Comment on lines +58 to +59
run_config_input = self.args.run_config
run_config = run_config_input

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want a deepcopy here

Comment on lines +44 to +47
with cls._singleton_lock:
if cls._instance is None:
cls._instance = super().__new__(cls)
cls._instance._initialize(options)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth considering an instance per thread rather than dealing with multi-threading/multi-process issues in all the telemetry APIs. Most of Olive is single-threaded so race-conditions would be limited in APIs but once we start collecting more granular telemetry, this might become an issue.

if connection_string:
options = OneCollectorExporterOptions(connection_string=connection_string)
cls._default_logger = cls(options=options)
with cls._singleton_lock:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are protecting two separate resources cls._instance and cls._default_logger with the same synchronization primitive. These are two completely independent resources with absolutely no dependencies. Sharing the primitive is creating an unnecessary one.

Comment on lines +182 to +184
def get_telemetry_logger(
connection_string: Optional[str] = None, service_name: Optional[str] = None
) -> TelemetryLogger:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a class method as well since all it is accessing is other class variables and methods.

Comment thread olive/telemetry/telemetry.py Outdated
Comment on lines +143 to +146
# Single condition protects all shared state: _shutdown, _is_flushing,
# _callbacks_item_count, _events_logged. Using one lock eliminates
# lock ordering issues that arise with separate locks.
self._condition = threading.Condition()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case of multi-processing? Does each process get its own instance of Telemetry?

Comment thread olive/telemetry/utils.py Outdated
from typing import Optional
from typing import ClassVar

if os.name == "nt":

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth considering using established packages. Here's one - https://pypi.org/project/filelock/

Comment on lines +43 to +46
"model_path",
"_name_or_path",
"adapter_path",
"user_script",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be conditionally redacted. If using a hf model or data from local hf cache, it would be worth collecting the model names.

return None


def _build_package_config_overrides(config_input: Any) -> Optional[str]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to run this for invocation. Use it only if the user provided an override on the command line.

Comment on lines +295 to +296
or PureWindowsPath(identifier).is_absolute()
or PurePosixPath(identifier).is_absolute()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input path don't have to be absolute. Local directory paths i.e. relative to your current working directory are allowed.

or PureWindowsPath(identifier).resolve().exists()
or PurePosixPath(identifier).resolve().exists()

Comment thread olive/telemetry/recipe_telemetry.py Outdated
model_metadata = _extract_input_model_metadata(run_config_json["input_model"])
target_metadata = _extract_target_metadata(run_config)
host_metadata = _extract_host_metadata(run_config)
pass_types = [pass_config.type for pass_config in _get_used_passes_configs(run_config)]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Iteration can be folded into the call to _get_used_passes_configs

@bmehta001 bmehta001 force-pushed the bhamehta/fix-telemetry-deadlock branch from f8d1701 to 4a91dda Compare May 22, 2026 15:00
bmehta001 added a commit that referenced this pull request May 26, 2026
Address review findings on PR #2443:

- Track per-flush replay failures so the flush file is preserved when any replayed event was rejected, instead of being silently deleted alongside the cached events.

- Remove the unreachable OSError retry/backoff in _write_payload_to_cache: the exclusive file lock is blocking, so the retry tier never fired.

- Replace the polling shutdown wait with a condition-variable wait_until_flush_complete helper and notify_all when _is_flushing clears.

- Add focused tests covering replay success deletes the flush file, replay failure restores it, callback timeout restores it, and the new wait helper wakes on notify and honors its timeout.

Files changed:
- olive/telemetry/telemetry.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bmehta001 and others added 17 commits June 10, 2026 23:04
Replace separate _lock and _callback_condition with a single
_condition (threading.Condition) that protects all shared state:
_shutdown, _is_flushing, _callbacks_item_count, _events_logged.

The two-lock design had a lock order inversion: on_payload_transmitted
acquired _lock then _callback_condition, while wait_for_callbacks
acquired _callback_condition then _lock (via is_flushing). This could
deadlock under concurrent flush + callback scenarios.

Using one lock eliminates the ordering issue entirely and simplifies
the code — the on_payload_transmitted callback no longer needs nested
lock acquisition.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wait_for_callbacks: Hold _condition lock continuously from condition
check through wait() to prevent missing notifications. Previously the
lock was released between checking and waiting, allowing notify_all()
to fire in the gap.

TelemetryLogger: Add RLock to __new__ and get_default_logger to
prevent race conditions when multiple threads create the singleton
simultaneously. Uses RLock because get_default_logger calls __new__
which both need the same lock.

Files changed:
- olive/telemetry/telemetry.py
- olive/telemetry/library/telemetry_logger.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Remove base64 encoding from cache: write raw JSON lines instead.
   The base64 layer added 33% size overhead and prevented human
   inspection during debugging with no security benefit (cache dir
   is user-owned). Cache file is now directly readable.

2. Simplify cache flush: remove the .flush file rename dance
   (atomic rename, restore on failure, stale file cleanup). For ~5
   events/session, simple lock-read-send-delete is sufficient. On
   failure the cache file persists for next retry.

3. Simplify Telemetry singleton: remove double-checked locking in
   __new__. The lock is cheap and called once at startup; the outer
   check saved a lock acquisition but added complexity.

Net: -83 lines.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 1: Remove duplicate callback count increment in
on_payload_transmitted when _is_flushing is true. The count was
incremented both in the early-return path and in the finally block,
causing wait_for_callbacks to think events completed before they
actually did. Now only the finally block increments (which always
runs, even on return).

Fix 2: Replace flush_path.replace(cache_path) with new
_restore_flush_file() method that appends flush entries into the
cache file instead of overwriting it. This prevents losing events
written by another process while a flush was in progress.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Emit a single OliveRecipe event per workflow so recipe usage and failures can be measured without double-counting nested action failures. Keep CI in recipe-only mode, make Olive app naming explicit, update the telemetry ingestion key, and harden the cache/locking path that the new telemetry depends on.

Files changed:
- docs/Privacy.md
- olive/cli/base.py
- olive/cli/run.py
- olive/telemetry/constants.py
- olive/telemetry/library/options.py
- olive/telemetry/library/telemetry_logger.py
- olive/telemetry/telemetry.py
- olive/telemetry/telemetry_extensions.py
- olive/telemetry/utils.py
- olive/workflows/run/run.py
- test/test_telemetry.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace optional runtime imports with importlib-based lookups so the recent telemetry changes stay lint-clean without adding new noqa markers. Keep the focused telemetry tests import-sorted and ready for CI.

Files changed:
- olive/cli/base.py
- olive/workflows/run/run.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep exporter options focused on transport/export concerns and move service.name defaults into the logger/resource layer where they belong. This keeps Olive's explicit app name override separate from the shared logger fallback and removes unnecessary plumbing.

Files changed:
- olive/telemetry/library/options.py
- olive/telemetry/library/telemetry_logger.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep Olive''s explicit service-name override in the logger path, but restore the previous shared-library fallback and compatibility behavior so this cleanup does not broaden unrelated API or default changes.

Files changed:
- olive/telemetry/library/options.py
- olive/telemetry/library/telemetry_logger.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Trim the branch back to telemetry behavior and the minimal plumbing needed to support it. Restore the original local-import patterns in CLI and workflow code, keep only targeted lint suppressions where those restored lines are required, and simplify the telemetry logger app-name plumbing without changing the feature behavior.

Files changed:
- olive/cli/base.py
- olive/cli/run.py
- olive/telemetry/library/telemetry_logger.py
- olive/workflows/run/run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix the correctness and security issues raised on PR #2441 by handling empty telemetry cache-dir overrides safely, preserving non-empty unreadable flush files instead of deleting them, restoring the legacy .json.flush naming pattern, handling non-pathlike recipe config inputs without masking the original error, and cleaning up the Windows ctypes import pattern for CodeQL.

Files changed:
- olive/telemetry/telemetry.py
- olive/telemetry/utils.py
- olive/workflows/run/run.py
- test/test_telemetry.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drop the explicit service.name wiring test from test/test_telemetry.py since it is mostly implementation-detail coverage and does not protect the higher-value telemetry behavior changes on this branch.

Files changed:
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve the remaining github-advanced-security findings by removing unused PLC0415 noqa markers, keeping the optional-import behavior via minimal import cleanup, and updating the telemetry tests to satisfy the protected-access and consider-using-with lint comments without changing the tested behavior.

Files changed:
- olive/cli/base.py
- olive/cli/run.py
- olive/workflows/run/run.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove dead base64 cache helpers from olive/telemetry/utils.py and move exception formatting next to the telemetry extension code that actually uses it. Keep the Win32 locking and cache-dir logic intact while reducing unrelated utility clutter.

Files changed:
- olive/telemetry/utils.py
- olive/telemetry/telemetry_extensions.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute the CI environment flag once during Telemetry initialization and reuse it for recipe-only gating and heartbeat suppression instead of calling the check twice back-to-back.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep target telemetry fields only for explicitly configured targets, add separate host fields, remove the ambiguous input model name hash, and add redacted config-overrides plus package-config hash metadata so recipe telemetry can show which overrides users actually provide without folding environment-specific package config into recipe_hash.

Files changed:
- docs/Privacy.md
- olive/telemetry/telemetry.py
- olive/workflows/run/run.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log redacted package-config overrides instead of an opaque hash so Olive telemetry captures the specific package settings users changed, while still excluding package_config from recipe_hash and avoiding raw module-path leakage.

Files changed:
- docs/Privacy.md
- olive/telemetry/telemetry.py
- olive/workflows/run/run.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skip Hugging Face and Docker logins when PR secrets are unavailable or unresolved so Azure unit-test jobs do not fail before tests start on fork-style runs, while still preserving normal login behavior when valid credentials are present.

Files changed:
- .azure_pipelines/job_templates/build-docker-image-template.yaml
- .azure_pipelines/job_templates/huggingface-login-template.yaml
- .azure_pipelines/job_templates/olive-test-linux-gpu-template.yaml
- .azure_pipelines/scripts/run_test.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bmehta001 and others added 12 commits June 10, 2026 23:06
Revert the Azure pipeline login guard changes because the pipeline behavior should match main and those changes are not necessary for the telemetry PR.

Files changed:
- .azure_pipelines/job_templates/build-docker-image-template.yaml
- .azure_pipelines/job_templates/huggingface-login-template.yaml
- .azure_pipelines/job_templates/olive-test-linux-gpu-template.yaml
- .azure_pipelines/scripts/run_test.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Avoid duplicate OliveRecipe telemetry from Docker workflows by suppressing inner workflow recipe events and forwarding CI detection into workflow containers. Keep CI telemetry ephemeral by skipping cache setup, and make recipe metadata stable by avoiding filesystem-sensitive model/resource classification.

Files changed:

- docs/Privacy.md

- olive/systems/docker/docker_system.py

- olive/telemetry/constants.py

- olive/telemetry/telemetry.py

- olive/workflows/run/run.py

- test/systems/docker/test_docker_system.py

- test/test_telemetry.py

- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep CLI workflow imports patchable so existing CLI tests and command mocks still intercept workflow execution. Update CLI test expectations for recipe telemetry metadata, and make the CI-sensitive workflow telemetry assertion deterministic.

Files changed:

- olive/cli/base.py

- olive/cli/run.py

- test/cli/test_cli.py

- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep OliveRecipe focused on recipe outcome metadata and use the existing log_error path for workflow exceptions. This avoids duplicating exception fields on recipe events while preserving detailed formatted exception messages in error telemetry.

Files changed:

- olive/telemetry/telemetry_extensions.py

- olive/telemetry/telemetry.py

- olive/workflows/run/run.py

- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep optional and workflow-heavy imports lazy so generated CLI commands preserve existing import behavior while still reporting recipe telemetry.

Files changed:
- olive/cli/base.py
- olive/cli/run.py
- olive/workflows/run/run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use an explicit run() parameter for the inner Docker workflow runner instead of an environment variable so only the parent workflow emits the recipe result event.

Files changed:
- olive/workflows/run/run.py
- olive/systems/docker/workflow_runner.py
- olive/systems/docker/docker_system.py
- olive/telemetry/constants.py
- test/workflows/test_workflow_run.py
- test/systems/docker/test_docker_system.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore local imports for Docker token lookup and telemetry file locking to avoid unnecessary module-level import churn.

Files changed:
- olive/systems/docker/docker_system.py
- olive/telemetry/utils.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep olive/workflows/run/run.py focused on workflow execution by moving recipe metadata classification, sanitization, and hashing helpers into a dedicated workflow telemetry module.

Files changed:
- olive/workflows/run/run.py
- olive/workflows/run/recipe_telemetry.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep workflow recipe metadata helpers with the telemetry code while leaving telemetry_extensions focused on generic event logging APIs.

Files changed:
- olive/telemetry/recipe_telemetry.py
- olive/workflows/run/run.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tighten singleton locking, simplify CI/cache helpers, clarify per-process telemetry behavior, and update privacy wording for the default host metadata.

Files changed:
- docs/Privacy.md
- olive/telemetry/telemetry.py
- olive/telemetry/library/telemetry_logger.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Avoid treating the full workflow config as a telemetry override and keep recipe metadata deterministic and privacy-preserving across Path and model source inputs.

Files changed:
- olive/cli/run.py
- olive/telemetry/recipe_telemetry.py
- olive/workflows/run/run.py
- test/cli/test_cli.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review findings on PR #2443:

- Track per-flush replay failures so the flush file is preserved when any replayed event was rejected, instead of being silently deleted alongside the cached events.

- Remove the unreachable OSError retry/backoff in _write_payload_to_cache: the exclusive file lock is blocking, so the retry tier never fired.

- Replace the polling shutdown wait with a condition-variable wait_until_flush_complete helper and notify_all when _is_flushing clears.

- Add focused tests covering replay success deletes the flush file, replay failure restores it, callback timeout restores it, and the new wait helper wakes on notify and honors its timeout.

Files changed:
- olive/telemetry/telemetry.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 force-pushed the bhamehta/fix-telemetry-deadlock branch from 23eb28c to 35d3552 Compare June 11, 2026 04:21
bmehta001 and others added 5 commits June 26, 2026 03:10
Replace the OpenTelemetry/OneCollector cache-file telemetry with a
standard-library-only pipeline that mirrors onnxruntime-genai, so both
projects share one design and Olive carries no telemetry dependencies.

Pipeline:
- Events serialize to Common Schema JSON and are written to a durable
  per-app SQLite queue (offline_store.py); a background daemon uploader
  (uploader.py) drains it to OneCollector over urllib, deleting on 2xx,
  dropping poison 4xx, and retaining transient 5xx/network failures for
  the next cycle or run. Durability removes any exit-time flush.
- A single-drainer advisory lock (process_lock.py) makes concurrent
  Olive processes sharing the database safe: only one drains at a time.

Three-state opt-out (device counting keeps working when users opt out,
but automated pipelines stay silent):
- CI/testing: recipe-only mode is preserved (OliveRecipe still sent), but
  no device-id heartbeat and no action/error events.
- User opt-out (OLIVE_DISABLE_TELEMETRY=1, also set by the CLI
  --disable-telemetry flag): send only the device-id heartbeat; suppress
  all detailed events (no store, no uploader). Opt-out + CI sends nothing.
- Enabled: heartbeat plus all events.

The heartbeat is a direct best-effort POST on a daemon thread, bypassing
the durable store, so an opt-out run uploads only the heartbeat and never
drains previously queued detailed events. ALLOWED_KEYS whitelist
filtering is shared by the store and heartbeat paths via _build_payload.

Remove the now-dead OneCollector exporter, retry helper, telemetry_logger
and the cache-file LockFileEx machinery in utils.py (superseded by SQLite
+ process_lock). Drop opentelemetry-sdk from requirements. Rewrite the
telemetry tests for the SQLite model and the three-state semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address findings from a multi-specialist (privacy/correctness) review of
the telemetry migration.

Correctness:
- Guard Telemetry.__init__ with the class lock (now an RLock) and set
  _initialized inside it. Previously the body ran without the lock, so two
  threads whose first Telemetry() calls interleaved could both execute it,
  creating two uploaders and sending two device-id heartbeats (inflating
  the MAD/DAD count this feature measures) plus an orphaned uploader
  thread. Mirrors the onnxruntime-genai implementation.
- The action decorator resolves invoked_from/action_name inside
  try/except so instrumentation (incl. inspect.stack()) cannot propagate
  into the wrapped call.

Privacy:
- _format_exception_message now splits each traceback frame into physical
  lines and redacts every line via a new _redact_paths helper. Previously
  only the File line's filename was trimmed, so an absolute path on the
  offending source line (or the exception message) leaked a username into
  OliveError. _redact_paths matches Windows, UNC, and POSIX paths and
  drops path tails that are directories/usernames, keeping only a real
  filename.

Docs:
- Privacy.md corrected after the stdlib migration: drop the now-false
  "uses the OpenTelemetry API" claim and the no-longer-honored
  OLIVE_TELEMETRY_CACHE_DIR override, and disclose that a minimal
  device-id/OS heartbeat is still sent on opt-out outside CI/CD.

Add regression tests for path redaction (26 total).

Files changed:
- olive/telemetry/telemetry.py
- olive/telemetry/telemetry_extensions.py
- docs/Privacy.md
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the onnxruntime-genai change: write the device-id heartbeat to the
durable SQLite queue instead of a fire-and-forget direct POST, so the
uploader retries it until delivery and devices are not undercounted when
offline at startup or on short-lived runs.

The queue is created on every non-CI run (including opt-out); detailed events
are still recorded only when enabled, so opted-out users contribute only the
heartbeat. Opt-out combined with CI records and sends nothing; CI alone stays
recipe-only with no heartbeat. Removes the separate direct-transport path.

Tests assert on sent event names (the heartbeat now batches through the
uploader) rather than a distinct item_count==1 send.

Files changed:
- olive/telemetry/telemetry.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
So the same analytics query works across both products, give the
OliveHeartbeat the same flat device/OS field names as the GenAI heartbeat
(which is modeled on Foundry Local's DeviceIdEvent): id_status ->
device_id_status, and nested os.{name,version,release,arch} ->
os/os_version/os_release/os_arch. Update ALLOWED_KEYS, the heartbeat
builder, and the corresponding test.

Files changed:
- olive/telemetry/telemetry.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The session-scoped autouse disable_telemetry fixture constructs Telemetry()
to turn telemetry off for the suite. Now that the device-id heartbeat is
durable, merely constructing it enqueues a heartbeat to the real on-disk
store and the uploader attempts to send it -- a test run was writing
olive_telemetry.db under the real user profile.

Redirect the telemetry base dir to a throwaway pytest tmp dir and stub the
HTTP transport for the session so tests never touch the real store or the
network. Verified: the real store is no longer created by a test run.

Files changed:
- test/conftest.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Server responded with a non-2xx status (4xx/5xx): not retried here.
try:
http_err.read()
except Exception:
def _initialize(self) -> None:
try:
os.makedirs(os.path.dirname(self._db_path), exist_ok=True)
except Exception:
try:
self._conn.executemany("DELETE FROM events WHERE id=?", [(i,) for i in ids])
self._conn.commit()
except Exception:
if self._conn is not None:
try:
self._conn.close()
except Exception:
"""

import os
from typing import Optional
try:
fh.seek(0)
msvcrt.locking(fh.fileno(), msvcrt.LK_UNLCK, 1)
except Exception:

try:
fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
except Exception:
finally:
try:
fh.close()
except Exception:
Comment thread test/conftest.py
# throwaway directory and stub the HTTP transport so no test run writes to
# the real telemetry store or reaches the network.
import olive.telemetry.library.transport as transport_module
import olive.telemetry.telemetry as telemetry_module
Comment thread test/test_telemetry.py
import pytest

import olive.telemetry.library.transport as transport_mod
import olive.telemetry.telemetry as tmod

@github-advanced-security github-advanced-security AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants