Simplify google_cloud_ops_agent_engine CLI and systemd services to be OTel-only #2314

Merged: rafaelwestphal merged 3 commits into ops_agent_3.0 from westphalrafael/simplify-engine-cli on Jun 1, 2026

@rafaelwestphal
Contributor

Description

With Fluent Bit completely removed from Ops Agent 3.0, the config engine CLI only needs to generate OTel configurations. We have consolidated the config generation and systemd services into a single, unified startup orchestration:

  • Removed the -service flag from cmd/google_cloud_ops_agent_engine/main.go.
  • Simplified confgenerator.GenerateFilesFromConfig in confgenerator/files.go to always generate otel.yaml.
  • Enabled RuntimeDirectory, StateDirectory, and LogsDirectory in google-cloud-ops-agent.service (main unit).
  • Configured the main service unit to run the unified health checks and config generation on startup.
  • Simplified google-cloud-ops-agent-opentelemetry-collector.service to directly load the generated /run/google-cloud-ops-agent/otel.yaml configuration without running ExecStartPre.

Related issue

b/517494318

How has this been tested?

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@rafaelwestphal force-pushed the westphalrafael/simplify-engine-cli branch from 6d0de30 to 1b9834b on May 28, 2026 14:42
@rafaelwestphal changed the base branch from master to ops_agent_3.0 on May 28, 2026 14:43
@rafaelwestphal force-pushed the westphalrafael/simplify-engine-cli branch from 1b9834b to 1f387c4 on May 28, 2026 14:45
@rafaelwestphal marked this pull request as ready for review on May 28, 2026 14:48
@rafaelwestphal requested a review from jinghan-ma on May 28, 2026 14:49
@rafaelwestphal force-pushed the westphalrafael/simplify-engine-cli branch 3 times, most recently from 6e44c08 to a1dfa18, on May 28, 2026 15:33
Contributor

@jinghan-ma jinghan-ma left a comment

We should also get rid of things like services used in run_windows.go:191, since there's just going to be one service. But we can do that in a separate PR as well.

@rafaelwestphal force-pushed the westphalrafael/simplify-engine-cli branch 5 times, most recently from ea87d00 to 16ad5dd, on May 29, 2026 13:54
… OTel-only

This finishes simplification step #4. With Fluent Bit completely removed, the
config engine CLI only needs to generate OTel configurations. We have consolidated
the config generation and systemd services into a single, unified startup orchestration on Linux:

- Removed the -service flag from cmd/google_cloud_ops_agent_engine/main.go.
- Simplified confgenerator.GenerateFilesFromConfig to always generate otel.yaml.
- Deleted the obsolete google-cloud-ops-agent-opentelemetry-collector.service unit.
- Configured google-cloud-ops-agent.service (Type=simple) to validate the configuration,
  run health checks, and launch the OTel collector directly via ExecStart.
- Updated internal/healthchecks/ports_check.go to check if google-cloud-ops-agent is active.
- Aligned expected agent services, diagnostics paths, and systemctl calls in integration_test/agents/agents.go.
- Updated TestPortsAndAPIHealthChecks to write systemd overrides for google-cloud-ops-agent.service.d.

TAG=agy
BUG=b/517494318
CONV=a3aefa50-102a-4eb8-ac21-894088d8c5df
@rafaelwestphal force-pushed the westphalrafael/simplify-engine-cli branch from 16ad5dd to 0a30d9e on May 29, 2026 17:16
Use a dedicated http.Client with a 5-second timeout for the API and GCE
metadata checks instead of Go's default http.Get client, which has no timeout.
When egress traffic is denied by a firewall (as in TestNetworkHealthCheck), the
old client would hang indefinitely waiting for the TCP connection to be
established, so systemd's startup limit (90s) killed the main ExecStartPre
engine process and integration tests failed across all distros.

TAG=agy
BUG=b/517494318
…bility

Add t.Skip to TestPortsAndAPIHealthChecks and TestParsingFailureCheck
pending the future OTel self-log collection implementation (b/517541093).
Hardcode the standard directory path (/run/google-cloud-ops-agent) in the
consolidated systemd service unit's ExecStartPre and ExecStart fields,
restoring backward compatibility with older systemd versions, such as the one
on SLES 12 (v228), which do not dynamically inject env variables (b/517494318).

TAG=agy
@rafaelwestphal added the kokoro:force-run label (Forces kokoro to run integration tests on a CL) on May 29, 2026
@stackdriver-instrumentation-release removed the kokoro:force-run label (Forces kokoro to run integration tests on a CL) on May 29, 2026
@rafaelwestphal merged commit 4c85ed0 into ops_agent_3.0 on Jun 1, 2026
69 of 74 checks passed
@rafaelwestphal deleted the westphalrafael/simplify-engine-cli branch on June 1, 2026 01:23
3 participants