Skip to content

fix(orchestrator): fail DB creation job on actual errors instead of silently succeeding#407

Merged
openshift-merge-bot[bot] merged 22 commits into
redhat-developer:mainfrom
Fortune-Ndlovu:RHDHBUGS-2577-helm-orchestrator-db-creation-job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
May 29, 2026
Merged

fix(orchestrator): fail DB creation job on actual errors instead of silently succeeding#407
openshift-merge-bot[bot] merged 22 commits into
redhat-developer:mainfrom
Fortune-Ndlovu:RHDHBUGS-2577-helm-orchestrator-db-creation-job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors

Conversation

@Fortune-Ndlovu
Copy link
Copy Markdown
Member

@Fortune-Ndlovu Fortune-Ndlovu commented May 19, 2026

Description of the change

The create-sonataflow-database Job was using || echo WARNING which swallowed all psql errors, causing the Job to always report success. This prevented Kubernetes from retrying via backoffLimit when there were real failures (e.g. wrong credentials).

This PR:

  • Replaces the blanket error suppression with proper handling that tolerates "database already exists" but exits non-zero on real failures
  • Uses a versioned Job name (<release>-create-sf-db-<chart-version>) instead of Helm hooks, ensuring ArgoCD compatibility and avoiding immutable field errors on upgrades
  • Makes backoffLimit, ttlSecondsAfterFinished, and activeDeadlineSeconds configurable via values.yaml

Breaking change (version 6.0.0)

The Job name changed from <release>-create-sonataflow-database to <release>-create-sf-db-<version>. On upgrade, the old Job remains as an orphan (harmless, cleaned up by TTL or manually) and the new versioned Job is created fresh.

Which issue(s) does this PR fix or relate to

https://redhat.atlassian.net/browse/RHDHBUGS-2577

How to test changes / Special notes to the reviewer

Note: The Job is part of the sonataflows.yaml template which is gated by orchestrator.enabled and serverlessLogicOperator.enabled. If your cluster does not have the SonataFlow CRDs, you can install the chart with orchestrator disabled to get PostgreSQL up, then apply the rendered Job manifest separately.

1. Fresh database creation

# Render and extract the Job manifest
helm template rhdh charts/backstage \
  --namespace rhdh \
  --set orchestrator.enabled=true \
  --set orchestrator.serverlessLogicOperator.enabled=true \
  --set global.clusterRouterBase=<your-cluster-apps-domain> \
  --show-only templates/sonataflows.yaml \
  | awk '/^apiVersion: batch\/v1$/,/^---/' | head -n -1 \
  | oc apply -n rhdh -f -

# Check the job completed
oc get job -n rhdh -l job-name=rhdh-create-sf-db-6-0-0

# Verify logs show "CREATE DATABASE"
oc logs -n rhdh -l job-name=rhdh-create-sf-db-6-0-0 -c psql

Expected: Job completes 1/1, logs show CREATE DATABASE.

2. Database already exists

# Delete the job and re-apply (database already exists from step 1)
oc delete job rhdh-create-sf-db-6-0-0 -n rhdh
# Re-apply the same Job manifest
helm template rhdh charts/backstage \
  --namespace rhdh \
  --set orchestrator.enabled=true \
  --set orchestrator.serverlessLogicOperator.enabled=true \
  --set global.clusterRouterBase=<your-cluster-apps-domain> \
  --show-only templates/sonataflows.yaml \
  | awk '/^apiVersion: batch\/v1$/,/^---/' | head -n -1 \
  | oc apply -n rhdh -f -

# Check logs
oc logs -n rhdh -l job-name=rhdh-create-sf-db-6-0-0 -c psql

Expected: Job completes 1/1, logs show Database 'sonataflow' already exists, skipping creation.

3. Failure and retry (wrong credentials)

# Delete the existing job
oc delete job rhdh-create-sf-db-6-0-0 -n rhdh

# Create a secret with a bad password
oc create secret generic bad-password -n rhdh --from-literal=password=wrong

# Render the Job and swap in the bad secret
helm template rhdh charts/backstage \
  --namespace rhdh \
  --set orchestrator.enabled=true \
  --set orchestrator.serverlessLogicOperator.enabled=true \
  --set global.clusterRouterBase=<your-cluster-apps-domain> \
  --show-only templates/sonataflows.yaml \
  | awk '/^apiVersion: batch\/v1$/,/^---/' | head -n -1 \
  | sed 's/rhdh-postgresql-svcbind-postgres/bad-password/' \
  | oc apply -n rhdh -f -

# Watch retries (backoffLimit: 2 = up to 2 retries, 3 pods total)
oc get pods -n rhdh -l job-name=rhdh-create-sf-db-6-0-0 -w

# Check logs
oc logs -n rhdh -l job-name=rhdh-create-sf-db-6-0-0 -c psql

# Verify job ultimately failed
oc get job rhdh-create-sf-db-6-0-0 -n rhdh

Expected: Job fails, retries twice (3 pods total), logs show ERROR: Failed to create database 'sonataflow'., job status shows Failed.

4. Schema validation of configurable values

# Reject negative backoffLimit
helm template rhdh charts/backstage \
  --set orchestrator.enabled=true \
  --set orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit=-1
# Expected: Error: minimum: got -1, want 0

# Reject 0 activeDeadlineSeconds
helm template rhdh charts/backstage \
  --set orchestrator.enabled=true \
  --set orchestrator.sonataflowPlatform.dbCreationJobActiveDeadlineSeconds=0
# Expected: Error: minimum: got 0, want 1

# Reject negative TTL
helm template rhdh charts/backstage \
  --set orchestrator.enabled=true \
  --set orchestrator.sonataflowPlatform.dbCreationJobTTLSecondsAfterFinished=-1
# Expected: Error: minimum: got -1, want 0

# Custom values render correctly
helm template rhdh charts/backstage \
  --set orchestrator.enabled=true \
  --set orchestrator.serverlessLogicOperator.enabled=true \
  --set orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit=5 \
  --set orchestrator.sonataflowPlatform.dbCreationJobTTLSecondsAfterFinished=600 \
  --set orchestrator.sonataflowPlatform.dbCreationJobActiveDeadlineSeconds=300 \
  --show-only templates/sonataflows.yaml \
  | grep -E "ttlSeconds|activeDeadline|backoffLimit"
# Expected: ttlSecondsAfterFinished: 600, activeDeadlineSeconds: 300, backoffLimit: 5

5. Upgrade from 5.11.1 to 6.0.0

# Install old version (main branch, v5.11.1) with the old-style Job present
# Then upgrade to 6.0.0
helm upgrade rhdh charts/backstage --namespace rhdh \
  --set global.clusterRouterBase=<your-cluster-apps-domain>

# Verify: no immutable field errors, old Job is orphaned, new Job created
oc get jobs -n rhdh

Expected: Upgrade succeeds without errors. Old rhdh-create-sonataflow-database Job remains as harmless orphan. New rhdh-create-sf-db-6-0-0 Job is created and completes successfully.

Checklist

  • For each Chart updated, version bumped in the corresponding Chart.yaml according to Semantic Versioning.
  • For each Chart updated, variables are documented in the values.yaml and added to the corresponding README.md. The pre-commit utility can be used to generate the necessary content. Run pre-commit run --all-files to run the hooks and then push any resulting changes. The pre-commit Workflow will enforce this and warn you if needed.
  • JSON Schema template updated and re-generated the raw schema via the pre-commit hook.
  • Tests pass using the Chart Testing tool and the ct lint command.
  • If you updated the orchestrator-infra chart, make sure the versions of the Knative CRDs are aligned with the versions of the CRDs installed by the OpenShift Serverless operators declared in the values.yaml file. See Installing Knative Eventing and Knative Serving CRDs for more details.

@Fortune-Ndlovu Fortune-Ndlovu requested a review from a team as a code owner May 19, 2026 07:17
@openshift-ci openshift-ci Bot requested review from subhashkhileri and zdrapela May 19, 2026 07:17
@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 19, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

Context used
✅ Tickets: RHDHBUGS-2577

Grey Divider


Action required

1. Wrong psql target DB ✓ Resolved 🐞 Bug ≡ Correctness
Description
In the external-DB branch of the create-sonataflow-database Job, the updated psql invocations no
longer pass -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}, so psql will
connect to its default database instead of the documented existing database. This can cause the job
to fail (and the existence-check query to run in the wrong session context) even when the external
DB configuration is otherwise correct.
Code

charts/backstage/templates/sonataflows.yaml[R201-202]

Relevance

⭐⭐⭐ High

PR #173 explicitly accepted using -d {{ externalDBName }} for external DB psql in create-db Job.

PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The external-DB job’s new psql commands omit any -d selection, while the chart documentation
explicitly defines externalDBName as the existing database to connect to for external DB setups;
the job then creates the sonataflow database on that server.

charts/backstage/templates/sonataflows.yaml[198-208]
charts/backstage/README.md[420-434]
charts/backstage/values.yaml[524-532]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The external-DB create-db job runs `psql` without specifying a database (`-d ...`). The chart’s README states `externalDBName` is the *existing* DB to connect to when using an external server, while the job creates the `sonataflow` DB.

### Issue Context
Without `-d`, `psql` connects to its default DB (often derived from the username/PGDATABASE), which may not exist or may not be the intended maintenance DB, causing the create step and/or the follow-up `SELECT 1 FROM pg_database ...` check to fail incorrectly.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[201-202]

### Suggested change
In the external-DB branch, add `-d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}` (or a safe default like `postgres` if you prefer) to **both**:
- the `CREATE DATABASE sonataflow;` `psql` call
- the `SELECT 1 FROM pg_database ...` `psql` call

This aligns behavior with the README guidance and avoids relying on `psql` defaults.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Schema blocks new value ✓ Resolved 🐞 Bug ≡ Correctness
Description
values.yaml introduces orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit, but
values.schema.json has additionalProperties: false for sonataflowPlatform and does not define
this field, so schema validation/linting will fail and users cannot configure it.
Code

charts/backstage/values.yaml[R560-561]

Relevance

⭐⭐⭐ High

Schema/README kept in sync with new values in past (PRs #173/#225/#143); mismatch likely rejected by
lint.

PR-#173
PR-#225
PR-#143

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The default values introduce dbCreationJobBackoffLimit, but the schema explicitly forbids unknown
keys under sonataflowPlatform and does not list the new property; the README values table for the
same section also lacks the new key.

charts/backstage/values.yaml[548-565]
charts/backstage/templates/sonataflows.yaml[186-212]
charts/backstage/values.schema.json[557-680]
charts/backstage/README.md[208-226]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new Helm value `orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit` was added and referenced by templates, but the chart’s `values.schema.json` does not define it while `sonataflowPlatform` is schema-locked (`additionalProperties: false`). This causes Helm schema validation / CI lint to fail and prevents users from setting the value.

## Issue Context
- `values.yaml` now contains the new key with a default.
- `templates/sonataflows.yaml` renders it into Job `backoffLimit`.
- `values.schema.json` defines `orchestrator.sonataflowPlatform` with `additionalProperties: false`, and the allowed `properties` list does not include `dbCreationJobBackoffLimit`.
- The generated `README.md` values table also does not list the new key, indicating docs/schema regeneration wasn’t done.

## Fix Focus Areas
- charts/backstage/values.schema.json[557-680]
- charts/backstage/values.yaml[548-565]
- charts/backstage/README.md[208-226]

### Implementation notes
- Add `dbCreationJobBackoffLimit` under `orchestrator.sonataflowPlatform.properties` with type `integer` (and ideally `minimum: 0`) and default `2`.
- Re-run the repo’s doc/schema generation (pre-commit hook) so `README.md` includes the new value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. TTL triggers GitOps reruns 🐞 Bug ☼ Reliability ⭐ New
Description
The create-sf-db Job sets ttlSecondsAfterFinished even though it is a normal, continuously-managed
manifest (not a hook), so in GitOps setups (e.g., ArgoCD) the Job can be deleted by TTL and then
recreated/re-executed on the next reconcile. This can lead to recurring “database already exists”
executions (or repeated attempts after a transient fix) and unnecessary load/noise.
Code

charts/backstage/templates/sonataflows.yaml[91]

Evidence
The Job is rendered as a standard resource and includes ttlSecondsAfterFinished; when TTL deletes
the Job, a GitOps reconciler managing the chart will observe the missing resource and recreate it,
re-triggering execution. The repo also documents ArgoCD managing charts, making this scenario
relevant.

charts/backstage/templates/sonataflows.yaml[84-93]
charts/backstage/values.yaml[536-541]
charts/backstage/README.md[212-216]
charts/orchestrator-software-templates-infra/docs/GitopsOperator.md[32-35]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ttlSecondsAfterFinished` is always rendered for the create-sf-db Job. When the chart is managed by a continuous reconciler (e.g., ArgoCD), TTL deletion makes the Job drift from desired state, so it will be recreated and re-run.

### Issue Context
This chart repo explicitly documents ArgoCD/GitOps usage. A Job that is part of the desired manifests should generally not self-delete unless you also prevent the reconciler from recreating it.

### Fix Focus Areas
- Render TTL only when explicitly enabled (e.g., omit the field when the value is `null` or `0`, treating that as “disabled”).
- Update defaults/docs/schema to support disabling TTL (allow `null` in schema, or document a disable convention and implement it in the template).

Fix focus references:
- charts/backstage/templates/sonataflows.yaml[84-93]
- charts/backstage/values.yaml[536-541]
- charts/backstage/values.schema.json[478-496]
- charts/backstage/values.schema.tmpl.json[571-589]
- charts/backstage/README.md[212-216]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. PR-specific doc committed ✓ Resolved 🐞 Bug ⚙ Maintainability
Description
The newly added pr-info.md is a large, ticket-specific and date-stamped troubleshooting/testing
write-up, which is likely to become stale and create long-term maintenance noise in the repo. It
appears intended for PR context rather than evergreen project documentation.
Code

pr-info.md[R1-4]

Relevance

⭐⭐⭐ High

Team often removes PR-temporary content and prefers evergreen docs (accepted in PRs #165, #183,
#280).

PR-#165
PR-#183
PR-#280

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The file content is explicitly scoped to a single Jira ticket/PR and includes date-stamped test
results, which makes it likely to become outdated as chart behavior and versions evolve.

pr-info.md[1-4]
pr-info.md[464-474]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`pr-info.md` is being added to the repository root as a 700+ line, ticket-specific and date-stamped explanation/testing guide. This kind of PR-context artifact tends to become stale quickly and adds maintenance overhead for future contributors.

## Issue Context
The file includes a Jira ticket reference and dated test results, indicating it is not evergreen documentation.

## Fix Focus Areas
- pr-info.md[1-4]
- pr-info.md[464-474]

## Suggested fix
- Prefer moving this content to the PR description / Jira ticket / internal wiki, or
- If the content is valuable long-term, extract the evergreen parts into the appropriate project docs location (e.g., `docs/`), and drop dated/PR-specific sections (like the dated test results table).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Implicit grep dependency 🐞 Bug ☼ Reliability
Description
The create-sonataflow-database Job’s new failure-handling path pipes psql output to grep -q 1,
but the Job image is configurable via orchestrator.sonataflowPlatform.createDBJobImage and may not
include grep, causing the Job to fail even when psql is present and the database already exists.
Code

charts/backstage/templates/sonataflows.yaml[195]

Relevance

⭐⭐ Medium

No prior review pattern about avoiding grep in configurable Job images; related create-db Job
changes accepted in #164/#173.

PR-#164
PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job’s fallback existence check explicitly invokes grep, and the Job container image is
templated from a user-configurable value, so there is no guarantee grep exists in the runtime
image.

charts/backstage/templates/sonataflows.yaml[145-213]
charts/backstage/values.yaml[532-538]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The create-db Job uses `| grep -q 1` to determine whether the `sonataflow` database already exists. Because `createDBJobImage` is user-overridable, users may supply an image that includes `psql` but not `grep`, causing a hard failure in the fallback path.

## Issue Context
The Job image is driven by `.Values.orchestrator.sonataflowPlatform.createDBJobImage`, which is explicitly configurable in `values.yaml`.

## Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[190-212]
- charts/backstage/values.yaml[532-537]

## Suggested change
Replace the `psql ... | grep -q 1` existence test with a shell-only check that does not rely on `grep`.

Example pattern:
```sh
if db_exists="$(psql ... -Atc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" 2>/dev/null)"; then
 if [ "$db_exists" = "1" ]; then
   echo "Database 'sonataflow' already exists, skipping creation."
 else
   echo "ERROR: Failed to create database 'sonataflow'."
   exit 1
 fi
else
 echo "ERROR: Failed to verify whether database 'sonataflow' exists."
 exit 1
fi
```
Apply this in both the upstream and external DB branches.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (1)
6. Job logs expire quickly ✓ Resolved 🐞 Bug ◔ Observability
Description
The create-sonataflow-database Job is now a Helm hook and also has ttlSecondsAfterFinished: 300,
so Kubernetes will garbage-collect the Job and its pods 5 minutes after it finishes (success or
failure), which can remove the primary failure evidence shortly after retries are exhausted. This is
especially risky now that the Job is expected to fail on real errors, because post-mortem debugging
may be time-boxed to a few minutes.
Code

charts/backstage/templates/sonataflows.yaml[R90-94]

Relevance

⭐⭐ Medium

No prior evidence on Job TTL/log retention; past sonataflows.yaml reviews focused on security/DB
wiring, not ttlSecondsAfterFinished.

PR-#166
PR-#173
PR-#235

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job is explicitly marked as a Helm hook and configured with a 300-second TTL, which instructs
Kubernetes to delete finished Jobs/pods after 5 minutes, limiting log availability after completion.

charts/backstage/templates/sonataflows.yaml[85-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ttlSecondsAfterFinished: 300` causes the DB creation Job and its pods to be deleted 5 minutes after completion (including failed completion), which can make it hard to retrieve logs/events when the hook fails after retries.

### Issue Context
This Job is a Helm hook (`post-install,post-upgrade`) and is now intended to fail on real errors; fast TTL cleanup reduces the time window to investigate failures.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[87-96]

### Suggested fix
- Prefer preserving failed Jobs for debugging:
 - Remove the hard-coded `ttlSecondsAfterFinished`, and instead set Helm hook deletion policy to clean up only on success (e.g., `before-hook-creation,hook-succeeded`).
 - Or make `ttlSecondsAfterFinished` configurable via values (and schema/README), with a safer default for troubleshooting (or document that centralized logging is expected).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit 56b08b4

Results up to commit 17a64a7


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Schema blocks new value ✓ Resolved 🐞 Bug ≡ Correctness
Description
values.yaml introduces orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit, but
values.schema.json has additionalProperties: false for sonataflowPlatform and does not define
this field, so schema validation/linting will fail and users cannot configure it.
Code

charts/backstage/values.yaml[R560-561]

Relevance

⭐⭐⭐ High

Schema/README kept in sync with new values in past (PRs #173/#225/#143); mismatch likely rejected by
lint.

PR-#173
PR-#225
PR-#143

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The default values introduce dbCreationJobBackoffLimit, but the schema explicitly forbids unknown
keys under sonataflowPlatform and does not list the new property; the README values table for the
same section also lacks the new key.

charts/backstage/values.yaml[548-565]
charts/backstage/templates/sonataflows.yaml[186-212]
charts/backstage/values.schema.json[557-680]
charts/backstage/README.md[208-226]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new Helm value `orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit` was added and referenced by templates, but the chart’s `values.schema.json` does not define it while `sonataflowPlatform` is schema-locked (`additionalProperties: false`). This causes Helm schema validation / CI lint to fail and prevents users from setting the value.

## Issue Context
- `values.yaml` now contains the new key with a default.
- `templates/sonataflows.yaml` renders it into Job `backoffLimit`.
- `values.schema.json` defines `orchestrator.sonataflowPlatform` with `additionalProperties: false`, and the allowed `properties` list does not include `dbCreationJobBackoffLimit`.
- The generated `README.md` values table also does not list the new key, indicating docs/schema regeneration wasn’t done.

## Fix Focus Areas
- charts/backstage/values.schema.json[557-680]
- charts/backstage/values.yaml[548-565]
- charts/backstage/README.md[208-226]

### Implementation notes
- Add `dbCreationJobBackoffLimit` under `orchestrator.sonataflowPlatform.properties` with type `integer` (and ideally `minimum: 0`) and default `2`.
- Re-run the repo’s doc/schema generation (pre-commit hook) so `README.md` includes the new value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 635fd18


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Wrong psql target DB ✓ Resolved 🐞 Bug ≡ Correctness
Description
In the external-DB branch of the create-sonataflow-database Job, the updated psql invocations no
longer pass -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}, so psql will
connect to its default database instead of the documented existing database. This can cause the job
to fail (and the existence-check query to run in the wrong session context) even when the external
DB configuration is otherwise correct.
Code

charts/backstage/templates/sonataflows.yaml[R201-202]

Relevance

⭐⭐⭐ High

PR #173 explicitly accepted using -d {{ externalDBName }} for external DB psql in create-db Job.

PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The external-DB job’s new psql commands omit any -d selection, while the chart documentation
explicitly defines externalDBName as the existing database to connect to for external DB setups;
the job then creates the sonataflow database on that server.

charts/backstage/templates/sonataflows.yaml[198-208]
charts/backstage/README.md[420-434]
charts/backstage/values.yaml[524-532]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The external-DB create-db job runs `psql` without specifying a database (`-d ...`). The chart’s README states `externalDBName` is the *existing* DB to connect to when using an external server, while the job creates the `sonataflow` DB.

### Issue Context
Without `-d`, `psql` connects to its default DB (often derived from the username/PGDATABASE), which may not exist or may not be the intended maintenance DB, causing the create step and/or the follow-up `SELECT 1 FROM pg_database ...` check to fail incorrectly.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[201-202]

### Suggested change
In the external-DB branch, add `-d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}` (or a safe default like `postgres` if you prefer) to **both**:
- the `CREATE DATABASE sonataflow;` `psql` call
- the `SELECT 1 FROM pg_database ...` `psql` call

This aligns behavior with the README guidance and avoids relying on `psql` defaults.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 13b5e0f


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Remediation recommended
1. Job logs expire quickly ✓ Resolved 🐞 Bug ◔ Observability
Description
The create-sonataflow-database Job is now a Helm hook and also has ttlSecondsAfterFinished: 300,
so Kubernetes will garbage-collect the Job and its pods 5 minutes after it finishes (success or
failure), which can remove the primary failure evidence shortly after retries are exhausted. This is
especially risky now that the Job is expected to fail on real errors, because post-mortem debugging
may be time-boxed to a few minutes.
Code

charts/backstage/templates/sonataflows.yaml[R90-94]

Relevance

⭐⭐ Medium

No prior evidence on Job TTL/log retention; past sonataflows.yaml reviews focused on security/DB
wiring, not ttlSecondsAfterFinished.

PR-#166
PR-#173
PR-#235

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job is explicitly marked as a Helm hook and configured with a 300-second TTL, which instructs
Kubernetes to delete finished Jobs/pods after 5 minutes, limiting log availability after completion.

charts/backstage/templates/sonataflows.yaml[85-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ttlSecondsAfterFinished: 300` causes the DB creation Job and its pods to be deleted 5 minutes after completion (including failed completion), which can make it hard to retrieve logs/events when the hook fails after retries.

### Issue Context
This Job is a Helm hook (`post-install,post-upgrade`) and is now intended to fail on real errors; fast TTL cleanup reduces the time window to investigate failures.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[87-96]

### Suggested fix
- Prefer preserving failed Jobs for debugging:
 - Remove the hard-coded `ttlSecondsAfterFinished`, and instead set Helm hook deletion policy to clean up only on success (e.g., `before-hook-creation,hook-succeeded`).
 - Or make `ttlSecondsAfterFinished` configurable via values (and schema/README), with a safer default for troubleshooting (or document that centralized logging is expected).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 41fbb3e


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Remediation recommended
1. PR-specific doc committed ✓ Resolved 🐞 Bug ⚙ Maintainability
Description
The newly added pr-info.md is a large, ticket-specific and date-stamped troubleshooting/testing
write-up, which is likely to become stale and create long-term maintenance noise in the repo. It
appears intended for PR context rather than evergreen project documentation.
Code

pr-info.md[R1-4]

Relevance

⭐⭐⭐ High

Team often removes PR-temporary content and prefers evergreen docs (accepted in PRs #165, #183,
#280).

PR-#165
PR-#183
PR-#280

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The file content is explicitly scoped to a single Jira ticket/PR and includes date-stamped test
results, which makes it likely to become outdated as chart behavior and versions evolve.

pr-info.md[1-4]
pr-info.md[464-474]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`pr-info.md` is being added to the repository root as a 700+ line, ticket-specific and date-stamped explanation/testing guide. This kind of PR-context artifact tends to become stale quickly and adds maintenance overhead for future contributors.

## Issue Context
The file includes a Jira ticket reference and dated test results, indicating it is not evergreen documentation.

## Fix Focus Areas
- pr-info.md[1-4]
- pr-info.md[464-474]

## Suggested fix
- Prefer moving this content to the PR description / Jira ticket / internal wiki, or
- If the content is valuable long-term, extract the evergreen parts into the appropriate project docs location (e.g., `docs/`), and drop dated/PR-specific sections (like the dated test results table).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 333fd0c


🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)


Remediation recommended
1. Implicit grep dependency 🐞 Bug ☼ Reliability
Description
The create-sonataflow-database Job’s new failure-handling path pipes psql output to grep -q 1,
but the Job image is configurable via orchestrator.sonataflowPlatform.createDBJobImage and may not
include grep, causing the Job to fail even when psql is present and the database already exists.
Code

charts/backstage/templates/sonataflows.yaml[195]

Relevance

⭐⭐ Medium

No prior review pattern about avoiding grep in configurable Job images; related create-db Job
changes accepted in #164/#173.

PR-#164
PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job’s fallback existence check explicitly invokes grep, and the Job container image is
templated from a user-configurable value, so there is no guarantee grep exists in the runtime
image.

charts/backstage/templates/sonataflows.yaml[145-213]
charts/backstage/values.yaml[532-538]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The create-db Job uses `| grep -q 1` to determine whether the `sonataflow` database already exists. Because `createDBJobImage` is user-overridable, users may supply an image that includes `psql` but not `grep`, causing a hard failure in the fallback path.

## Issue Context
The Job image is driven by `.Values.orchestrator.sonataflowPlatform.createDBJobImage`, which is explicitly configurable in `values.yaml`.

## Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[190-212]
- charts/backstage/values.yaml[532-537]

## Suggested change
Replace the `psql ... | grep -q 1` existence test with a shell-only check that does not rely on `grep`.

Example pattern:
```sh
if db_exists="$(psql ... -Atc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" 2>/dev/null)"; then
 if [ "$db_exists" = "1" ]; then
   echo "Database 'sonataflow' already exists, skipping creation."
 else
   echo "ERROR: Failed to create database 'sonataflow'."
   exit 1
 fi
else
 echo "ERROR: Failed to verify whether database 'sonataflow' exists."
 exit 1
fi
```
Apply this in both the upstream and external DB branches.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@rhdh-qodo-merge
Copy link
Copy Markdown

Review Summary by Qodo

Fix DB creation job error handling and make backoffLimit configurable

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Replace blanket error suppression with proper database creation error handling
• Distinguish between "database already exists" (acceptable) and real failures (exit non-zero)
• Make job backoffLimit configurable via dbCreationJobBackoffLimit in values.yaml
• Enable Kubernetes to properly retry on actual database creation failures
Diagram
flowchart LR
  A["DB Creation Job"] -->|"Old: || echo WARNING"| B["Always Succeeds"]
  A -->|"New: Check if exists"| C{"Database exists?"}
  C -->|"Yes"| D["Skip & Succeed"]
  C -->|"No"| E["Exit 1"]
  E -->|"Retry via backoffLimit"| A
Loading

Grey Divider

File Changes

1. charts/backstage/templates/sonataflows.yaml 🐞 Bug fix +19/-3

Implement proper DB creation error handling

• Replaced || echo WARNING error suppression with conditional logic that checks if database
 already exists
• Added proper error handling that exits with code 1 on real failures while tolerating existing
 databases
• Updated both upstream PostgreSQL and external database code paths with identical error handling
 logic
• Changed hardcoded backoffLimit: 2 to use configurable `{{
 .Values.orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit }}`

charts/backstage/templates/sonataflows.yaml


2. charts/backstage/values.yaml ⚙️ Configuration changes +2/-0

Add configurable backoffLimit for DB creation job

• Added new configuration parameter dbCreationJobBackoffLimit with default value of 2
• Documented the parameter as "Number of retries for the create-db job if it fails"
• Allows users to customize retry behavior for database creation job failures

charts/backstage/values.yaml


Grey Divider

Qodo Logo

@rhdh-qodo-merge rhdh-qodo-merge Bot added enhancement New feature or request bug_fix labels May 19, 2026
…job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
@rm3l
Copy link
Copy Markdown
Member

rm3l commented May 19, 2026

/cherry-pick release-1.10

@openshift-cherrypick-robot
Copy link
Copy Markdown

@rm3l: once the present PR merges, I will cherry-pick it on top of release-1.10 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/review

@rhdh-qodo-merge
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Warning

/review is deprecated. Use /agentic_review instead (removal date not yet scheduled).

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Validation

backoffLimit is now driven by orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit; it would be good to ensure values are validated to be non-negative (and possibly set a sensible upper bound) in the schema to avoid invalid Job specs or runaway retries.

  backoffLimit: {{ .Values.orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit }}
{{- end }}
📚 Focus areas based on broader codebase context

Possible Issue

The external-DB branch runs CREATE DATABASE sonataflow; while connected to -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}. If that database is missing/unreachable (common in fresh setups), the psql connection can fail before the CREATE DATABASE is executed, and the follow-up existence check uses the same potentially-nonexistent DB, making error classification unreliable. (Ref 1)

          - |
            psql -h {{ .Release.Name }}-postgresql{{- if eq .Values.upstream.postgresql.architecture "replication" }}-primary{{- end }} -p 5432 -U postgres -c 'CREATE DATABASE sonataflow;' 2>&1 || {
              if psql -h {{ .Release.Name }}-postgresql{{- if eq .Values.upstream.postgresql.architecture "replication" }}-primary{{- end }} -p 5432 -U postgres -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then
                echo "Database 'sonataflow' already exists, skipping creation."
              else
                echo "ERROR: Failed to create database 'sonataflow'."
                exit 1
              fi
            }
{{- else }}
        args:
          - |
            psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -c 'CREATE DATABASE sonataflow;' 2>&1 || {
              if psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then
                echo "Database 'sonataflow' already exists, skipping creation."
              else
                echo "ERROR: Failed to create database 'sonataflow'."
                exit 1
              fi
            }

Reference reasoning: The reference implementation connects to a known maintenance database (postgres) when checking for existence and creating the target DB, avoiding dependence on the target DB being present. Adopting that pattern (connect to postgres for both the check and CREATE DATABASE) makes the job’s retry/fail behavior deterministic across fresh installs and misconfigurations.

📄 References
  1. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-deps/sonataflow.yaml [99-105]
  2. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-deps/sonataflow.yaml [68-98]
  3. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-deps/sonataflow.yaml [107-139]
  4. redhat-developer/rhdh-chart/charts/backstage/ci/with-orchestrator-values.yaml [1-21]
  5. redhat-developer/rhdh-chart/charts/backstage/values.yaml [502-509]
  6. redhat-developer/rhdh-chart/charts/backstage/values.yaml [188-203]
  7. redhat-developer/rhdh/scripts/rhdh-openshift-setup/values.yaml [222-236]
  8. redhat-developer/rhdh-operator/bundle/rhdh/manifests/rhdh.redhat.com_backstages.yaml [277-291]

…json and values.schema.tmpl.json with minimum and maximum constraints. Update sonataflows.yaml to simplify database creation command.

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 19, 2026

Persistent review updated to latest commit 635fd18

…exists in PostgreSQL and matches the behavior of the internal branch

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 19, 2026

Persistent review updated to latest commit 8f6b80f

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

cc/ @rm3l for review

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
…pgrade compatibility

The CI "Test Latest Release" check fails because helm upgrade tries to patch the existing Job's spec.template, which Kubernetes rejects as immutable. The old chart created the Job without ttlSecondsAfterFinished, so it persists indefinitely and blocks the upgrade. Adding helm.sh/hook and helm.sh/hook-delete-policy annotations makes Helm delete the old Job before creating the new one on upgrade.
…job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 13b5e0f

Add hook-succeeded to the Helm hook delete policy so that successful. Jobs are cleaned up immediately while failed Jobs are kept for log   inspection. TTL still handles cleanup for ArgoCD users after 5 minutes.
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 41fbb3e

@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 333fd0c

Comment thread charts/backstage/templates/sonataflows.yaml Outdated
Comment thread charts/backstage/templates/sonataflows.yaml Outdated
Comment thread charts/backstage/Chart.yaml Outdated
Comment thread charts/backstage/templates/sonataflows.yaml Outdated
Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu Fortune-Ndlovu requested a review from rm3l May 28, 2026 12:36
Comment thread charts/backstage/templates/sonataflows.yaml Outdated
Comment thread charts/backstage/templates/sonataflows.yaml Outdated
…ce names and replace dots with dashes, expose ttlSecondsAfterFinished and activeDeadlineSeconds in values.yaml so users can configure them.

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@sonarqubecloud
Copy link
Copy Markdown

@Fortune-Ndlovu Fortune-Ndlovu requested a review from rm3l May 29, 2026 10:59
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 29, 2026

Code review by qodo was updated up to the latest commit 56b08b4

@openshift-ci openshift-ci Bot added the lgtm label May 29, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 0654311 into redhat-developer:main May 29, 2026
6 checks passed
@openshift-cherrypick-robot
Copy link
Copy Markdown

@rm3l: #407 failed to apply on top of branch "release-1.10":

Applying: fix(orchestrator): fail DB creation job on actual errors instead of silently succeeding
Applying: bump chart
Using index info to reconstruct a base tree...
M	charts/backstage/Chart.yaml
Falling back to patching base and 3-way merge...
Auto-merging charts/backstage/Chart.yaml
CONFLICT (content): Merge conflict in charts/backstage/Chart.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0002 bump chart

Details

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants