fix(ssrs): streamline test connection, stream reports, retry transient failures#27637
fix(ssrs): streamline test connection, stream reports, retry transient failures#27637
Conversation
…sient failures - Test connection's GetDashboards step no longer paginates every report; it now issues a single $top=1 probe to /Reports, matching the existing CheckAccess probe against /Folders. - SsrsClient.get_reports is now a generator that yields each page as it is fetched, so ingestion memory stays flat regardless of report count. - Session retries transient failures (connect errors, read timeouts, 5xx) up to 2 times with exponential backoff to survive flaky SSRS endpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR improves SSRS ingestion reliability and scalability by making the SSRS “reports” listing lazy/streamed, reducing test-connection cost, and adding automatic retries for transient HTTP failures.
Changes:
- Replace SSRS connection test “GetDashboards” step from a full
/Reportspagination to a$top=1probe viaSsrsClient.test_get_reports. - Convert
SsrsClient.get_reports()to stream results page-by-page (iterator) and updateSsrsSource.get_dashboards_list()to yield visible reports lazily. - Add
requestssession retry behavior (urllib3RetryviaHTTPAdapter) and extend unit/integration tests accordingly.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| ingestion/src/metadata/ingestion/source/dashboard/ssrs/client.py | Adds session-level retries and changes get_reports() to a streaming iterator; introduces test_get_reports() probe. |
| ingestion/src/metadata/ingestion/source/dashboard/ssrs/metadata.py | Changes dashboard listing to yield non-hidden reports lazily from the client iterator. |
| ingestion/src/metadata/ingestion/source/dashboard/ssrs/connection.py | Updates test-connection step mapping to use the lightweight test_get_reports() probe. |
| ingestion/tests/unit/topology/dashboard/test_ssrs.py | Updates tests for iterator-based APIs and adds a laziness test verifying only the first page is fetched initially. |
| ingestion/tests/integration/ssrs/test_metadata.py | Updates integration tests to materialize the iterator with list(...). |
| ingestion/tests/integration/connections/test_ssrs_connection.py | Adds integration coverage for test_get_reports() and for retry behavior using a flaky HTTP handler. |
| if verify_ssl is not None: | ||
| self.session.verify = verify_ssl | ||
| retry = Retry( | ||
| total=MAX_RETRIES, |
There was a problem hiding this comment.
Are we sure retries is hte reason we are getting timed out?
Code Review ✅ ApprovedStreamlines SSRS test connections, implements report streaming, and adds retry logic for transient failures. No issues found. OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
| CONNECT_TIMEOUT = 10 | ||
| READ_TIMEOUT = 120 | ||
| PAGE_SIZE = 100 | ||
| MAX_RETRIES = 2 | ||
| BACKOFF_FACTOR = 1 | ||
| RETRY_STATUS_CODES = (500, 502, 503, 504) |
There was a problem hiding this comment.
CONNECT_TIMEOUT/READ_TIMEOUT combined with MAX_RETRIES can exceed the 3-minute test-connection timeout used by SSRS (THREE_MIN=180s). Worst case for a single GET is roughly (CONNECT_TIMEOUT+READ_TIMEOUT)*(MAX_RETRIES+1) plus backoff, which can cause test_get_reports / connection testing to time out even though retries are still in progress. Consider lowering the probe timeouts (or making _get accept an override timeout for test probes), reducing retries for test probes, or increasing the SSRS test-connection timeout to keep the worst-case under the configured limit.
|
🟡 Playwright Results — all passed (10 flaky)✅ 3701 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 89 skipped
🟡 10 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
…t failures (#27637) * fix(ssrs): streamline test connection, stream reports, and retry transient failures - Test connection's GetDashboards step no longer paginates every report; it now issues a single $top=1 probe to /Reports, matching the existing CheckAccess probe against /Folders. - SsrsClient.get_reports is now a generator that yields each page as it is fetched, so ingestion memory stays flat regardless of report count. - Session retries transient failures (connect errors, read timeouts, 5xx) up to 2 times with exponential backoff to survive flaky SSRS endpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fixing handling of pagination * fix test --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
…t failures (#27637) * fix(ssrs): streamline test connection, stream reports, and retry transient failures - Test connection's GetDashboards step no longer paginates every report; it now issues a single $top=1 probe to /Reports, matching the existing CheckAccess probe against /Folders. - SsrsClient.get_reports is now a generator that yields each page as it is fetched, so ingestion memory stays flat regardless of report count. - Session retries transient failures (connect errors, read timeouts, 5xx) up to 2 times with exponential backoff to survive flaky SSRS endpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fixing handling of pagination * fix test --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
…t failures (#27637) * fix(ssrs): streamline test connection, stream reports, and retry transient failures - Test connection's GetDashboards step no longer paginates every report; it now issues a single $top=1 probe to /Reports, matching the existing CheckAccess probe against /Folders. - SsrsClient.get_reports is now a generator that yields each page as it is fetched, so ingestion memory stays flat regardless of report count. - Session retries transient failures (connect errors, read timeouts, 5xx) up to 2 times with exponential backoff to survive flaky SSRS endpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fixing handling of pagination * fix test --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
…t failures (open-metadata#27637) * fix(ssrs): streamline test connection, stream reports, and retry transient failures - Test connection's GetDashboards step no longer paginates every report; it now issues a single $top=1 probe to /Reports, matching the existing CheckAccess probe against /Folders. - SsrsClient.get_reports is now a generator that yields each page as it is fetched, so ingestion memory stays flat regardless of report count. - Session retries transient failures (connect errors, read timeouts, 5xx) up to 2 times with exponential backoff to survive flaky SSRS endpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fixing handling of pagination * fix test --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>


Summary
GetDashboardsstep was callingclient.get_reports()which walked every page of/Reports— on a large SSRS instance that's a huge, unnecessary cost. It now issues a single$top=1probe (client.test_get_reports), mirroring the existingCheckAccessprobe against/Folders.SsrsClient.get_reportsnow streams. Converted from list-accumulating to anIterator[SsrsReport]thatyield froms each page as it's fetched, andSsrsSource.get_dashboards_listyields non-hidden reports one at a time. Ingestion memory stays flat regardless of report count.HTTPAdapterwith aurllib3Retry(2 retries, exponential backoff, covers connect errors, read timeouts, 5xx). Motivated by a production trace where a single 30sReadTimeoutagainst/Reportsaborted the fetch and left the pipeline reporting success with 0 records.Test plan
tests/unit/topology/dashboard/test_ssrs.py) — 22 pass, includes newtest_get_reports_streams_lazilyasserting the generator only fetches the first page until the consumer advances.tests/integration/ssrs/,tests/integration/connections/test_ssrs_connection.py) — 10 pass, includes newtest_get_connection_test_get_reports,test_connection_bad_host_get_reports, andtest_get_reports_retries_transient_failures(flaky mock server returns 503 twice then 200; asserts exactly 3 requests).🤖 Generated with Claude Code