Skip to content

Parallelize SQL E2E and Integration tests#5584

Open
mikaelweave wants to merge 48 commits into
mainfrom
mikaelweave/e2e-parallelization
Open

Parallelize SQL E2E and Integration tests#5584
mikaelweave wants to merge 48 commits into
mainfrom
mikaelweave/e2e-parallelization

Conversation

@mikaelweave

@mikaelweave mikaelweave commented May 25, 2026

Copy link
Copy Markdown
Contributor

Description

  • Split PR E2E tests into smaller SQL/Cosmos shards so independent buckets can run in parallel.
  • Keep sensitive async/global-state tests in a terminal RequiresIsolation lane.
  • Split SQL/Cosmos integration tests into smaller shards and share the common integration test runner steps.
  • Keep E2E xUnit execution serial inside each shard; parallelism is only at the pipeline/job level.
  • Move BulkDelete back into normal SQL/Cosmos E2E shards now that the isolation issue has a separate fix.

Related issues

Addresses AB#49554.

Testing

  • Validated shard filters with local test discovery for SQL E2E and integration buckets.
  • Ran targeted R4 SQL integration SmartOnFhir tests locally after fixture reuse changes.
  • Iterated through ADO PR builds to verify pipeline expansion and fix YAML/runtime issues.
  • Current PR build validation is expected to confirm the final shard layout.

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint that it is merged (i.e. add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code) or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version unchanged.
  • When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
  • CI is green before merge Build Status
  • Review squash-merge requirements

Semver Change (docs)

Skip (pipeline/test-only change)

mikaelweave and others added 9 commits May 25, 2026 06:52
Split SQL E2E PR validation into opt-in category shards while keeping the shared template backward-compatible by default. Reindex and bulk update remain isolated jobs for the experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Route custom search parameter tests through an explicit SQL class filter so the shard is not empty, exclude those tests from the reindex shard, and serialize BulkDelete tests within an xUnit collection after the first experiment exposed concurrent 409 conflicts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add assembly-level xUnit collection behavior to match the E2E runner configuration intent and test whether BulkDelete conflicts are caused by in-job parallel execution under MTP.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move BulkDelete out of the parallel shard set and run it after the other SQL E2E shards plus Reindex and BulkUpdate to test whether its 409 conflicts are caused by concurrent async operation jobs in the same environment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an opt-in SQL integration sharding experiment for PR SQL stages. The shards split SQL integration tests into General, SmartSearch, Search, Reindex, and SqlSpecific buckets while preserving default behavior for other callers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log SMART integration fixture setup, resource upsert, and search-service timings so the PR pipeline can identify whether the slow SmartSearch shard is dominated by setup or specific query patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Seed SMART integration data once per version and data store so SmartOnFhir tests avoid rebuilding the expensive storage fixture for every test method. Use unique IDs for mutation tests while preserving update semantics, and dispose cached fixtures when the test process exits.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mark the 80K surrogate-id import stress test as ImportLongRunning and exclude that category from the default SQL E2E PR filter. CI explicitly passes its main category filter, so the long import coverage remains in CI while PR validation avoids the 20+ minute stress case.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the sharded SQL E2E Reindex job wait for the parallel E2E shards and BulkUpdate before it starts. BulkDelete remains terminal after Reindex, preserving its existing isolation while letting the next PR run test whether Reindex flakiness is caused by concurrent shard activity.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mikaelweave

Copy link
Copy Markdown
Contributor Author

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

mikaelweave and others added 3 commits May 27, 2026 17:25
Add a RequiresIsolation test category and tag SQL E2E classes that exercise global async operations: BulkDelete, BulkUpdate, and Reindex. Standard sharded E2E jobs exclude RequiresIsolation while a single terminal RequiresIsolation lane runs after the parallel shards.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only materialize the SmartSearch SQL integration shard for R4 and R4B so Stu3 and R5 do not run all-skipped SMART tests and fail with zero executed tests. The RequiresIsolation E2E lane remains dependent only on E2E shard jobs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split AAD cleanup/setup into a separate PR stage so application deployment no longer waits on AAD preparation. Test stages still depend on setupAad, and KeyVault NSP association waits for setupAad before using the test auth key vault. Also run the RequiresIsolation E2E lane in parallel for the next perf experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov-commenter

codecov-commenter commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.92%. Comparing base (9c3462d) to head (261dea8).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5584      +/-   ##
==========================================
- Coverage   77.59%   76.92%   -0.67%     
==========================================
  Files         997      997              
  Lines       36559    36598      +39     
  Branches     5549     5529      -20     
==========================================
- Hits        28368    28154     -214     
- Misses       6816     7080     +264     
+ Partials     1375     1364      -11     

see 35 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

mikaelweave and others added 3 commits May 27, 2026 20:58
Add integration-test xUnit runner settings that enable class-level collection parallelization, collapse PR SQL integration tests back to one job per version, and publish class-duration reports from SQL and Cosmos integration TRX output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Avoid publishing a second root runner config by writing the integration xUnit runner settings into extracted integration test binaries in the pipeline. Keep the class-duration reporting and single-job integration experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a pipeline-controlled E2E xUnit runner override and class-duration reporting for E2E jobs. PR SQL and Cosmos test stages opt into class-level test collection parallelization while CI retains the serial default.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mikaelweave

Copy link
Copy Markdown
Contributor Author

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines failed to run 1 pipeline(s).

mikaelweave and others added 2 commits May 28, 2026 06:34
Use a double-quoted artifact name so the replace expression's single-quoted arguments do not break YAML parsing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the RequiresIsolation experiment and return SQL E2E isolation to feature-category jobs: BulkUpdate, Reindex, and terminal BulkDelete. Re-enable PR SQL integration category sharding while keeping the E2E class-parallel experiment pipeline-controlled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mikaelweave mikaelweave changed the title Experiment with sharded SQL E2E PR tests [DRAFT] [Do not merge] Experiment with sharded SQL E2E PR tests May 28, 2026
mikaelweave and others added 7 commits May 28, 2026 10:25
Reintroduce RequiresIsolation for BulkDelete, BulkUpdate, and Reindex E2E tests. Regular SQL and Cosmos E2E lanes exclude those tests when class-level parallelism is enabled, and isolated lanes run after the broad E2E lanes to avoid async operation conflicts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Set succeededOrFailed conditions on SQL and Cosmos RequiresIsolation E2E jobs so they still run after dependent parallel E2E shards fail, giving complete isolation-lane signal in PR experiments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add optional Cosmos integration and E2E category sharding to match the SQL experiment. PR Cosmos stages now run integration and E2E category shards in parallel, keep SMART integration gated to R4/R4B, and run BulkDelete/Reindex in a RequiresIsolation E2E lane after the broad shards.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each SQL and Cosmos integration job (monolithic and sharded) previously
rebuilt the integration test projects via a redundant 'Build Integration
Test Projects' step, then ran 'dotnet test --no-build' against the
source-workspace build - ignoring the already-published artifacts
extracted into $(Agent.TempDirectory)/IntegrationTests (and the
xunit.runner.json class-parallelism config written into them).

Remove the per-job rebuild and instead run the extracted, pre-built test
DLL directly (mirroring e2e-tests.yml), passing the coverage and TRX
arguments. This eliminates duplicate compilation across parallel shards
and makes the class-parallelism config actually apply.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CosmosSpecific shard filtered only on FullyQualifiedName~CosmosDb.
That substring matches both Cosmos-only test classes and the "(CosmosDb)"
data-store suffix that the custom xUnit framework appends to every
DataStore.All test. As a result, Reindex (IndexAndReindex), Smart
(SmartOnFhir) and Search/SearchParameterStatus (Search) Cosmos variants
ran in BOTH their dedicated shard and CosmosSpecific - executing twice
and inflating CosmosSpecific.

Add the same category exclusions the SQL SqlSpecific shard already uses
so each categorized test runs only in its dedicated shard. CosmosSpecific
remains the catch-all for the remaining Cosmos variants (Export,
DataSourceValidation, uncategorized persistence tests, Cosmos-only
classes).

The Search and Reindex shards are never gated, so those categories stay
covered for every version; SmartOnFhir has no integration tests for
Stu3/R5 (only the R4/R4B-gated SmartSearch shard), so no tests are
dropped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Security analysis only consumes the deploy and nuget artifacts produced by
BuildArtifacts and runs its own RoslynAnalyzers build; it never uses unit-test
output. Removing the BuildUnitTests dependency lets the stage start as soon as
BuildArtifacts completes (about 14 minutes earlier in build 49614) and prevents
it from becoming the pipeline tail as the test stages get faster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SQL sharded integration job was the only test job variant missing the
'duration by class' report and its published artifact. SQL monolithic, Cosmos
monolithic, and Cosmos sharded all already emit one. This adds the matching
Write-TestClassDurationReport.ps1 step and artifact publish (keyed by shard
name) so per-class timing data is available for every integration shard,
matching the E2E and Cosmos jobs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

mikaelweave and others added 2 commits June 8, 2026 10:27
The sharded SQL integration job (SqlIntegrationTests_<shard>) was missing all
of its setup steps between the checkout and the test run: integration-setup,
integration-tests-extract, the parallelism config task, AzureKeyVault, the
Set Workload Identity Variables step, and the 'Run SQL Integration Tests' task
header (displayName/inputs/script: |). The orphaned test script body sat
directly after 'path: source', so YAML folded those lines into the checkout
step's 'path' plain scalar. That produced the confusing checkout failure:
'The path /mnt/vss/_work/1/source\n\Continue... is too long'.

Restored the missing steps to match the monolithic SQL integration job and the
Cosmos sharded integration job so each SQL shard checks out, extracts the
prebuilt integration test binaries, configures parallelism, sets workload
identity variables, and runs its filtered test slice.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The current PR run (build 49780) is failing multiple sharded E2E jobs at
'Publish test results' even when the E2E execution task succeeds with
--retry-failed-tests. Those jobs were missing the
AllowPtrToDetectTestRunRetryFiles variable, so PTR treated intermediate retry
TRX failures as final failures.

Add AllowPtrToDetectTestRunRetryFiles: true to:
- SQL sharded E2E jobs
- SQL RequiresIsolation E2E lane
- Cosmos monolithic E2E job
- Cosmos sharded E2E jobs
- Cosmos RequiresIsolation E2E lane

This aligns E2E jobs with existing SQL monolithic/Reindex patterns and allows
retry artifacts to be interpreted correctly during PublishTestResults.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mikaelweave mikaelweave added this to the FY26\Q4\2Wk\2Wk24 milestone Jun 8, 2026
@mikaelweave mikaelweave added Enhancement Enhancement on existing functionality. Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs No-PaaS-breaking-change No-ADR ADR not needed labels Jun 8, 2026
@mikaelweave mikaelweave changed the title Parallelize E2E / Integration Tests (and add Isolated Test After) Parallelize SQL E2E and Integration tests Jun 8, 2026
mikaelweave and others added 12 commits June 8, 2026 12:23
Make e2e-tests.yml fail fast when testFilter is missing instead of silently
composing a fallback from appServiceType/categoryFilter. This makes pipeline
misconfiguration explicit and prevents accidental broad test selection.

To keep behavior deterministic, update SQL and Cosmos E2E callers to always
pass an explicit testFilter for monolithic, sharded, reindex, bulk update,
and requires-isolation lanes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Simplify E2E shard wiring by defining full shard filters in e2eTestShards and
passing shard.testFilter directly to e2e-tests.yml. This removes categoryFilter
fields from shard definitions and eliminates runtime composition of
mainCategoryFilter + shard.categoryFilter at call sites.

This keeps behavior explicit and aligns with fail-fast filter validation:
misconfigured shards now fail due to missing/invalid explicit filters instead
of being silently recomposed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ADO does not permit template expressions like ${{ parameters.x }}
inside the default: block of type:object parameters. Inline the
mainCategoryFilter default value directly into each shard's
testFilter string to resolve the pipeline validation error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CustomSearch shard was filtering on Category=CustomSearch and
FullyQualifiedName~CustomSearchParamTests, but CustomSearchParamTests
is tagged Search/IndexAndReindex. That made the shard match zero tests
and fail with dotnet test exit code 8 in all SQL stages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TokenOverflowTests.cs (the only E2E class tagged Category=CustomSearch) was
present in the directory but not listed in the shared .projitems file, so it
was never compiled into the E2E test DLL. This caused the CustomSearch shard
to always get zero tests and fail with exit code 8.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use WaitForReindexStatus result directly instead of non-existent
CheckReindexAsync API on TestFhirClient.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Point SQL CustomSearch E2E shard at CustomSearchParamTests
(Category=Search + Category=IndexAndReindex) and remove
TokenOverflowTests from shared E2E projitems so the shard no
longer runs the flaky/legacy custom-search overflow test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s-run.yml

Replace ~80-line duplicated job bodies (enable parallelism, KeyVault,
set workload identity, run tests, publish results, duration/coverage
artifacts) in both run-sql-tests.yml and run-cosmos-tests.yml with a
shared integration-tests-run.yml step template, mirroring the existing
e2e-tests.yml pattern.

SQL vs Cosmos differences handled via a dataStore parameter:
- SQL: simple Set Workload Identity Variables (env vars only)
- Cosmos: also looks up the container app for DataStoreResourceId

Net: -294 lines (497 removed, 203 added across 3 files).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Quote UniqueResourceGroupName before using it in the cleanup call so unresolved pipeline-variable text is treated as plain string and not executed by PowerShell.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Prevent unresolved pipeline variable macros from being executed as PowerShell
subexpressions by using single-quoted assignment and reusing the local
resourceGroupName variable in create/log calls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the dynamic xunit.runner.json override and e2eParallelizeTestCollections plumbing. Pipeline-level sharding with RequiresIsolation exclusions is the safer parallelization boundary for E2E tests, while each shard keeps the repository's serial xUnit runner behavior.

Revert the TokenOverflowTests.cs local change back to main because the file is not included in the shared E2E projitems and should stay out of this experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove RequiresIsolation from BulkDeleteTests now that the team has a fix for its isolation issue. Add explicit SQL and Cosmos BulkDelete E2E shards so sharded PR validation still runs these tests outside the terminal isolation lane.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mikaelweave mikaelweave marked this pull request as ready for review June 9, 2026 20:20
@mikaelweave mikaelweave requested a review from a team as a code owner June 9, 2026 20:20
mikaelweave and others added 7 commits June 9, 2026 13:35
Add guardrails and documentation around integration test runner configuration, E2E isolation lanes, duration reporting, and security analysis dependencies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Set PublishTestResults to fail when integration test results include failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return integration tests to the DotNetCoreCLI project-based dotnet test path so MTP retry-aware PublishTestResults can merge retry TRX files correctly when coverage is enabled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Deploy a dedicated R4 SQL container app bound to a separate FHIRR4Iso
database on the same app SQL server. The sqlE2eTests_RequiresIsolation
lane now targets this isolation endpoint with dependsOn: [] so it runs
concurrently with the parallel E2E shards instead of in a serialized
terminal lane that waits for every shard.

- fhir-sql.bicep: parameterize sqlDatabaseName (default preserves FHIR<ver>)
- Provision-AcaDeploy.ps1: add -SqlDatabaseName, pass through to template
- provision-deploy.yml: add sqlDatabaseName pass-through param
- build-variables.yml: add R4 SQL iso container app + key vault names
- pr-pipeline.yml: add deployR4SqlIso stage; testR4Sql depends on it
- run-sql-tests.yml: rewire isolation lane to the iso endpoint when
  isolationContainerAppName is supplied; otherwise keep terminal-lane
  fallback (Stu3/R5 unchanged for now)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The deployR4SqlIso stage failed bicep validation because KeyVaultNameR4SqlIso
derived '<base>-r4-sql-iso' (25 chars), over Azure Key Vault's 24-char limit.
Use a compact 'sqi' (sql-iso) token so the iso vault matches the proven-safe
length of the main R4 SQL vault ('<base>-r4-sqi', 21 chars) while staying
distinct (each ACA app provisions its own vault).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the validated R4 SQL dual-endpoint isolation pattern (build 49872, green)
to the Stu3 and R5 SQL stages. Each adds a dedicated container app + dedicated
database (FHIRStu3Iso / FHIRR5Iso) on the shared app SQL server with its own key
vault, so the RequiresIsolation E2E lane runs concurrently with the parallel
shards instead of in a serialized terminal lane.

- build-variables.yml: add DeploymentEnvironmentNameSqlIso / R5SqlIso and
  KeyVaultNameSqlIso / R5SqlIso (compact 'sqi' token to stay under the 24-char
  Key Vault limit: base-sqi=base+4, base-r5-sqi=base+7).
- pr-pipeline.yml: add deployStu3SqlIso / deployR5SqlIso stages; wire each into
  its test stage dependsOn + isolationContainerAppName.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n E2E tests

Mirror the proven SQL isolation pattern for Cosmos: deploy a dedicated
'-ciso' container app (own Cosmos account) per Cosmos stage so the
cosmosE2eTests_RequiresIsolation lane runs concurrently with the parallel
shards instead of serializing behind them.

- build-variables.yml: add DeploymentEnvironmentName{,R4}CosmosIso +
  KeyVaultName{Cosmos,R4Cosmos}Iso (key vault names <=24 chars).
- run-cosmos-tests.yml: add isolationContainerAppName param; when set, the
  isolation job runs with dependsOn: [] against the iso endpoint; else falls
  back to the serialized terminal lane (unchanged behavior).
- pr-pipeline.yml: add deployStu3IsoCosmos + deployR4IsoCosmos stages and wire
  testStu3Cosmos + testR4Cosmos to depend on them and pass the iso app name.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs Enhancement Enhancement on existing functionality. No-ADR ADR not needed No-PaaS-breaking-change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants