Skip to content

Prototype: phased shutdown for MTP (#5345)#8580

Draft
Evangelink wants to merge 1 commit into
mainfrom
dev/amauryleve/mtp-phased-shutdown
Draft

Prototype: phased shutdown for MTP (#5345)#8580
Evangelink wants to merge 1 commit into
mainfrom
dev/amauryleve/mtp-phased-shutdown

Conversation

@Evangelink
Copy link
Copy Markdown
Member

Summary

Prototype of a two-phase graceful shutdown for Microsoft.Testing.Platform, exploring the design discussed in #5345 (see this comment).

Status: draft / RFC — opened for early design feedback. CLI options and most consumer migrations are intentionally deferred.

Motivation

Today MTP exposes a single ITestApplicationCancellationTokenSource.CancellationToken. On Ctrl+C, every consumer (tests, extensions, hosts) receives the same signal at the same time, and there is no contract distinguishing "please wind down gracefully" from "stop right now". This conflates the two shutdown modes that virtually every other host-style runtime treats as distinct phases (systemd EXTEND_TIMEOUT_USEC, macOS NSTerminateLater, AWS Lambda Extensions SHUTDOWN, Kubernetes preStop / terminationGracePeriodSeconds, docker compose 2-press SIGTERM→SIGKILL, Vitest teardownTimeout, …).

What's in this PR

A working prototype of Design A from the offline analysis:

Concept Implementation
Two cancellation tokens DrainingToken (graceful) and AbortingToken (forceful) on the internal ITestApplicationCancellationTokenSource
Back-compat Existing CancellationToken is preserved as an alias of DrainingToken so the ~14 current consumers keep working unchanged
Abort() API Explicit programmatic escalation entry point
State machine Running → Draining → Aborting using Interlocked.CompareExchange for atomic phase transitions
Ctrl+C UX (matches docker compose / kubectl / npm) 1st press → Draining (+ 30s grace timer); 2nd press → Aborting (+ 10s abort-timeout FailFast safety net); 3rd press → not intercepted, runtime kills the process
Auto-escalation Grace-period timer escalates Draining → Aborting if drain takes too long
Last-resort Abort-timeout uses IEnvironment.FailFast (banned Environment.FailFast replaced with the injectable wrapper)

Files

  • src/Platform/Microsoft.Testing.Platform/Services/ITestApplicationCancellationTokenSource.cs — extended interface
  • src/Platform/Microsoft.Testing.Platform/Services/CTRLPlusCCancellationTokenSource.cs — full rewrite
  • test/UnitTests/Microsoft.Testing.Platform.UnitTests/Hosts/CommonHostTests.cs — inline mock updated
  • test/UnitTests/Microsoft.Testing.Platform.UnitTests/Services/CTRLPlusCCancellationTokenSourceTests.cs — 6 new unit tests

Verification

  • Builds clean on net8.0 / net9.0 / netstandard2.0
  • 6/6 new tests pass on net9.0
  • 6/6 new tests pass on net462 (confirms no DIM/runtime issues for the MSTest.TestAdapter consumer of the netstandard2.0 build)
  • 24/24 adjacent CommonHost / Cancellation / StopPolic* tests still pass — no regressions

Intentionally deferred (follow-ups)

To keep this PR reviewable, the following are explicitly NOT in scope and will land as separate PRs once the core direction is approved:

  • CLI options --shutdown-grace-period and --shutdown-abort-timeout via PlatformCommandLineProvider + HelpInfoTests acceptance assertions
  • Environment-variable propagation to TestHostOrchestratorHost / TestHostControllersTestHost controlled children
  • TerminalOutputDevice UX line: "Cancelling test session… (press Ctrl+C again to force quit)"
  • Migrating existing token consumers from the legacy alias onto explicit DrainingToken / AbortingToken usage
  • Coordinating with the HotReload extension's own CancelKeyPress handler
  • Design B (IShutdownParticipant ack/extend protocol) — separate RFC

Open design questions

  1. Should defaults be 30s/10s or come from HostOptions.ShutdownTimeout + a new AbortTimeout?
  2. Should the IEnvironment injection be threaded explicitly from TestHostBuilder.CommonServices now, or stay defaulted to SystemEnvironment?
  3. Are we happy with letting the 3rd Ctrl+C reach the runtime, or should we still Process.Kill ourselves?

Full RFC (motivation, prior-art table, rollout plan, alternatives) is available in my session notes and will be posted as a comment on #5345 once the prototype direction is acknowledged.

Refs #5345

Introduce a two-phase shutdown model for Microsoft.Testing.Platform so
test sessions get a deterministic drain window before being aborted:

- Extend internal ITestApplicationCancellationTokenSource with
  DrainingToken (graceful cancel) and AbortingToken (forceful abort),
  plus an Abort() entry point. The existing CancellationToken is kept
  as a back-compat alias for DrainingToken so current consumers keep
  observing graceful cancellation without changes.
- Rewrite CTRLPlusCCancellationTokenSource with a Running -> Draining
  -> Aborting state machine:
    * 1st Ctrl+C enters Draining, starts a 30s grace timer that
      escalates to Aborting on elapse.
    * 2nd Ctrl+C escalates to Aborting immediately, starts a 10s
      abort-timeout safety net that FailFasts via IEnvironment.
    * 3rd Ctrl+C is no longer intercepted - the runtime terminates
      the process (matches docker compose / kubectl / npm UX).
- Add 6 unit tests covering initial state, single/idempotent cancel,
  abort, grace-period escalation, and zero-grace escalation. Passes
  on net9.0 and net462.

CLI options (--shutdown-grace-period, --shutdown-abort-timeout), env-
var propagation to controlled hosts, TerminalOutputDevice UX updates,
and migration of existing token consumers are intentionally deferred
to follow-up PRs and tracked in the RFC.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prototype implementation of a two-phase shutdown model for Microsoft.Testing.Platform (MTP), introducing distinct “Draining” vs “Aborting” cancellation semantics to address Ctrl+C behavior discussed in #5345 while keeping back-compat for existing token consumers.

Changes:

  • Extend ITestApplicationCancellationTokenSource with DrainingToken, AbortingToken, and an Abort() escalation API (with CancellationToken preserved as a Draining alias).
  • Rewrite CTRLPlusCCancellationTokenSource to implement a Running → Draining → Aborting phase model with Ctrl+C escalation and grace/abort timers.
  • Add unit tests covering the new phase/token behavior.
Show a summary per file
File Description
src/Platform/Microsoft.Testing.Platform/Services/ITestApplicationCancellationTokenSource.cs Adds two-phase shutdown tokens and Abort() API while preserving legacy token alias.
src/Platform/Microsoft.Testing.Platform/Services/CTRLPlusCCancellationTokenSource.cs Implements the phased shutdown state machine, Ctrl+C escalation logic, and timers.
test/UnitTests/Microsoft.Testing.Platform.UnitTests/Hosts/CommonHostTests.cs Updates local test stub to satisfy the extended cancellation token source interface.
test/UnitTests/Microsoft.Testing.Platform.UnitTests/Services/CTRLPlusCCancellationTokenSourceTests.cs Adds new unit tests validating draining/aborting semantics and grace escalation.

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 5

case 2:
// 2nd Ctrl+C: escalate to abort.
e.Cancel = true;
EnterAborting();
Comment on lines +105 to +109
public void Dispose()
{
_drainingCts.Dispose();
_abortingCts.Dispose();
}
Comment on lines +163 to +173
private void EnterAborting()
{
if (Interlocked.Exchange(ref _phase, PhaseAborting) == PhaseAborting)
{
return;
}

public void Cancel()
=> _cancellationTokenSource.Cancel();
try
{
_abortingCts.Cancel();
}
Comment on lines +189 to +196
private static void ScheduleEscalation(TimeSpan delay, Action action)
{
// Fire-and-forget timer. We don't dispose: the host is shutting down anyway,
// and a short-lived CTS is cheaper than holding a Timer reference we'd need
// to manage across the phase machine.
var timerCts = new CancellationTokenSource(delay);
timerCts.Token.Register(action);
}
Comment on lines +88 to +92
while (!source.AbortingToken.IsCancellationRequested && !waitCts.IsCancellationRequested)
{
await Task.Delay(10, TestContext.CancellationToken).ConfigureAwait(false);
}

private readonly IEnvironment _environment;
private readonly ILogger? _logger;

private int _phase = PhaseRunning;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants