From 94910767154e7c5d8a0952c2f34eab7b1593bf6f Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Fri, 1 May 2026 13:59:55 -0700 Subject: [PATCH 1/7] Explain auto-failover --- docs/cloud/high-availability/failovers.mdx | 98 +++++++++++----------- 1 file changed, 49 insertions(+), 49 deletions(-) diff --git a/docs/cloud/high-availability/failovers.mdx b/docs/cloud/high-availability/failovers.mdx index 5362f33348..daf47620bc 100644 --- a/docs/cloud/high-availability/failovers.mdx +++ b/docs/cloud/high-availability/failovers.mdx @@ -22,36 +22,56 @@ keywords: import { ToolTipTerm, DiscoverableDisclosure, CaptionedImage } from '@site/src/components'; In case of an incident or an outage, Temporal will automatically your Namespace from the primary to the replica. -This lets Workflow Executions continue with minimal interruptions or data loss. -You can also [manually initiate failovers](/cloud/high-availability/failovers) based on your situational monitoring or for testing. +This lets in-flight Workflow Executions continue, new Workflow Executions start, and closed Workflow Executions be inspected, all with minimal interruptions or data loss. +You can also [manually trigger a failover](/cloud/high-availability/failovers) based on your own monitoring or for failover testing. Returning control from the replica to the primary is called a . After a Temporal-managed failover, Temporal automatically fails back to the original region once it is healthy. See [Returning to the primary with failbacks](#failbacks) for details on automatic and manual failback options. -## Failovers +## Automatic failover -Occasionally, a Namespace may become temporarily unavailable due to an unexpected incident. -Temporal Cloud detects these issues using regular health checks. +When an unexpected outage hits your Temporal Namespace, failing over to a healthy cloud region can prevent data loss and application interruptions. 
+After a failover, in-flight Workflows continue, new Workflows start, and closed Workflows can be inspected, even while the Namespace's original region is unhealthy. -### Health checks +Temporal Cloud offers managed outage detection and failover to all Namespaces that use High Availability. +Temporal-managed failovers, also known as "automatic failovers," keep your Temporal Cloud Namespace available without manual intervention from you. +We aim to both detect the outage and complete a Temporal-managed failover in minutes from when the outage began, according to our stated [Recovery Time Objective (RTO)](/cloud/rto-rpo). -Temporal Cloud monitors error rates, latencies, and infrastructure problems, such as request timeouts. -If it finds unhealthy conditions where indicators exceed the allowed thresholds, Temporal automatically switches the primary to the replica. -In most cases, the replica is unaffected by the issue. -This process is known as failover. +After a Temporal-managed failover, your Namespace will have a replica in its original region. +Once the original region is healthy again, Temporal Cloud automatically performs a [failback](#failbacks), moving your Namespace back home. -### Automatic failovers + -Failovers prevent data loss and application interruptions. -Existing Workflows continue, and new Workflows start as the incident is addressed. +To opt out of Temporal-managed failovers and their RTO, you can [disable automated failovers](/cloud/high-availability/failovers#disabling-temporal-initiated). + -Temporal Cloud handles failovers automatically, ensuring continuity without manual intervention. -Once the incident is resolved, Temporal Cloud automatically performs a [failback](#failbacks), shifting Workflow Execution processing back to the original region. 
+### Conditions that trigger an automatic failover - +While the failover operation usually completes in seconds, it can take much longer in an outage to actually detect the disruption and decide to trigger a failover. + +To achieve Temporal Cloud's Recovery Time Objective (RTO) for Namespaces that have enabled High Availability and Temporal-managed failovers (also known as "automatic failovers"), Temporal Cloud runs automated Workflows that detect outages and trigger failovers. +These Workflows continuously monitor the health of Temporal Cloud in every region and every cell. + +The main conditions these Workflows check are listed below. +If any of these conditions are failing for too long, Temporal Cloud automatically triggers a failover on any Namespaces with High Availability that have a healthy replica. +Additionally, Temporal's on-call engineers may trigger a failover at their discretion, for example, if they see early signs of a regional outage. + +:::note + +The following list is meant to give Temporal Cloud users a general idea of the conditions that trigger a Temporal-managed failover. +This is not an exhaustive list of all cases, and it may change over time. + +::: + +#### Example conditions monitored + +1. Whether Temporal Cloud's services in the cell are reachable from the control plane. Unreachable services are considered "unhealthy". +2. The average latency of inbound RPC calls (excluding long-polling APIs) to Temporal services in the cell. If the average latency rises too high over a rolling time window, this condition is considered "unhealthy". +3. The percentage of inbound RPC calls that returned errors related to server health. If the percentage rises too high over a rolling time window, this condition is considered "unhealthy". +4. The average latency of calls from Temporal Cloud's services in the cell to its persistence layer. If the average latency rises too high over a rolling time window, this condition is considered "unhealthy". +5. 
The percentage of recent calls to the persistence layer that returned errors related to persistence health. If the percentage rises too high over a rolling time window, this condition is considered "unhealthy". -For more control over the failover process, you can [disable automated failovers](/cloud/high-availability/failovers#disabling-temporal-initiated). :::tip @@ -72,15 +92,15 @@ After failover, be aware of the following points: ### The failover process {#failover-process} -Temporal's automated failover process works as follows: +Temporal's failover process works as follows: -- During normal operation, the primary asynchronously copies operations and metadata to its replica, keeping them in sync. -- If the primary becomes unavailable, Temporal detects the issue through health checks. - It automatically switches to the replica, using one of its available [failover scenarios](#scenarios). -- The replica takes over the active role and becomes the primary. +1. During normal operation, the primary asynchronously copies operations and metadata to its replica, keeping them in sync. +1. A failover is triggered, either automatically by Temporal or manually by a user. +1. The replica takes over and the Namespace becomes active in the replica's cloud region. Operations continue with minimal disruption. -- When the original primary recovers, the roles can either switch back (failback, by default) or remain as they are, based on your Namespace settings. - Automatic role switching with failover and failback minimizes downtime for consistent availability. +1. If the failover was triggered by Temporal, when the original primary region recovers, Temporal triggers another failover to fail back to the Namespace's original region. + (It is possible to opt out of this automatic failback.) +1. If the failover was triggered by a user, then the Namespace will continue as-is until a user triggers another failover. 
:::info @@ -91,9 +111,11 @@ This update is replicated through the Namespace metadata mechanism. ## Failover scenarios {#scenarios} -The Temporal Cloud failover mechanism supports several modes for executing Namespace failovers. -These modes include graceful failover ("handover"), forced failover, and a hybrid mode. -The hybrid mode is Temporal Cloud’s default Namespace behavior. +A failover on Temporal Cloud always executes in a "hybrid" fashion: +1. It first attempts a "graceful" failover. +2. If the graceful failover does not complete within 10 seconds, it triggers a "forced" failover. + +This strategy balances _consistency_ and _availability_ requirements. ### Graceful failover (handover) {#graceful-failover} @@ -122,28 +144,6 @@ Events not replicated due to replication lag undergo conflict resolution upon re This mode prioritizes _availability_ over consistency. -### Hybrid failover mode {#hybrid-failover} - -While graceful failovers are preferred for consistency, they aren’t always practical. -Temporal Cloud’s hybrid failover mode (the default mode) limits the initial graceful failover attempt to 10 seconds or less. - -During this period: - -- Existing Workflows stop progress. -- Temporal Cloud returns a "Service unavailable error", which is retried by SDKs. - -If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover. - -This strategy balances _consistency_ and _availability_ requirements. - -### Scenario summary - -| Failover Scenario | Characteristics | -| ---------------------------- | ------------------------------------------------------- | -| Graceful failover (handover) | Favors _consistency_ over availability. | -| Forced failover | Prioritizes _availability_ over consistency. | -| Hybrid failover mode | Balances _consistency_ and _availability_ requirements. | - ## Network partitions At any time only the primary or the replica is active. 
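Reviewer note on the hybrid failover wording in the patch above: the two-step behavior (a graceful attempt capped at 10 seconds, then a forced failover) can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions, not Temporal Cloud's implementation. The function `hybrid_failover` and the `graceful` and `forced` callables are hypothetical names, and the sketch only models the timeout path that the text describes.

```python
import concurrent.futures
import time
from typing import Callable

# From the patch text: the graceful attempt is limited to 10 seconds.
GRACEFUL_TIMEOUT_SECONDS = 10.0


def hybrid_failover(
    graceful: Callable[[], None],
    forced: Callable[[], None],
    timeout_seconds: float = GRACEFUL_TIMEOUT_SECONDS,
) -> str:
    """Hypothetical sketch: try a graceful handover, force the failover on timeout.

    Returns which mode completed the failover: "graceful" or "forced".
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(graceful)
    try:
        # Wait for the graceful handover, but no longer than the deadline.
        future.result(timeout=timeout_seconds)
        return "graceful"
    except concurrent.futures.TimeoutError:
        # The graceful attempt did not finish in time: favor availability.
        forced()
        return "forced"
    finally:
        # Do not block on a graceful attempt that is still running.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # A handover that finishes quickly stays graceful.
    print(hybrid_failover(lambda: None, lambda: None))
    # A handover that stalls past the (shortened) deadline is forced.
    print(hybrid_failover(lambda: time.sleep(0.3), lambda: None, timeout_seconds=0.05))
```

The sketch makes the trade-off visible: the deadline bounds how long consistency is favored before availability takes over.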
From f99253c9bcee851f814d64d92554af6fbdcef470 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Tue, 12 May 2026 11:42:07 -0700 Subject: [PATCH 2/7] Explaining different outage types --- docs/cloud/rto-rpo.mdx | 165 +++++++++++++++++++++++++++++++++-------- 1 file changed, 134 insertions(+), 31 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index ecb12e35c8..f65ec3700e 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -1,8 +1,8 @@ --- id: rpo-rto -title: RPO and RTO -sidebar_label: RPO and RTO -description: Understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in Temporal Cloud. +title: Outages and Recovery Objectives (RTO / RPO) +sidebar_label: Outages and Recovery Objectives (RTO / RPO) +description: Understand the types of outages Temporal Cloud is designed to handle, and the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for each. slug: /cloud/rpo-rto toc_max_heading_level: 4 keywords: @@ -11,6 +11,7 @@ keywords: - RTO - Recovery Point Objective - Recovery Time Objective + - outages tags: - Recovery Point Objective - Recovery Time Objective @@ -23,30 +24,114 @@ When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maint To help users plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). -To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. 
The location of the replica determines the type of replication used and the type of outages that can be handled. Multi-region Replication is when the active and replica are in different regions on the same cloud (e.g., AWS us-east-1 and AWS us-west-2). Multi-cloud Replication is when the active and replica are in different clouds (e.g., AWS and GCP). Same-region Replication is when the active and replica are in the same region. Temporal always places the active and replica in different [cells](/cloud/overview#cell-based-infrastructure). - -As Workflows progress in the active region, history events are asynchronously replicated to the replica. -Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region. -If an outage hits the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started. - -The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the below RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers, which comes enabled by default. - -1. **Availability zone outage**: - 1. _Applicable Namespaces:_ All Namespaces - 2. _Goals:_ Zero RPO and near-zero RTO - 3. _More details:_ Historically, these have been the most common type of outage in the cloud. Temporal Cloud replicates every Namespace across three availability zones. The failure of a single availability zone is handled automatically by Temporal Cloud behind the scenes, with no potential for data loss, and little-to-no observable downtime to the end user. -2. **Cell outage**: - 1. 
_Applicable Namespaces:_ Namespaces with Same-region Replication, Multi-region Replication, or Multi-cloud Replication - 2. _Goals:_ 1-minute RPO and 20-minute RTO - 3. _More details:_ Temporal Cloud runs on a [cell architecture](/cloud/sla). Each cell contains the software and services necessary to host a Namespace. While unlikely, it's possible for a cell to experience a disruption due to uncaught software bugs or sub-component failures (e.g., an outage in the underlying database). -3. **Regional outage**: - 1. _Applicable Namespaces:_ Namespaces with Multi-region Replication or Multi-cloud Replication - 2. _Goals:_ 1-minute RPO and 20-minute RTO - 3. _More details:_ On [rare occasions](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), an entire region within a cloud provider will be degraded. Since Namespaces depend on the cloud provider's infrastructure, Temporal Cloud is not immune to these outages. -4. **Cloud-wide outage**: - 1. _Applicable Namespaces:_ Namespaces with Multi-cloud Replication - 2. _Goals:_ 1-minute RPO and 20-minute RTO - 3. _More details:_ An entire cloud provider has an outage across most or all regions. Since cloud providers strive to keep cloud regions de-coupled, these are the rarest outages of all. Still, they [have happened](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW) in the past. +## Types of outages Temporal Cloud designs around + +Temporal Cloud is engineered to withstand four broad categories of cloud outage. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. 
+ +### Availability Zone outage + +An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware failure, power loss, or a localized network partition. + +Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to its customers. + +**Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. + +If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). + +:::note + +When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by disabling Temporal-managed failovers on the Namespace. + +::: + +#### RTO and RPO + +When using Temporal Cloud (no additional features required): + +- **Near-zero RTO.** When a single AZ fails, the remaining two AZs continue serving requests without a failover, so end users see little to no disruption. +- **Zero RPO.** Writes to Workflow state are synchronously replicated across all three AZs before being acknowledged back to the Client, so an AZ failure cannot cause data loss. 
+ +### Cell outage + +Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html). Each cell contains the software and services necessary to host a Namespace, and components within a cell are distributed across at least three Availability Zones. Cells provide a strong unit of isolation: a problem inside one cell does not propagate to other cells. + +**Example causes:** failure of a sub-component within the cell (for example, an individual database becoming unavailable) or a software bug introduced in a new deploy to the cell. + +**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. + +Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. + +#### RTO and RPO + +When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for Temporal-managed failover: + +- **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. + +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. 
+ +### Cloud Region outage + +A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone. + +**Example causes:** failure of a key cloud service in the region (for example, the cloud provider's DNS resolver) causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. + +**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. + +Regional outages are less common than cell or AZ outages, but they happen. During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. + +#### RTO and RPO + +When using Multi-region Replication or Multi-cloud Replication for Temporal-managed failover: + +- **RTO under 20 minutes.** Temporal detects the regional disruption and fails the Namespace over to its replica in another region. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region. + +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. + +### Cloud-wide outage + +On rare occasions, an issue affects most or all regions of a single cloud provider at once. + +**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure. 
+ +**Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. + +Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. + +#### RTO and RPO + +When using Multi-cloud Replication for Temporal-managed failover: + +- **RTO under 20 minutes.** Temporal detects the cloud-wide disruption and fails the Namespace over to its replica in a different cloud provider. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region, even across cloud providers. + +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. + +## How High Availability replication works + +To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. The location of the replica determines the type of replication used and the categories of outage it can handle: + +- **Multi-region Replication** places the active and replica in different regions on the same cloud (for example, AWS us-east-1 and AWS us-west-2). +- **Multi-cloud Replication** places the active and replica in different cloud providers (for example, AWS and GCP). +- **Same-region Replication** (Preview) places the active and replica in the same region. 
+Temporal always places the active and replica in different [cells](/cloud/overview#cell-based-infrastructure). + +As Workflows progress in the active region, history events are asynchronously replicated to the replica. Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region. If an outage hits the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started. + +## Explaining Temporal Cloud's RTO and RPO + +The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the published RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers, which are enabled by default. + +### Summary table + +| Outage type | Applicable Namespaces | RPO | RTO | +| ------------------------ | --------------------------------------------------------------------- | -------------- | ---------------- | +| Availability Zone outage | All Namespaces | Zero | Near-zero | +| Cell outage | Namespaces with Same-region, Multi-region, or Multi-cloud Replication | Under 1 minute | Under 20 minutes | +| Cloud Region outage | Namespaces with Multi-region or Multi-cloud Replication | Under 1 minute | Under 20 minutes | +| Cloud-wide outage | Namespaces with Multi-cloud Replication | Under 1 minute | Under 20 minutes | Notes: @@ -64,8 +149,17 @@ Temporal highly recommends keeping Temporal-initiated failovers enabled. When Te - All Namespaces are backed up every 4 hours. 
If an outage causes data loss on a Namespace that was not protected by High Availability, then Temporal will use the backup to restore as much data as feasible. +- Temporal has internal goals and measurements for Recovery Time and Recovery Point, but does not publish the achieved Recovery Time and Recovery Point for each incident. -## Minimizing the Recovery Point +### Explaining the RPO + +:::note Temporal's Recovery Point is different from a traditional Recovery Point + +In a traditional database, data within the Recovery Point window may be permanently lost during a failover. In Temporal Cloud, that data is not lost. Cloud data stores are engineered for extreme durability (commonly 99.999999999%, or "11 nines"), so any data acknowledged by Temporal Cloud is durably persisted. After the outage resolves, Temporal's Recovery and Conflict Resolution process automatically syncs that data back into the Namespace. + +The Recovery Point Objective therefore reflects the maximum data that may be temporarily unavailable in the replica at the moment of failover, not the maximum data that could be permanently lost. + +::: Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including: @@ -83,7 +177,15 @@ Temporal recommends monitoring the replication lag and alerting should it rise t ::: -## Minimizing the Recovery Time +### Explaining the RTO + +The Recovery Time for a given incident is measured from the moment the incident begins to cause abnormal Namespace operation — for example, when unavailability or error rates rise above an acceptable level — to the moment the Namespace is restored to full functionality. + +For most incidents, the vast majority of the Recovery Time is spent detecting the incident, determining the affected boundary (a single cell, a region, or an entire cloud), and deciding to fail Namespaces over to their replicas. 
The actual time to complete the failover is usually a very small piece of the Recovery Time. + +This Recovery Time covers only the Temporal Namespace. Your application's overall Recovery Time also depends on having enough healthy Workers that can reach the Namespace and process Workflows. Maintaining sufficient Worker capacity that can reach the replica region (or replica cloud) during a failover is your responsibility. + +#### How Temporal achieves a low Recovery Time Temporal has put extensive work into tools and processes that minimize the recovery time and achieve its RTO for Temporal-initiated failovers, including: @@ -97,6 +199,8 @@ Temporal has put extensive work into tools and processes that minimize the recov - Expert engineers on-call 24/7 monitoring Temporal Cloud Namespaces and ready to assist should an outage occur. +#### How users can achieve a lower Recovery Time + To achieve the lowest possible recovery times, Temporal recommends that you: - Keep Temporal-initiated failovers enabled on your Namespace (the default) @@ -112,8 +216,7 @@ Users can trigger manual failovers on their Namespaces even if Temporal-initiate - Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet. - -## Understanding Temporal's RTO vs. SLA +#### Comparing RTO and SLA Temporal has both a Recovery Time Objective (RTO) and a Service Level Agreement (SLA). They serve complementary purposes and apply in different situations. 
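Reviewer note on the summary table this patch adds to `rto-rpo.mdx`: the table is effectively a lookup from replication option and outage type to a published objective. The Python sketch below encodes the table rows as plain data to make that reading explicit. It is only an illustration: the identifiers (`COVERAGE`, `OBJECTIVES`, `objectives_for`) and the string keys are hypothetical and do not correspond to any Temporal API. Per the patch text, the non-AZ objectives apply only while Temporal-initiated failovers stay enabled.

```python
from typing import Dict, List, Optional

# Replication option -> outage types that have a published RPO/RTO,
# transcribed from the summary table. None means no High Availability.
COVERAGE: Dict[Optional[str], List[str]] = {
    None: ["availability_zone"],
    "same-region": ["availability_zone", "cell"],
    "multi-region": ["availability_zone", "cell", "cloud_region"],
    "multi-cloud": ["availability_zone", "cell", "cloud_region", "cloud_wide"],
}

# Outage type -> objectives, transcribed from the same table.
OBJECTIVES: Dict[str, Dict[str, str]] = {
    "availability_zone": {"rpo": "zero", "rto": "near-zero"},
    "cell": {"rpo": "under 1 minute", "rto": "under 20 minutes"},
    "cloud_region": {"rpo": "under 1 minute", "rto": "under 20 minutes"},
    "cloud_wide": {"rpo": "under 1 minute", "rto": "under 20 minutes"},
}


def objectives_for(replication: Optional[str], outage: str) -> Optional[Dict[str, str]]:
    """Return the published objectives, or None when the table lists no objective.

    Assumes Temporal-initiated failovers are enabled (the default), which the
    patch text states is a precondition for the non-AZ objectives.
    """
    if outage not in COVERAGE[replication]:
        return None
    return OBJECTIVES[outage]


if __name__ == "__main__":
    # Same-region Replication has no objective for a Cloud Region outage.
    print(objectives_for("same-region", "cloud_region"))
    # Multi-cloud Replication has an objective for every outage type listed.
    print(objectives_for("multi-cloud", "cloud_wide"))
```

Reading the table this way highlights that each replication option covers a superset of the outage types covered by the one before it.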
From 412b8defb87f0d4621860786c928899a8ad0cc38 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 14 May 2026 12:09:22 -0700 Subject: [PATCH 3/7] Added blast radius to outage types --- docs/cloud/rto-rpo.mdx | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index f65ec3700e..d11fc3c086 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -34,6 +34,8 @@ An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to its customers. +**Blast Radius:** A single Availability Zone within a single cloud region. Because every Namespace's components are spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay operational with little to no downtime. However, the outage will take out any Workers the user is running in that AZ. We recommend spreading Workers across multiple AZs to mitigate this. + **Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). @@ -57,6 +59,8 @@ Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarc **Example causes:** failure of a sub-component within the cell (for example, an individual database becoming unavailable) or a software bug introduced in a new deploy to the cell. 
+**Blast Radius:** One cell--and the Namespaces within that cell--within a single region. Even though your Workers will remain healthy, they will not be able to process Workflows because the Namespace is down. + **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. @@ -76,6 +80,8 @@ A cloud region as a whole can become degraded, with effects that span beyond any **Example causes:** failure of a key cloud service in the region (for example, the cloud provider's DNS resolver) causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. +**Blast Radius:** All Namespaces and Workers within a single cloud region are potentially affected. Namespaces and Workers in other regions of the same cloud — and in other clouds — are unaffected. + **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. Regional outages are less common than cell or AZ outages, but they happen. 
During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. @@ -95,6 +101,8 @@ On rare occasions, an issue affects most or all regions of a single cloud provid **Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure. +**Blast Radius:** Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud is potentially affected. + **Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. From c6cf12b6b70d8f13db280cef950fa17b679b9b69 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 14 May 2026 14:31:53 -0700 Subject: [PATCH 4/7] added SLA calculations to outage types --- docs/cloud/rto-rpo.mdx | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index d11fc3c086..de76b4b4f3 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -38,6 +38,8 @@ Historically, AZ outages are the most common type of outage in the cloud, and Te **Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. 
+**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during an AZ outage count toward SLA credits, since AZ resilience is within Temporal's responsibility. + If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). :::note @@ -63,6 +65,8 @@ Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarc **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during a cell outage count toward SLA credits, since mitigating cell outages is within Temporal's responsibility. + Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. #### RTO and RPO @@ -84,6 +88,10 @@ A cloud region as a whole can become degraded, with effects that span beyond any **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. 
+**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-region Replication or Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without these features, a Cloud Region outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. + +If two or more regions in the same cloud provider experience an outage simultaneously, Temporal Cloud treats the event as a [Cloud-wide outage](#cloud-wide-outage). + Regional outages are less common than cell or AZ outages, but they happen. During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. #### RTO and RPO @@ -97,14 +105,16 @@ Even though the RPO target is under 1 minute, data is virtually never "lost" tha ### Cloud-wide outage -On rare occasions, an issue affects most or all regions of a single cloud provider at once. +On rare occasions, an issue affects two or more regions of a single cloud provider at once. Any simultaneous outage of two or more regions in the same cloud provider is treated as a cloud-wide outage. -**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure. +**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure, or two or more regions in the same cloud experiencing independent regional outages at the same time. **Blast Radius:** Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud is potentially affected. 
**Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this feature, a cloud-wide outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. + Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. #### RTO and RPO From 9e288857f6fb001da5cd80d6e91dd384f28aa2f4 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Fri, 15 May 2026 13:52:25 -0700 Subject: [PATCH 5/7] Apply suggestions from code review Co-authored-by: Lenny Chen <55669665+lennessyy@users.noreply.github.com> Co-authored-by: Kevin Woo <3469532+kevinawoo@users.noreply.github.com> Co-authored-by: Luke Knepper --- docs/cloud/rto-rpo.mdx | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index de76b4b4f3..c3420443d0 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -22,17 +22,17 @@ import { ToolTipTerm } from '@site/src/components'; When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maintain the Namespace's availability and data durability. The time it takes to recover from the outage is called the "recovery time." The amount of data (event histories) lost is called the "recovery point." A durable system should have a low recovery time and recovery point. 
-To help users plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). +To help you plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). ## Types of outages Temporal Cloud designs around -Temporal Cloud is engineered to withstand four broad categories of cloud outage. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. +Temporal Cloud is engineered to withstand four broad categories of cloud outages. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. ### Availability Zone outage An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware failure, power loss, or a localized network partition. 
-Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to its customers. +Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to our customers. **Blast Radius:** A single Availability Zone within a single cloud region. Because every Namespace's components are spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay operational with little to no downtime. However, the outage will take out any Workers the user is running in that AZ. We recommend spreading Workers across multiple AZs to mitigate this. @@ -40,11 +40,11 @@ Historically, AZ outages are the most common type of outage in the cloud, and Te **SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during an AZ outage count toward SLA credits, since AZ resilience is within Temporal's responsibility. -If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). +If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using [Same-region Replication](/cloud/high-availability#same-region-replication). :::note -When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by disabling Temporal-managed failovers on the Namespace. +When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. 
You can opt out of this behavior by [disabling Temporal-managed failovers](/cloud/high-availability/failovers#disabling-temporal-initiated) on the Namespace. ::: @@ -63,7 +63,7 @@ Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarc **Blast Radius:** One cell--and the Namespaces within that cell--within a single region. Even though your Workers will remain healthy, they will not be able to process Workflows because the Namespace is down. -**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. +**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell within the same region. When any of these features is enabled for a Namespace, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. **SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during a cell outage count toward SLA credits, since mitigating cell outages is within Temporal's responsibility. 
From 165ad9957235178de1c909be7a8ffed4f7c59997 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Fri, 15 May 2026 13:54:14 -0700 Subject: [PATCH 6/7] Apply suggestions from code review Co-authored-by: Lenny Chen <55669665+lennessyy@users.noreply.github.com> Co-authored-by: Luke Knepper --- docs/cloud/rto-rpo.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index c3420443d0..f7db72c708 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -76,13 +76,13 @@ When using Same-region Replication, Multi-region Replication, or Multi-cloud Rep - **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. - **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. -Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when the outage is over. ### Cloud Region outage A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone. -**Example causes:** failure of a key cloud service in the region (for example, the cloud provider's DNS resolver) causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. +**Example causes:** failure of a key cloud service in the region causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. **Blast Radius:** All Namespaces and Workers within a single cloud region are potentially affected. 
Namespaces and Workers in other regions of the same cloud — and in other clouds — are unaffected. @@ -155,7 +155,7 @@ Notes: - The above goals are only applicable to Namespaces that have enabled Temporal-initiated failovers, which comes enabled by default. Temporal-initiated failovers are initiated by Temporal's tooling and/or on-call engineers without user action. Users can always initiate a failover on their Namespace, even when Temporal-initiated failovers are enabled. In an outage, a user-initiated failover will not cancel out or accidentally reverse a Temporal-initiated failover. -:::note +:::tip Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover. @@ -217,7 +217,7 @@ Temporal has put extensive work into tools and processes that minimize the recov - Expert engineers on-call 24/7 monitoring Temporal Cloud Namespaces and ready to assist should an outage occur. 
-#### How users can achieve a lower Recovery Time +#### Tips for a lower Recovery Time To achieve the lowest possible recovery times, Temporal recommends that you: From 556fbed7767f31a80a4d173fae020770aacdfe7d Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Fri, 15 May 2026 16:48:36 -0700 Subject: [PATCH 7/7] More info about failover timing and uses --- docs/cloud/high-availability/enable.mdx | 6 ++++ docs/cloud/high-availability/failovers.mdx | 34 ++++++++++++++++++- .../high-availability/ha-connectivity.mdx | 6 ++++ docs/cloud/high-availability/monitoring.mdx | 6 ++++ docs/cloud/rto-rpo.mdx | 11 ++++++ 5 files changed, 62 insertions(+), 1 deletion(-) diff --git a/docs/cloud/high-availability/enable.mdx b/docs/cloud/high-availability/enable.mdx index b4c2a29488..4d407e9923 100644 --- a/docs/cloud/high-availability/enable.mdx +++ b/docs/cloud/high-availability/enable.mdx @@ -20,6 +20,12 @@ Using private network connectivity with a HA namespace requires extra setup. See There are charges associated with Replication and enabling High Availability features. For pricing details, visit Temporal Cloud's [Pricing](/cloud/pricing) page. +:::tip White paper + +For an in-depth guide covering everything from why you need High Availability to setting it up in production and advanced options, read the [High Availability White Paper](https://temporal.io/pages/high-availability-white-paper). + +::: + ## Create a Namespace with High Availability features {#create} To create a new Namespace with High Availability features, you can use the Temporal Cloud UI or the tcld command line diff --git a/docs/cloud/high-availability/failovers.mdx b/docs/cloud/high-availability/failovers.mdx index daf47620bc..a0a40ce7a2 100644 --- a/docs/cloud/high-availability/failovers.mdx +++ b/docs/cloud/high-availability/failovers.mdx @@ -29,6 +29,12 @@ Returning control from the replica to the primary is called a