6 changes: 6 additions & 0 deletions docs/cloud/high-availability/enable.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,12 @@ Using private network connectivity with a HA namespace requires extra setup. See
There are charges associated with Replication and enabling High Availability features. For pricing details, visit
Temporal Cloud's [Pricing](/cloud/pricing) page.

:::tip White paper

For an in-depth guide covering everything from why you need High Availability to setting it up in production and advanced options, read the [High Availability White Paper](https://temporal.io/pages/high-availability-white-paper).

:::

## Create a Namespace with High Availability features {#create}

To create a new Namespace with High Availability features, you can use the Temporal Cloud UI or the tcld command line
Expand Down
128 changes: 80 additions & 48 deletions docs/cloud/high-availability/failovers.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -22,36 +22,62 @@ keywords:
import { ToolTipTerm, DiscoverableDisclosure, CaptionedImage } from '@site/src/components';

In case of an incident or an outage, Temporal will automatically <ToolTipTerm term="fail over" src="failover" /> your Namespace from the primary to the replica.
This lets Workflow Executions continue with minimal interruptions or data loss.
You can also [manually initiate failovers](/cloud/high-availability/failovers) based on your situational monitoring or for testing.
This lets in-flight Workflow Executions continue, new Workflow Executions start, and closed Workflow Executions be inspected, all with minimal interruptions or data loss.
You can also [manually trigger a failover](/cloud/high-availability/failovers) based on your own monitoring or for failover testing.

Returning control from the replica to the primary is called a <ToolTipTerm term="failback" />.
After a Temporal-managed failover, Temporal automatically fails back to the original region once it is healthy.
See [Returning to the primary with failbacks](#failbacks) for details on automatic and manual failback options.

## Failovers
:::tip White paper

Occasionally, a Namespace may become temporarily unavailable due to an unexpected incident.
Temporal Cloud detects these issues using regular health checks.
For an in-depth guide covering everything from why you need High Availability to setting it up in production and advanced options, read the [High Availability White Paper](https://temporal.io/pages/high-availability-white-paper).

### Health checks
:::

Temporal Cloud monitors error rates, latencies, and infrastructure problems, such as request timeouts.
If it finds unhealthy conditions where indicators exceed the allowed thresholds, Temporal automatically switches the primary to the replica.
In most cases, the replica is unaffected by the issue.
This process is known as failover.
## Automatic failover

### Automatic failovers
When an unexpected outage hits your Temporal Namespace, failing over to a healthy cloud region can prevent data loss and application interruptions.
After a failover, in-flight Workflows continue, new Workflows start, and closed Workflows can be inspected, even while the Namespace's original region is unhealthy.

Failovers prevent data loss and application interruptions.
Existing Workflows continue, and new Workflows start as the incident is addressed.
Temporal Cloud offers managed outage detection and failover to all Namespaces that use High Availability.
Temporal-managed failovers, also known as "automatic failovers," keep your Temporal Cloud Namespace available without manual intervention from you.
We aim to both detect the outage and complete a Temporal-managed failover within minutes of the start of the outage, in line with our stated [Recovery Time Objective (RTO)](/cloud/rto-rpo).

Temporal Cloud handles failovers automatically, ensuring continuity without manual intervention.
Once the incident is resolved, Temporal Cloud automatically performs a [failback](#failbacks), shifting Workflow Execution processing back to the original region.
After a Temporal-managed failover, your Namespace will have a replica in its original region.
Once the original region is healthy again, Temporal Cloud automatically performs a [failback](#failbacks), moving your Namespace back home.

<CaptionedImage src="/img/cloud/high-availability/failover.png" title="On failover, the replica becomes active and the Namespace endpoint directs access to it." />

For more control over the failover process, you can [disable automated failovers](/cloud/high-availability/failovers#disabling-temporal-initiated).
To opt out of Temporal-managed failovers and their RTO, you can [disable automated failovers](/cloud/high-availability/failovers#disabling-temporal-initiated).

### Conditions that trigger an automatic failover

While the failover operation itself usually completes in seconds, the bulk of the Recovery Time in an outage is spent detecting the disruption and deciding to trigger a failover. See [How long does a failover take?](#failover-duration) for a detailed breakdown.

To achieve Temporal Cloud's Recovery Time Objective (RTO) for Namespaces that have enabled High Availability and Temporal-managed failovers (also known as "automatic failovers"), Temporal Cloud runs automated Workflows that detect outages and trigger failovers.
These Workflows continuously monitor the health of Temporal Cloud in every region and every cell.

The main conditions these Workflows check are listed below.
If any of these conditions stays unhealthy for too long, Temporal Cloud automatically triggers a failover for every Namespace with High Availability that has a healthy replica.
Additionally, Temporal's on-call engineers may trigger a failover at their discretion, for example, if they see early signs of a regional outage.

:::note

The following list is meant to give Temporal Cloud users a general idea of the conditions that trigger a Temporal-managed failover.
This is not an exhaustive list of all cases, and it may change over time.

:::

#### Example conditions monitored

1. Whether Temporal Cloud's services in the cell are reachable from the control plane. Unreachable services are considered "unhealthy".
2. The average latency of inbound RPC calls (excluding long-polling APIs) to Temporal services in the cell. If the average latency rises too high over a rolling time window, this condition is considered "unhealthy".
3. The percentage of inbound RPC calls that returned errors related to server health. If the percentage rises too high over a rolling time window, this condition is considered "unhealthy".
4. The average latency of calls from Temporal Cloud's services in the cell to its persistence layer. If the average latency rises too high over a rolling time window, this condition is considered "unhealthy".
5. The percentage of recent calls to the persistence layer that returned errors related to persistence health. If the percentage rises too high over a rolling time window, this condition is considered "unhealthy".
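
The rolling-window checks in the list above can be illustrated with a short sketch. This is not Temporal Cloud's implementation: the class name `RollingCondition`, the window size, and the threshold values are all hypothetical, since the real thresholds and window lengths are not published here. It only shows the general idea of averaging a signal over recent samples and flagging it once the average rises too high.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RollingCondition:
    """One health signal averaged over a fixed-size rolling window (illustrative only)."""

    threshold: float      # hypothetical limit; real thresholds are not documented
    window_size: int = 5  # hypothetical number of samples kept in the window
    samples: deque = field(default_factory=deque)

    def record(self, value: float) -> None:
        """Add a sample and drop the oldest one once the window is full."""
        self.samples.append(value)
        if len(self.samples) > self.window_size:
            self.samples.popleft()

    def is_unhealthy(self) -> bool:
        """Unhealthy when the average over a full window exceeds the threshold."""
        # Wait for a full window so that a single bad sample cannot trip the check.
        if len(self.samples) < self.window_size:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold


# Latency in milliseconds (made-up numbers).
latency = RollingCondition(threshold=200.0, window_size=3)
for sample_ms in (100.0, 150.0, 500.0):
    latency.record(sample_ms)

# An error percentage fits the same shape: record 1.0 for an error, 0.0 for a success.
errors = RollingCondition(threshold=0.5, window_size=4)
for outcome in (1.0, 1.0, 0.0, 1.0):
    errors.record(outcome)
```

In this sketch, both `latency.is_unhealthy()` and `errors.is_unhealthy()` return `True` for the made-up samples, and each condition recovers once enough healthy samples push the old ones out of the window.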


:::tip

Expand All @@ -72,15 +98,15 @@ After failover, be aware of the following points:

### The failover process {#failover-process}

Temporal's automated failover process works as follows:
Temporal's failover process works as follows:

- During normal operation, the primary asynchronously copies operations and metadata to its replica, keeping them in sync.
- If the primary becomes unavailable, Temporal detects the issue through health checks.
It automatically switches to the replica, using one of its available [failover scenarios](#scenarios).
- The replica takes over the active role and becomes the primary.
1. During normal operation, the primary asynchronously copies operations and metadata to its replica, keeping them in sync.
1. A failover is triggered, either automatically by Temporal or manually by a user.
1. The replica takes over and the Namespace becomes active in the replica's cloud region.
Operations continue with minimal disruption.
- When the original primary recovers, the roles can either switch back (failback, by default) or remain as they are, based on your Namespace settings.
Automatic role switching with failover and failback minimizes downtime for consistent availability.
1. If the failover was triggered by Temporal, when the original primary region recovers, Temporal triggers another failover to fail back to the Namespace's original region.
(You can opt out of this automatic failback.)
1. If the failover was triggered by a user, then the Namespace will continue as-is until a user triggers another failover.
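
The steps above can be modeled as a small state machine. The following is a minimal sketch under stated assumptions, not Temporal Cloud code: the `Namespace` class, its field names, and the region names are hypothetical and exist only to show how the failback behavior depends on who triggered the failover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Namespace:
    """Toy model of a Namespace with one replica (all names are hypothetical)."""

    home_region: str
    replica_region: str
    active_region: str
    failed_over_by: Optional[str] = None  # "temporal", "user", or None when at home
    auto_failback: bool = True            # models the opt-out of automatic failback

    def fail_over(self, triggered_by: str) -> None:
        """Move the active role to whichever region is not currently active."""
        if self.active_region == self.home_region:
            self.active_region = self.replica_region
            self.failed_over_by = triggered_by
        else:
            self.active_region = self.home_region
            self.failed_over_by = None

    def on_home_region_recovered(self) -> bool:
        """Fail back only after a Temporal-triggered failover; return True if it happened."""
        away_from_home = self.active_region != self.home_region
        if away_from_home and self.failed_over_by == "temporal" and self.auto_failback:
            self.fail_over("temporal")
            return True
        return False


# Temporal-triggered failover: the Namespace returns home once the region recovers.
managed = Namespace("region-a", "region-b", active_region="region-a")
managed.fail_over("temporal")
managed.on_home_region_recovered()

# User-triggered failover: the Namespace stays put until a user triggers another failover.
manual = Namespace("region-a", "region-b", active_region="region-a")
manual.fail_over("user")
manual.on_home_region_recovered()
```

After these calls, `managed.active_region` is back to `"region-a"`, while `manual.active_region` remains `"region-b"`.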

:::info

Expand All @@ -89,11 +115,31 @@ This update is replicated through the Namespace metadata mechanism.

:::

### How long does a failover take? {#failover-duration}

The time to complete a failover depends on who triggered it.

#### User-triggered failover

A failover that you trigger yourself happens in two stages:

1. **The Namespace becomes active in the other region.** Temporal Cloud completes this stage within 10 seconds (internal SLO). Existing Workflow Executions resume in the new active region, and new Workflow Executions can be started.
2. **The Namespace Endpoint re-routes to the active region.** This DNS change can take a few minutes to fully propagate to all Clients and Workers. If your application has an extremely demanding Recovery Time, you can eliminate this stage by connecting through a [Regional Endpoint](/cloud/high-availability/ha-connectivity#regional-endpoint) instead of the Namespace Endpoint. Regional Endpoints require more setup, so most users should stick with the default Namespace Endpoint.

#### Temporal-triggered failover

A failover that Temporal triggers in response to an outage also happens in two stages:

1. **Detecting the outage.** This is the bulk of the Recovery Time. Outages are rarely black and white; they often start as a slow degradation. Temporal continuously runs the automated health checks described in [Conditions that trigger an automatic failover](#conditions-that-trigger-an-automatic-failover).
2. **Triggering the failover commands.** Once detection completes, Temporal triggers failovers across all impacted Namespaces.

## Failover scenarios {#scenarios}

The Temporal Cloud failover mechanism supports several modes for executing Namespace failovers.
These modes include graceful failover ("handover"), forced failover, and a hybrid mode.
The hybrid mode is Temporal Cloud’s default Namespace behavior.
A failover on Temporal Cloud always executes in a "hybrid" fashion:

1. It first attempts a "graceful" failover.
2. If the graceful failover does not complete within 10 seconds, it triggers a "forced" failover.

This strategy balances _consistency_ and _availability_ requirements.
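
As a rough illustration of the hybrid approach, the sketch below tries a graceful handover with a 10-second budget and falls back to a forced failover if the handover does not finish. The function `hybrid_failover` and its two callback parameters are hypothetical stand-ins; Temporal Cloud performs this logic internally and does not expose it as an API like this.

```python
from typing import Callable

# The 10-second limit comes from the text above; everything else is illustrative.
GRACEFUL_TIMEOUT_SECONDS = 10.0


def hybrid_failover(
    try_graceful: Callable[[float], bool],
    force: Callable[[], None],
) -> str:
    """Attempt a graceful handover first, then force the failover if it times out.

    try_graceful(timeout) is a hypothetical callback that returns True when the
    handover completed within `timeout` seconds. force() is a hypothetical
    callback that performs the forced failover.
    """
    if try_graceful(GRACEFUL_TIMEOUT_SECONDS):
        return "graceful"  # consistency preserved: the replica was fully caught up
    force()
    return "forced"        # availability preferred: the replica takes over regardless


# A handover that finishes in time never reaches the forced path.
healthy_result = hybrid_failover(lambda timeout: True, lambda: None)

# A handover that times out falls through to the forced failover.
forced_calls = []
degraded_result = hybrid_failover(lambda timeout: False, lambda: forced_calls.append("forced"))
```

Passing the two steps in as callbacks keeps the sketch free of real timers, so the ordering (graceful first, forced only on timeout) is the only behavior it demonstrates.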

### Graceful failover (handover) {#graceful-failover}

Expand Down Expand Up @@ -122,28 +168,6 @@ Events not replicated due to replication lag undergo conflict resolution upon re

This mode prioritizes _availability_ over consistency.

### Hybrid failover mode {#hybrid-failover}

While graceful failovers are preferred for consistency, they aren’t always practical.
Temporal Cloud’s hybrid failover mode (the default mode) limits the initial graceful failover attempt to 10 seconds or less.

During this period:

- Existing Workflows stop progress.
- Temporal Cloud returns a "Service unavailable error", which is retried by SDKs.

If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover.

This strategy balances _consistency_ and _availability_ requirements.

### Scenario summary

| Failover Scenario | Characteristics |
| ---------------------------- | ------------------------------------------------------- |
| Graceful failover (handover) | Favors _consistency_ over availability. |
| Forced failover | Prioritizes _availability_ over consistency. |
| Hybrid failover mode | Balances _consistency_ and _availability_ requirements. |

## Network partitions

At any time only the primary or the replica is active.
Expand Down Expand Up @@ -191,6 +215,14 @@ A forced failover when there is a significant replication lag has a higher likel

:::

### When to trigger a manual failover {#when-to-trigger}

Most Namespaces with High Availability are well-served by Temporal-managed failovers. The cases where a manual failover is warranted are:

- **Testing failover or migrating to a new region.** A manual failover is the standard way to exercise your failover process with your Clients and Workers, or to move a Namespace to a different region.
- **An outage that affects only your systems.** If an outage is contained to your application, Workers, or other infrastructure — and Temporal Cloud is not affected — Temporal will not initiate a failover on your behalf. Detect the outage with your own monitoring and trigger a failover yourself.
- **Failing over more aggressively during a regional outage.** Even with Temporal-managed failovers enabled, you can still trigger a failover yourself if you detect a regional outage before Temporal does. Whichever failover happens first takes effect, and the later one is a no-op, so a user-triggered failover does not conflict with Temporal's automatic failover. This can help you achieve a lower Recovery Time when every minute matters.
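
The "whichever failover happens first takes effect" behavior in the last bullet can be pictured as an idempotent request. The sketch below is an assumption-laden model, not the Cloud Ops API: `request_failover`, the `state` dictionary, and the region names are hypothetical.

```python
def request_failover(state: dict, target_region: str, requested_by: str) -> bool:
    """Make `target_region` active; return False if it already is (a no-op)."""
    if state["active_region"] == target_region:
        return False  # a previous request already moved the Namespace here
    state["active_region"] = target_region
    state["last_failover_by"] = requested_by
    return True


# Hypothetical Namespace state during a regional outage in "region-a".
state = {"active_region": "region-a", "last_failover_by": None}

# The user notices the outage first and fails over to the replica region.
user_changed = request_failover(state, "region-b", requested_by="user")

# Temporal's automatic failover arrives later and finds nothing left to do.
temporal_changed = request_failover(state, "region-b", requested_by="temporal")
```

Here `user_changed` is `True` and `temporal_changed` is `False`, which is why the two triggers do not conflict in this model.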

### Trigger the failover {#manual-failovers}

You can trigger a failover manually using the Temporal&nbsp;Cloud Web&nbsp;UI, the tcld CLI, or the Cloud Ops API, depending on your preference and setup.
Expand Down
6 changes: 6 additions & 0 deletions docs/cloud/high-availability/ha-connectivity.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,12 @@ This page covers:
- How to choose between the Namespace Endpoint and a Regional Endpoint for a Namespace with High Availability features.
- How to configure PrivateLink so that failover remains transparent to Workers on private networks.

:::tip White paper

For an in-depth guide covering everything from why you need High Availability to setting it up in production and advanced options, read the [High Availability White Paper](https://temporal.io/pages/high-availability-white-paper).

:::

## How to choose an endpoint for a Namespace with High Availability features

Temporal Cloud exposes two kinds of gRPC endpoints for a Namespace.
Expand Down
6 changes: 6 additions & 0 deletions docs/cloud/high-availability/monitoring.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,12 @@ import { ToolTipTerm } from '@site/src/components';
Temporal Cloud offers several ways for you to track the health and performance of your
[High Availability](/cloud/high-availability) namespaces.

:::tip White paper

For an in-depth guide covering everything from why you need High Availability to setting it up in production and advanced options, read the [High Availability White Paper](https://temporal.io/pages/high-availability-white-paper).

:::

## Replication status

You can monitor your replica status with the Temporal Cloud UI. If the replica is unhealthy, Temporal Cloud disables the
Expand Down