From 5c553ca8aa022f30a9339292eb61de57917efa44 Mon Sep 17 00:00:00 2001 From: stevep0z <255929980+stevep0z@users.noreply.github.com> Date: Tue, 23 Jun 2026 12:14:05 -0500 Subject: [PATCH 1/4] chore: add Stellar Channels GCP operator deployment guide --- .../stellar-relayer-gcp-operator-guide.mdx | 1136 +++++++++++++++++ src/navigation/stellar.json | 5 + 2 files changed, 1141 insertions(+) create mode 100644 content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx diff --git a/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx b/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx new file mode 100644 index 00000000..76ee00e0 --- /dev/null +++ b/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx @@ -0,0 +1,1136 @@ +--- +title: 'Hosted Stellar Relayer on GCP: Operator Deployment Guide' +--- + +A step-by-step guide for infrastructure teams running a hosted Stellar relayer service on Google Cloud Platform. + +**Who this is for:** infrastructure operators who have run production GCP workloads but are new to OpenZeppelin's relayer stack. + +**What you get:** a hosted Stellar Channels service in your own GCP project, sized to serve the same workload OpenZeppelin runs today (roughly 2M+ transactions per day across about 2,500 relayers). + +## 1. Overview + +OpenZeppelin runs a hosted Stellar relayer service at `channels.openzeppelin.com` (mainnet) and `channels.openzeppelin.com/testnet` (testnet). The service takes on the hard parts of submitting Stellar transactions in parallel: managing a pool of channel accounts, fee bumping, arbitrating sequence numbers, and failing over between RPC providers. Downstream callers just talk to a simple HTTP API. + +This guide shows you how to run that same service in your own GCP project. 
+ +### What You End Up With + +By the end of this guide you will have: + +- A production-ready hosted Stellar Channels service in your own GCP project, served from a domain you control (for example, `channels.your-company.com`). +- A Cloud Run compute tier with autoscaling, sitting behind an External HTTPS Load Balancer with a Google-managed SSL certificate. +- Memorystore Redis for state and deferred-job scheduling. In production this runs as STANDARD_HA with automatic failover. +- Eight Pub/Sub topics and subscriptions that handle the distributed transaction-processing pipeline (when `queue_backend = "pubsub"`). +- An optional Cloudflare Worker in front of the load balancer for self-serve API-key issuance (the `/gen` flow), per-user rate limiting, and usage analytics. +- A Secret Manager entry for every secret. Secrets are injected as environment variables when the container starts. +- Cloud KMS for ED25519 transaction signing. The module provisions a keyring and an asymmetric signing key. +- An Artifact Registry remote repository configured to proxy the public ECR image, giving Cloud Run a GCP-native pull path. +- Optional Cloud Functions for fund-relayer balance monitoring. + +The service handles two transaction-submission modes: + +- **Signed XDR mode:** the caller signs a complete Stellar transaction envelope and submits it. The service only fee-bumps and submits. +- **Soroban `func` + `auth` mode:** the caller submits a Soroban host function plus authorization entries. The service assembles the transaction, simulates it, signs with a channel account, fee-bumps, and submits. + +### What This Guide Assumes You Already Have + +- A strong GCP background: VPC, Cloud Run, IAM, Cloud DNS, Memorystore, Pub/Sub. +- Terraform fluency (1.5.0 or later). +- A target GCP project where you can create the full resource set. +- A domain you control. DNS can live in Route53, Cloud DNS, or another provider. 
+- Optionally, a Cloudflare account if you want the `/gen` API-key gateway. + +## 1.5 How Channels Works on Stellar + +Every Stellar transaction has a source account with a monotonically increasing sequence number. Only one transaction per source account can be in-flight at a time. This is the constraint that limits parallel throughput on Stellar. + +The Channels service works around it with a pool of dedicated source accounts: the channel accounts. Each in-flight transaction acquires one channel account from the pool, uses its sequence number, and releases it after confirmation. The pool size determines how many transactions can run in parallel. + +The fund account is a separate Stellar account that holds the XLM balance. When the service submits a transaction, it wraps the channel-signed envelope in a fee-bump transaction, a Stellar primitive that lets a second account pay the network fee. Both accounts are backed by Cloud KMS ED25519 keys. + +The pool size you provision in Step 5.10 is your throughput ceiling. See §10.1 for the sizing formula before you bootstrap. + +## 2. Architecture + +### Cloud Architecture + +```mermaid +flowchart TD + Callers([Public callers]) + + subgraph Edge["Edge (Cloudflare, optional)"] + Worker["Cloudflare Worker
• /gen + /testnet/gen — issues API keys
• KV-backed auth, hashes with KEY_SALT
• per-IP / per-key rate limits
• rewrites Bearer→static, sets x-consumer-key
• usage tracking via Analytics Engine"] + end + + subgraph GCPEdge["GCP Edge"] + LB["External HTTPS Load Balancer
Google-managed SSL cert · HTTPS-only
HTTP→HTTPS redirect · Global static IP"] + end + + subgraph Compute["Compute"] + CloudRun["Cloud Run Service
relayer container · autoscaling 2..N instances
health: /api/v1/health · VPC connector for Redis"] + end + + subgraph State["Data plane"] + Redis[("Memorystore Redis
STANDARD_HA failover")] + PubSub[("Pub/Sub — 8 topics + subs")] + Secrets[("Secret Manager
4 secrets")] + end + + subgraph Signing["Signing"] + KMS["Cloud KMS
ED25519 keyring"] + end + + Stellar([Stellar RPC
Soroban + Horizon]) + GAR[(Artifact Registry
remote repo → ECR Public)] + + Callers --> Worker + Worker -->|"Bearer = static-key
x-consumer-key = user-key"| LB + LB --> CloudRun + CloudRun --> Redis + CloudRun --> PubSub + CloudRun --> Secrets + CloudRun --> KMS + CloudRun --> Stellar + GAR -.->|image pull| CloudRun +``` + +The whole stack above is provisioned by the `gcp` Terraform module in `OpenZeppelin/relayer-channels-infra`. You consume it either by cloning the repo or by referencing it as an external module from your own Terraform. + +### Components + +| Component | GCP Service | Purpose | +| --- | --- | --- | +| Edge gateway | Cloudflare Worker + KV (optional) | API-key issuance, rate limiting, usage tracking | +| Load balancer | External HTTPS LB + Google-managed cert | TLS termination, HTTPS-only, health-checked routing | +| Compute | Cloud Run v2 Service | Runs the relayer container with autoscaling | +| State | Memorystore Redis 7.2 | Transaction records, sequence counters, distributed locks | +| Queue | 8 Pub/Sub topics + 8 subscriptions | Distributed transaction processing pipeline | +| Secrets | Secret Manager | API keys, admin secrets, encryption keys | +| Signing | Cloud KMS (EC_SIGN_ED25519) | Transaction signing for fund + channel accounts | +| Image registry | Artifact Registry (remote repo) | Proxies ECR Public image for Cloud Run | +| Observability | Cloud Logging + Cloud Monitoring | Application logs, metrics | +| Networking | VPC + VPC Connector + Private Service Access | Private connectivity to Memorystore | +| Optional monitors | Cloud Functions + Cloud Scheduler | Balance-check function | + +### App Architecture (Channels Plugin Runtime) + +```mermaid +flowchart TD + Client([API Client]) + + subgraph Relayer["Relayer API (openzeppelin-relayer)"] + Auth["Bearer auth (API_KEY from Secret Manager)
+ rate-limit middleware
+ route to plugin"] + end + + subgraph Plugin["Channels Plugin Runtime"] + Pipeline["Submission pipeline
1. Validation: auth entries, payload, scheme
2. ChannelPool: acquire a channel relayer
3. Build + Simulate: assemble Soroban tx
4. Sign + FeeBump: channel signs, fund FeeBumps
5. Submit + Wait: POST to RPC, poll status"] + Mgmt["Management API
setChannelAccounts / listChannelAccounts
setFeeLimit / getFeeUsage / getFeeLimit"] + end + + Redis[("Memorystore
state + deferred jobs")] + PubSub[("Pub/Sub
jobs")] + Accts[("Fund acct
+ channel accts
(Cloud KMS-backed)")] + Stellar([Stellar RPC]) + + Client -->|"POST /api/v1/plugins/channels/call
body: { params: { xdr } } OR { params: { func, auth } }"| Auth + Auth --> Pipeline + Auth --> Mgmt + Pipeline <--> Redis + Pipeline <--> PubSub + Mgmt <--> Redis + Pipeline -->|sign| Accts + Accts -->|signed envelope| Stellar + Pipeline -->|submit + poll| Stellar +``` + +### Transaction Lifecycle + +```mermaid +sequenceDiagram + autonumber + actor Caller + participant CF as CF Worker + participant LB as HTTPS LB + participant API as Relayer API + participant Plugin as Channels Plugin + participant Redis as Memorystore + participant PS as Pub/Sub + participant KMS as Cloud KMS + participant RPC as Soroban RPC + + Caller->>CF: POST / · Bearer user-key + CF->>CF: hash + KV lookup
+ scope check + CF->>LB: rewrite Bearer→static-key
set x-consumer-key=user-key + LB->>API: TLS terminate · forward + API->>Plugin: route /plugins/channels/call + Plugin->>Redis: check fee budget + Plugin->>Redis: persist tx record + Plugin->>PS: publish transaction-request + Plugin-->>Caller: 202 Accepted + tx_id + + rect rgba(200, 220, 255, 0.4) + Note over Plugin,RPC: Async worker pickup (after 202 returns) + Plugin->>Redis: acquire channel account + Plugin->>RPC: build + simulate tx + RPC-->>Plugin: assembled envelope + Plugin->>KMS: sign w/ channel signer + KMS-->>Plugin: signature + Plugin->>KMS: fee-bump w/ fund signer + KMS-->>Plugin: fee-bumped envelope + Plugin->>RPC: submit signed envelope + RPC-->>Plugin: submitted (no hash yet) + Plugin->>PS: publish status-check-stellar + + loop until confirmed or expired + Plugin->>RPC: GET tx by hash + RPC-->>Plugin: pending / confirmed + end + + Plugin->>Redis: update tx record → confirmed + end +``` + +### Pub/Sub Queue Topology + +The relayer's distributed processing layer uses eight Pub/Sub topics with pull subscriptions. The Pub/Sub backend handles retries through Redis sorted sets (a store-and-run-when-due pattern), so there are no dead-letter topics. + +```mermaid +flowchart TD + subgraph Producers["Producers"] + APIReq[API request] + WorkerCb[Worker callback] + DueSweep[Redis due-sweep] + end + + subgraph Topics["8 Pub/Sub topics + subscriptions"] + Q1["transaction-request"] + Q2["transaction-submission"] + Q3["status-check"] + Q4["status-check-evm"] + Q5["status-check-stellar"] + Q6["notification"] + Q7["token-swap-request"] + Q8["relayer-health-check"] + end + + Workers["Cloud Run instances
One worker pool per queue type"] + DeferredQ[("Redis sorted sets
Deferred jobs with backoff")] + + Producers --> Topics + Topics -->|pull + ack| Workers + Workers -. retry with backoff .-> DeferredQ + DeferredQ -. publish when due .-> Topics +``` + +**Deferred job pattern:** Pub/Sub has no native delayed delivery, so deferred jobs (retries with backoff) are stored in Redis sorted sets keyed by their due time. A due-sweep worker runs every 1 to 5 seconds per queue type, claims due jobs from Redis, and publishes them to the topic. The topic only ever carries jobs that are already due. + +### Capacity Profile + +The reference deployment OpenZeppelin runs handles a growing load of about 2M+ transactions per day, served by roughly 2,500 relayers (fund and channel-account entities combined). The module defaults are sized conservatively for new deployments. Expect to grow into something closer to the production shape as your workload scales. + +| Resource | Module default (prod) | Current GCP deployment | +| --- | --- | --- | +| CPU | 1 vCPU | **4 vCPU** | +| Memory | 2 Gi | **8 Gi** | +| Min instances | 2 | **3** | +| Max instances | 10 | **20** | +| Redis tier | STANDARD_HA | STANDARD_HA | +| Redis memory | 5 GB | 5 GB | + +The module defaults work fine for a new deployment that is ramping up. The GCP deployment was raised above defaults to handle concurrent transaction stress testing. Tune further as your workload grows. + +--- + +## 3. Prerequisites + +GCP access, tooling, and Stellar-side accounts must be in place before you run `terraform apply`. + +### Accounts and Access + +- A **GCP project** with billing enabled and permission to create Cloud Run services, Memorystore instances, Pub/Sub topics and subscriptions, Secret Manager secrets, Cloud KMS keyrings and keys, Compute Engine load balancers, VPC connectors, Artifact Registry repositories, and IAM role bindings.
+- A **service account** for Terraform with these roles: + - `roles/editor` for general resource creation + - `roles/resourcemanager.projectIamAdmin` to grant IAM roles to service accounts + - `roles/compute.networkAdmin` for VPC peering used by Private Service Access + - `roles/cloudkms.admin` to create KMS keyrings and keys + - `roles/pubsub.admin` to create topics and subscriptions and set IAM policies + - `roles/secretmanager.admin` to create secrets and set IAM policies + - `roles/run.admin` to manage Cloud Run services + - `roles/artifactregistry.admin` to create repositories and set IAM policies +- A **domain** you control, with access to create DNS records (Route53, Cloud DNS, or another provider). +- Optionally, a **Cloudflare account** with a zone matching your domain, if you want the `/gen` API-key gateway. + +### Tooling + +| Tool | Version | Why | +| --- | --- | --- | +| Terraform | 1.5.0 or later | Module language constraints | +| Google provider | 5.0 or later, below 7.0 | Pinned in `versions.tf` | +| Cloudflare provider | ~> 5.0 | Required even when `enable_cloudflare = false` (a Terraform constraint) | +| gcloud CLI | recent stable | Auth, Artifact Registry, debugging | +| Node.js 18+ and pnpm 10+ | recent stable | Only if you modify the Channels plugin | + +### Stellar-Side Prerequisites + +- **Soroban RPC access:** for mainnet, use at least two independent private providers from different infrastructure operators (QuickNode and Ankr are the providers OpenZeppelin uses). "Independent" means different node operators, not different API wrappers on the same underlying node. The public image ships with a public RPC endpoint by default; override it with private providers after deployment (see Step 5.8). +- **Initial XLM funding:** each Stellar account must hold a minimum balance of 1 XLM (two base reserves of 0.5 XLM each). For 200 channel accounts plus the fund account, that is 201 XLM in minimum balances alone, so budget at least 250 XLM before transaction fees.
Fund the fund relayer's Stellar account first — `oz-channels bootstrap` draws channel account balances from it. + +### Reference Repositories + +| Repo | Role | Visibility | +| --- | --- | --- | +| `OpenZeppelin/relayer-channels-infra` | Terraform modules and operator CLIs (`oz-relayer`, `oz-channels`) | Public | +| `OpenZeppelin/openzeppelin-relayer` | The relayer application | Public | +| `OpenZeppelin/relayer-plugin-channels` | The Channels plugin runtime (TypeScript) | Public | + +--- + +## 4. Environments + +We recommend running separate environments with isolated state: + +| Environment | Stellar network | GCP project pattern | Cloud Run service | Pub/Sub prefix | +| --- | --- | --- | --- | --- | +| `prod` | Stellar Mainnet | Production project | `relayer-channels-service` | `relayer-mainnet-prod-` | +| `stg` | Stellar Testnet | Same or separate project | `relayer-channels-stg-service` | `relayer-testnet-stg-` | + +The module derives service naming from `app_name` plus `environment`. When `environment = "prod"`, the resource-name suffix is dropped. For other environments, names are suffixed with `-`. + +Each environment gets its own: + +- Terraform state (use separate GCS backend prefixes). +- Terraform working directory (`examples/gcp/` for stg, `examples/gcp-prod/` for prod). +- VPC connector CIDR range (for example `10.8.0.0/28` for stg and `10.9.0.0/28` for prod if they share a VPC). +- Secret Manager secrets, KMS keys, and Pub/Sub topics. +- Cloudflare Worker, if enabled, with distinct names like `relayer-channels-stg-gcp-gateway`. + +--- + +## 5. Step-by-Step Deployment + +Full provisioning sequence from authentication through end-to-end verification. Steps 5.1–5.4 set up credentials and configuration; 5.5–5.6 set up the container image and apply infrastructure; 5.7–5.11 wire up DNS, RPC endpoints, signers, and channel accounts. 
+ +### Step 5.1: Set Up Authentication + +```bash +export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json" +``` + +If your GCP org blocks `gcloud auth application-default login`, use a service account key file instead (IAM & Admin > Service Accounts > Keys > Create new key > JSON). + +### Step 5.2: Get the Module + +**Option A, reference as an external module (recommended):** + +```hcl +module "relayer_channels" { + source = "git::https://github.com/OpenZeppelin/relayer-channels-infra.git//modules/gcp?ref=main" + # ... variables +} +``` + +**Option B, clone the repo:** + +```bash +git clone https://github.com/OpenZeppelin/relayer-channels-infra.git +cd relayer-channels-infra/examples/gcp # or examples/gcp-prod +``` + +### Step 5.3: Configure the Terraform Backend + +In `versions.tf`, configure remote state. Do not keep state on a laptop in production. + +```hcl +terraform { + backend "gcs" { + bucket = "your-org-terraform-state" + prefix = "relayer-channels/prod.tfstate" + } +} +``` + +Initialize: + +```bash +terraform init +``` + +### Step 5.4: Create Your tfvars + +```bash +cp terraform.tfvars.example terraform.tfvars +``` + +Minimum required configuration: + +```hcl +project_id = "my-gcp-project" +region = "us-east1" +environment = "prod" # or "stg" +network = "default" +subnetwork = "default" +domain_name = "channels.your-company.com" +container_image = "us-east1-docker.pkg.dev/my-project/ecr-public/w5h5k2p1/openzeppelin-relayer-channels:mainnet-latest" +stellar_network = "mainnet" # or "testnet" +queue_backend = "pubsub" + +# Secrets, never commit these +relayer_api_key = "" # set via TF_VAR_relayer_api_key +channels_admin_secret = "" # set via TF_VAR_channels_admin_secret +storage_encryption_key = "" # set via TF_VAR_storage_encryption_key +``` + +Generate secrets: + +```bash +export TF_VAR_relayer_api_key="$(uuidgen | tr '[:upper:]' '[:lower:]')" +export TF_VAR_channels_admin_secret="$(openssl rand -base64 32)" +export 
TF_VAR_webhook_signing_key="$(openssl rand -hex 32)" +export TF_VAR_storage_encryption_key="$(openssl rand -base64 32)" # must be base64-encoded 32 bytes +``` + +### Step 5.5: Set Up Artifact Registry + +Cloud Run cannot pull directly from ECR Public. Configure an Artifact Registry remote repository to proxy it: + +1. GCP Console > **Artifact Registry** > **Create Repository** +2. Format: **Docker**, Mode: **Remote**, Source: **Custom**, URL: `https://public.ecr.aws` +3. Name it `ecr-public`, choose your region + +Then reference the proxied image in your `container_image` tfvar (as shown in Step 5.4). + +Tag scheme: `mainnet-` (pinned, recommended for prod), `mainnet-latest` (tracks latest), `testnet-`, `testnet-latest`. + + +The public image ships with a public Soroban RPC endpoint that rate-limits under production load. Override it with private providers after deployment in Step 5.8. + + +### Step 5.6: Plan and Apply + +```bash +terraform plan -out plan.tfplan +terraform apply plan.tfplan +``` + +The initial apply takes 10 to 15 minutes. Memorystore provisioning is the slowest leg. Private Service Access peering and SSL cert provisioning also take a few minutes. + +**Key outputs:** + +| Output | Used for | +| --- | --- | +| `cloud_run_service_name` | Service management, `gcloud run` commands | +| `cloud_run_service_uri` | Direct Cloud Run access (bypasses the LB) | +| `load_balancer_ip` | DNS record creation | +| `redis_host` | Manual Redis inspection (from a VM in the VPC) | +| `pubsub_topics` | Map of queue names to Pub/Sub topic names | +| `kms_signing_key_id` | Full KMS key ID for signer creation | +| `artifact_registry_url` | Artifact Registry URL | + +### Step 5.7: Set Up DNS and SSL + +The Google-managed SSL certificate needs DNS to point at the load balancer IP before it can provision. + +**Without Cloudflare:** + +1. Create an A record: `channels.your-company.com` to ``. +2. 
Wait 15 to 60 minutes for the certificate to provision (check status in GCP Console > Network Services > Load Balancing > certificate tab). + +**With Cloudflare:** + +1. Create a Cloudflare A record: `channels.your-company.com` to `` (proxy OFF initially, grey cloud). +2. Create a Route53 A record: `channels.your-company.com` to ``. +3. Wait for the Google-managed cert to become ACTIVE. +4. Switch Route53 to a CNAME: `channels.your-company.com` to `channels.your-company.com.cdn.cloudflare.net`. +5. Turn the Cloudflare proxy ON (orange cloud). + +### Step 5.8: Override RPC Endpoints + +The public image ships with a public Soroban RPC endpoint that rate-limits under production load. After the service is healthy, override it with private providers. This is a one-time call — the config persists in Redis across restarts. + +```bash +curl -s \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -X PATCH https://channels.your-company.com/api/v1/networks/stellar:mainnet \ + -d '{ + "rpc_urls": [ + { "url": "https://your-primary-rpc.com/key", "weight": 100 }, + { "url": "https://your-secondary-rpc.com/key", "weight": 100 } + ] + }' +``` + +Verify: + +```bash +curl -s -H "Authorization: Bearer " \ + "https://channels.your-company.com/api/v1/networks?per_page=200" \ + | jq '.data[] | select(.id=="stellar:mainnet") | .rpc_urls' +``` + +Use at least two independent providers from different operators. The relayer load-balances by weight and rotates on failure. + + +Re-run this PATCH only if you restart with `RESET_STORAGE_ON_START=true`, which wipes Redis including the network config. Normal restarts and redeployments preserve it. 
+ + +### Step 5.9: Create the Fund-Relayer Signer + +Create a Cloud KMS signer using the provided script: + +```bash +ENV=mainnet API_KEY="$TF_VAR_relayer_api_key" \ +GCP_SA_KEY_FILE="$HOME/path/to/sa-key.json" \ +./scripts/gcp-kms-signer.sh +``` + +This calls the relayer API with `"type": "google_cloud_kms"` and creates a signer backed by the Cloud KMS key that Terraform provisioned. + +Then create the fund relayer: + +```bash +curl -s -X POST https://channels.your-company.com/api/v1/relayers \ + -H "Authorization: Bearer $TF_VAR_relayer_api_key" \ + -H "Content-Type: application/json" \ + -d '{ + "id": "channels-fund", + "name": "channels-fund", + "network": "mainnet", + "signer_id": "", + "network_type": "stellar", + "paused": false, + "policies": { "min_balance": 0, "fee_payment_strategy": "relayer" } + }' +``` + +### Step 5.10: Bootstrap the Channel-Account Pool + + +Size the pool before bootstrapping. Formula: `min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor)`. Stellar settlement averages 5 to 7 seconds; use 1.5x as a safety factor. At 23 TPS sustained that gives 173 channels minimum (see §10.1 for detail). For a new deployment with no existing traffic, 50 to 100 channels is a reasonable starting point. Use `--dry-run` to preview what will be created before committing. + + +Install the `oz-channels` CLI from the `cli/` directory in this repo: + +```bash +# From the root of relayer-channels-infra +cd cli +bun install +bun run build + +# Link the CLIs globally +cd packages/oz-channels && bun link +cd ../oz-relayer && bun link + +# Verify +oz-channels --help +oz-relayer --help +``` + +Requires the [Bun](https://bun.sh) runtime (Node.js 22+ compatible). 
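
The pool-sizing formula given at the start of this step can be checked with a few lines of shell before you pick a `--to` value. This is a minimal sketch, not part of the `oz-channels` CLI: the `min_pool` function name and its argument order are assumptions made for illustration, and the only logic it implements is `ceil(target_TPS x avg_settlement_seconds x safety_factor)`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with oz-channels): computes
# ceil(target_TPS x avg_settlement_seconds x safety_factor).
# Usage: min_pool TARGET_TPS AVG_SETTLEMENT_SECONDS SAFETY_FACTOR
min_pool() {
  awk -v tps="$1" -v secs="$2" -v safety="$3" 'BEGIN {
    raw = tps * secs * safety   # channels needed before rounding
    n = int(raw)                # truncate toward zero
    if (raw > n) n++            # round up any fractional remainder
    print n
  }'
}

# 23 TPS at the 5-second end of the settlement range, 1.5x safety factor
min_pool 23 5 1.5
# Same load at the 7-second end of the range
min_pool 23 7 1.5
```

The first call reproduces the 173-channel minimum quoted above, which corresponds to the 5-second end of the settlement range. The second call shows how the requirement grows if settlement sits closer to 7 seconds, which is worth considering when choosing between the lower and upper bound for your own deployment.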
+ +Create a profile and bootstrap: + +```bash +oz-channels profile init prod-mainnet +# Prompts for: URL, API key, plugin ID (channels), admin secret, network + +# Preview +oz-channels bootstrap --to 200 --dry-run -p prod-mainnet + +# Provision +oz-channels bootstrap --to 200 -p prod-mainnet +``` + +### Step 5.11: Verify End-to-End + +```bash +# Health check +curl -sS https://channels.your-company.com/api/v1/health + +# Generate an API key (if Cloudflare is enabled) +curl -X POST https://channels.your-company.com/gen + +# Smoke test +oz-channels smoke run -p prod-mainnet +``` + +A healthy service returns `{"status":"ok"}` on the health check. The smoke test submits a test transaction end-to-end and polls for confirmation — success prints a confirmed transaction ID. If the smoke test times out without confirmation, check channel pool size (`oz-channels channels list -p prod-mainnet`) and fund account balance (`oz-relayer relayer balance channels-fund -p prod-mainnet`) before debugging further. + +--- + +## 6. Configuration Reference + +Reference for all environment variables and secrets the module manages automatically. See §11 for the full Terraform variable listing. + +### Module-Managed Container Environment Variables + +The Terraform module sets these. Do not override them unless you have a specific reason. 
+ +| Env var | Set to | Source | +| --- | --- | --- | +| `HOST` | `0.0.0.0` | Module | +| `STELLAR_NETWORK` | `var.stellar_network` | Module | +| `FUND_RELAYER_ID` | `var.fund_relayer_id` | Module | +| `API_KEY_HEADER` | `x-consumer-key` | Module, keyed to the Cloudflare Worker rewrite | +| `REPOSITORY_STORAGE_TYPE` | `redis` | Module | +| `RESET_STORAGE_ON_START` | `false` | Module | +| `METRICS_ENABLED` | `true` | Module | +| `METRICS_PORT` | `8081` | Module | +| `LOG_FORMAT` | `json` | Module | +| `LOG_LEVEL` | `var.log_level` | Module | +| `REDIS_URL` | `redis://:` | Module, derived from Memorystore | +| `REDIS_READER_URL` | `redis://:` | Module, falls back to primary on BASIC tier | +| `GCP_PROJECT_ID` | `var.project_id` | Module | +| `GCP_REGION` | `var.region` | Module | +| `DISTRIBUTED_MODE` | `var.distributed_mode` | Module | +| `QUEUE_BACKEND` | `var.queue_backend` (when distributed) | Module | +| `PUBSUB_TOPIC_PREFIX` | Auto-derived: `relayer-{network}-{environment}` | Module | +| `PUBSUB_PROJECT_ID` | `var.project_id` | Module | + +### Module-Managed Secrets (from Secret Manager) + +| Container env var | Secret Manager ID | Required? | Notes | +| --- | --- | --- | --- | +| `API_KEY` | `{app_name}-relayer-api-key` | Yes | Authenticates all API requests to the relayer | +| `PLUGIN_ADMIN_SECRET` | `{app_name}-channels-admin-secret` | Yes | Required for channel management operations | +| `WEBHOOK_SIGNING_KEY` | `{app_name}-webhook-signing-key` | Optional | Only created when `webhook_signing_key` is set in tfvars. Required if you use webhook notifications, otherwise omit it. | +| `STORAGE_ENCRYPTION_KEY` | `{app_name}-storage-encryption-key` | Optional | Only created when `storage_encryption_key` is set in tfvars. Encrypts sensitive data at rest in Redis. Strongly recommended for production. Must be base64-encoded 32 bytes (`openssl rand -base64 32`). 
| + +The `lifecycle { ignore_changes = [secret_data] }` on secret versions means that once a secret is created, Terraform will not overwrite the value if you rotate it through `gcloud` or the Console. + +**Rotation procedure:** + +```bash +# Update the secret +echo -n "new-value" | gcloud secrets versions add \ + relayer-channels-relayer-api-key --data-file=- \ + --project=your-project + +# Force Cloud Run to pick up the new value +gcloud run services update relayer-channels-service \ + --region=us-east1 --project=your-project \ + --update-labels="redeploy=$(date +%s)" +``` + +### Production Reference Values + +If you are targeting OpenZeppelin's reference scale (about 2M+ tx/day), these are the env-var values to tune: + +```hcl +container_environment = [ + # Worker concurrency + { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY", value = "1" }, + + # API + plugin concurrency + { name = "RELAYER_CONCURRENCY_LIMIT", value = "800" }, + { name = "PLUGIN_MAX_CONCURRENCY", value = "8000" }, + { name = "MAX_CONNECTIONS", value = "4000" }, + + # Timeouts + { name = "REQUEST_TIMEOUT_SECONDS", value = "60" }, + { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" }, + { name = "PLUGIN_GLOBAL_TIMEOUT_MS", value = "55000" }, + { name = "PLUGIN_POLLING_TIMEOUT_MS", value = "45000" }, + + # Rate limits + { name = "RATE_LIMIT_REQUESTS_PER_SECOND", value = "400" }, + + # Redis 
pools + { name = "REDIS_POOL_MAX_SIZE", value = "3000" }, + { name = "REDIS_READER_POOL_MAX_SIZE", value = "3000" }, + + # Transaction cleanup + { name = "TRANSACTION_EXPIRATION_HOURS", value = "0.1" }, + + # Contract-level pool isolation + { name = "LIMITED_CONTRACTS", value = "C,C" }, + { name = "CONTRACT_CAPACITY_RATIO", value = "0.6" }, +] +``` + +### Environment-Based Defaults + +| Setting | Production | Non-production | +| --- | --- | --- | +| Min Cloud Run instances | 2 | 1 | +| Max Cloud Run instances | 10 | 4 | +| CPU always allocated | Yes | No | +| Redis tier | STANDARD_HA (failover) | BASIC | +| Redis memory | 5 GB | 1 GB | +| LB deletion protection | Enabled | Disabled | +| Log retention | 30 days | 7 days | + +--- + +## 7. Operational Playbook + +Day-2 operations: routine deploys, rollbacks, scaling, channel-pool management, and observability. For initial provisioning, see §5. + +### 7.1 Deploys + +Routine deploy (new container image): + +1. Build and push the new image to Artifact Registry (or update the remote repo tag). +2. Update `container_image` in tfvars to the new tag. +3. Run `terraform apply`. Cloud Run creates a new revision and routes traffic to it. + +### 7.2 Rollbacks + +Set `container_image` back to the previous tag and run `terraform apply`. Cloud Run keeps previous revisions available for instant rollback. + +### 7.3 Scaling + +Adjust in tfvars: + +```hcl +cpu = "4" +memory = "8Gi" +min_instance_count = 3 +max_instance_count = 20 +``` + +Running `terraform apply` applies the change without interruption. 
+ +### 7.4 Channel-Pool Management + +```bash +# Add slots 201..400 +oz-channels bootstrap --from 201 --to 400 -p prod-mainnet + +# List current channels +oz-channels channels list -p prod-mainnet + +# Add or remove individual channels +oz-channels channels add channel-0050 -p prod-mainnet +oz-channels channels remove channel-0050 -p prod-mainnet +``` + +### 7.5 Monitoring Pub/Sub + +Check queue health in **GCP Console > Pub/Sub > Subscriptions > Metrics tab**: + +| Metric | Watch for | +| --- | --- | +| `num_undelivered_messages` | A growing backlog means processing is falling behind | +| `oldest_unacked_message_age` | Above 60s sustained means workers may be stuck | +| Pull/Ack operations | Healthy when messages are consumed as fast as they arrive | + +### 7.6 Monitoring Redis + +Check in **GCP Console > Memorystore > Instance > Monitoring tab**: + +| Metric | Watch for | +| --- | --- | +| CPU utilization | Spikes above 75% sustained | +| Memory usage | Climbing past 70% | +| Connected clients | Approaching the connection limit | + +### 7.7 Inspecting Transactions + +```bash +oz-relayer tx show -r channels-fund -p prod-mainnet --json +oz-relayer tx list -r channels-fund --status pending -p prod-mainnet +oz-relayer relayer balance channels-fund -p prod-mainnet +``` + +### 7.8 Observability + +The relayer emits structured JSON logs and Prometheus-format metrics. On GCP, these map to Cloud Logging and Cloud Monitoring. + +#### Cloud Logging + +Cloud Run streams `stdout` and `stderr` to Cloud Logging automatically. With `LOG_FORMAT=json`, the relayer produces structured entries with fields like `level`, `target`, `span.tx_id`, `span.relayer_id`, and `span.request_id`. 
+ +Viewing logs: + +```bash +# Recent errors +gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' \ + --project=your-project --limit=20 --freshness=1h --format='value(textPayload)' + +# Filter by transaction ID +gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:""' \ + --project=your-project --limit=20 --freshness=1h + +# Live tail +gcloud logging tail 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service"' \ + --project=your-project +``` + +In the Console: Cloud Logging > Logs Explorer, then filter by `resource.type="cloud_run_revision"` and `resource.labels.service_name=""`. + +#### Cloud Monitoring Built-In Metrics + +Cloud Run and Pub/Sub emit metrics to Cloud Monitoring automatically, with no agent required. + +Cloud Run metrics (GCP Console > Cloud Run > Service > Metrics tab): + +| Metric | What it tells you | +| --- | --- | +| `run.googleapis.com/container/cpu/utilization` | CPU usage per instance. Sustained above 80% means scale up. | +| `run.googleapis.com/container/memory/utilization` | Memory usage. Sustained above 70% risks OOM. | +| `run.googleapis.com/request_count` | Request throughput by response code. Watch for 5xx spikes. | +| `run.googleapis.com/request_latencies` | p50/p95/p99 latency. Watch for degradation. | +| `run.googleapis.com/container/instance_count` | Active instances. Confirms autoscaling behavior. | +| `run.googleapis.com/container/startup_latencies` | Cold-start time. High values affect first-request latency. | + +Pub/Sub metrics (GCP Console > Pub/Sub > Subscription > Metrics tab): + +| Metric | What it tells you | +| --- | --- | +| `pubsub.googleapis.com/subscription/num_undelivered_messages` | Queue depth. A growing backlog means processing is falling behind. | +| `pubsub.googleapis.com/subscription/oldest_unacked_message_age` | How long the oldest message has waited. 
Above 60s sustained means workers may be stuck. | +| `pubsub.googleapis.com/subscription/pull_message_operation_count` | Pull throughput. Confirms workers are active. | +| `pubsub.googleapis.com/subscription/ack_message_operation_count` | Ack throughput. Confirms messages are being processed. | + +Memorystore metrics (GCP Console > Memorystore > Instance > Monitoring tab): + +| Metric | What it tells you | +| --- | --- | +| `redis.googleapis.com/stats/cpu_utilization` | Redis CPU. Spikes above 75% sustained need attention. | +| `redis.googleapis.com/stats/memory/usage_ratio` | Memory usage. Climbing past 70% means you should plan capacity. | +| `redis.googleapis.com/stats/connected_clients` | Connection count. Watch for approaching limits. | +| `redis.googleapis.com/stats/commands_processed` | Command throughput. Correlates with transaction volume. | + +#### Log-Based Metrics + +Create custom metrics from log patterns in **Cloud Logging > Log-based Metrics > Create Metric**: + +| Metric name | Filter | Purpose | +| --- | --- | --- | +| `relayer/errors` | `resource.type="cloud_run_revision" AND severity>=ERROR` | Total error rate | +| `relayer/pool_capacity` | `textPayload:"POOL_CAPACITY"` | Channel pool exhaustion events | +| `relayer/provider_paused` | `textPayload:"provider paused"` | RPC failover events | +| `relayer/tx_confirmed` | `textPayload:"confirmed"` | Transaction confirmation rate | + +Or through gcloud: + +```bash +gcloud logging metrics create relayer-errors \ + --project=your-project \ + --description="Relayer error count" \ + --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' +``` + +#### Alerting + +Create alert policies in **Cloud Monitoring > Alerting > Create Policy**: + +| Alert | Metric | Condition | Severity | +| --- | --- | --- | --- | +| High error rate | `relayer/errors` (log-based) | More than 50 errors in 5 min | Critical | +| Cloud Run high CPU | 
`container/cpu/utilization` | Above 80% for 10 min | Warning | +| Cloud Run high memory | `container/memory/utilization` | Above 70% for 10 min | Warning | +| Pub/Sub backlog growing | `subscription/num_undelivered_messages` | Above 5000 for 10 min | Warning | +| Pub/Sub old messages | `subscription/oldest_unacked_message_age` | Above 300s for 5 min | Critical | +| Pool exhaustion | `relayer/pool_capacity` (log-based) | Above 0 in 5 min | Critical | + +Configure notification channels (email, Slack, PagerDuty) in **Cloud Monitoring > Alerting > Notification Channels**. + +#### Prometheus Metrics + +The relayer exposes Prometheus-format metrics on port `8081` at `/debug/metrics/scrape` (enabled by `METRICS_ENABLED=true`). When `enable_prometheus = true`, the Cloud Run service account has `monitoring.metricWriter` permissions for Google Cloud Managed Prometheus. + +To scrape these metrics: + +- Use Google Cloud Managed Prometheus with a sidecar collector. +- Run a self-hosted Prometheus instance that scrapes the Cloud Run service. +- Rely on the built-in Cloud Run metrics above for most operational needs. + +### 7.9 Stellar-Side Monitoring + +GCP metrics reflect service health. These signals reflect Stellar network health; monitor both. + +**Fund account balance:** + +```bash +oz-relayer relayer balance channels-fund -p prod-mainnet +``` + +Alert when the balance drops below 50 XLM. A depleted fund account fails all fee-bumps silently — transactions submit but cannot be paid for. + +**Ledger close time:** Stellar closes a ledger roughly every 5 seconds under normal conditions. Sustained close times above 10 seconds indicate network stress; settlement latency will exceed the assumptions used in your channel pool sizing. Query Horizon to check: + +```bash +curl -sS "https://horizon.stellar.org/ledgers?order=desc&limit=5" | jq '._embedded.records[] | {sequence, closed_at}' +``` + +**`TRY_AGAIN_LATER` in logs:** Horizon is rejecting transactions due to fee competition. 
This is a Stellar congestion event, not a service failure. Raise `MAX_FEE` (see §10.7). If `TRY_AGAIN_LATER` appears alongside `provider paused`, check RPC provider health first — an unresponsive provider can force retries against a congested fallback. + +**RPC provider health:** Confirm both endpoints are reachable: + +```bash +curl -sS -X POST \ + -H 'Content-Type: application/json' \ + -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq . +``` + +--- + +## 8. Debugging Guide + +### How to Think About Errors + +Almost every failure in this system belongs to one of several layers, and the fastest way to debug is to decide which layer owns the symptom before you start reading logs. A request travels from the edge (Cloudflare) to the load balancer, to Cloud Run, into the Channels plugin. The plugin then talks to Redis, Pub/Sub, Cloud KMS, and the Stellar RPC. A 5xx returned at the edge is a different problem from a transaction that was accepted, queued, signed, and then rejected by Horizon. + +So when something breaks, work in this order: + +1. **Where did it fail?** A request that never returns a `tx_id` failed before or during the synchronous path (edge, LB, auth, fee budget, enqueue). A request that returned a `tx_id` but never confirmed failed in the async path (channel acquisition, build/simulate, sign, fee-bump, submit, status poll). +2. **What layer owns that step?** Match it to a component: auth and rate limits live at the edge and the relayer API, sequence and channel contention live in Redis and the plugin, signing lives in KMS, and the final accept or reject comes from the RPC and Horizon. +3. **Pull the logs for that layer** using the entry points below, then match against the common patterns. + +The point of this ordering is to avoid reading the wrong logs. Pool exhaustion, sequence drift, and an RPC throttle all look like "transactions are failing" from the outside, but each one lives in a different layer and has a different fix. 
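The first triage question above can be written down as a small decision helper. This is a minimal sketch rather than part of any tooling: the function name `triage_path` and its `yes`/`no` arguments are hypothetical, and the layer lists are copied from the ordering described above.

```bash
# Decide which path owns a failure.
#   $1 = "yes" if the request returned a tx_id, otherwise "no"
#   $2 = "yes" if the transaction eventually confirmed, otherwise "no"
triage_path() {
  local has_tx_id="$1" confirmed="$2"
  if [ "$has_tx_id" != "yes" ]; then
    # No tx_id: the failure happened before or during the synchronous path.
    echo "sync: edge, LB, auth, fee budget, enqueue"
  elif [ "$confirmed" != "yes" ]; then
    # tx_id returned but never confirmed: the failure is in the async path.
    echo "async: channel acquisition, build/simulate, sign, fee-bump, submit, status poll"
  else
    echo "ok: accepted and confirmed"
  fi
}
```

For example, `triage_path yes no` points at the async path, so the next step is the Redis, KMS, and RPC logs rather than the edge or load balancer.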
+ +### Entry Points + +| You have | Start with | +| --- | --- | +| Transaction ID | `oz-relayer tx show -r channels-fund --json -p ` | +| Error message | Search Cloud Logging for the error pattern | +| Time window | `gcloud logging read` with `--freshness` | +| Stellar tx hash | Query Horizon, then work backwards to the relayer's tx record | +| "What's failing right now" | Filter logs by `severity>=ERROR` | + +### Common Log Patterns + +| Pattern | What it means | +| --- | --- | +| `provider paused` | RPC failover triggered | +| `sequence`, `counter` | Sequence-number drift or contention | +| `POOL_CAPACITY` | Channel-account pool exhausted | +| `LOCKED_CONFLICT` | Two workers tried to acquire the same channel | +| `TRY_AGAIN_LATER` | Horizon-side throttling | + +### Redis Inspection + +Connect from a VM in the same VPC: + +```bash +redis-cli -h -p +KEYS *tx:* +GET "oz-relayer:relayer:channels-fund:tx:" +``` + +--- + +## 9. Security Model + +Covers secrets handling, network isolation, IAM role assignments, TLS posture, and KMS key management. Review before modifying IAM bindings or network ingress settings. + +### 9.1 Secrets Handling + +All secrets are stored in Secret Manager. They are currently passed as plain environment variables to Cloud Run. See Known Issues for the plan to switch to `secret_key_ref` references. + +### 9.2 Network Isolation + +- **Cloud Run ingress:** restricted to internal plus load balancer traffic (`INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER` in production, `INGRESS_TRAFFIC_ALL` for testing). +- **Cloud Run egress:** a VPC Connector with `PRIVATE_RANGES_ONLY`. Private traffic goes through the VPC (to Memorystore), and public traffic (Stellar RPC, KMS API) goes direct. +- **Memorystore:** reachable only through Private Service Access (VPC peering). No public IP. +- **Pub/Sub:** IAM-scoped. Only the Cloud Run service account has publisher and subscriber access to the relayer's topics. 
+ +### 9.3 IAM Least-Privilege + +The Cloud Run service account (`{app_name}-run`) has: + +| Role | Scope | Purpose | +| --- | --- | --- | +| `secretmanager.secretAccessor` | Per-secret | Read secrets at startup | +| `monitoring.metricWriter` | Project | Write custom metrics | +| `logging.logWriter` | Project | Write application logs | +| `monitoring.viewer` | Project | Read Pub/Sub backlog depth | +| `cloudkms.signerVerifier` | Per-key | Sign transactions | +| `cloudkms.publicKeyViewer` | Per-key | Read the public key | +| `pubsub.publisher` | Per-topic | Publish job messages | +| `pubsub.subscriber` | Per-subscription | Pull and ack messages | +| `artifactregistry.reader` | Per-repository | Pull container images | + +### 9.4 TLS Posture + +- **Load balancer:** Google-managed SSL certificate, HTTPS on 443, HTTP redirects to HTTPS. +- **Memorystore:** transit encryption is disabled, since Private Service Access provides network-level isolation. Enable it if your compliance requirements call for it and the relayer binary supports TLS (see Known Issues). +- **Cloudflare to LB:** set the Cloudflare zone SSL mode to "Full" for end-to-end TLS. + +### 9.5 Cloud KMS for Stellar Signers + +- **Key algorithm:** `EC_SIGN_ED25519` (the Stellar-compatible ED25519 curve). +- **Protection level:** `SOFTWARE`. HSM is also supported but adds latency. +- **IAM:** the Cloud Run SA has `signerVerifier` and `publicKeyViewer` on the key. +- **Rotation:** provision a new key, register a new signer and relayer, fund the new on-chain account, drain the old one, then retire it. + +--- + +## 10. Key Gotchas + +Operational sharp edges encountered in production deployments. Each item describes a failure mode, its cause, and the fix. 
+ +### 10.1 Channel-Account Exhaustion (`POOL_CAPACITY`) + +Sizing formula: + +``` +min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor) +``` + +At about 23 TPS sustained, with roughly 5s Stellar settlement and a 1.5x safety factor: `23 x 5 x 1.5 = 173` channels minimum. + +Recovery: `oz-channels bootstrap --from --to `. + +### 10.2 SSL Certificate Provisioning + +Google-managed certificates need DNS to point at the LB IP before they provision. With Cloudflare enabled, you have to temporarily point DNS straight at the LB IP (bypassing the Cloudflare proxy), wait for the cert to become ACTIVE, then switch to the Cloudflare CNAME. + + +If the cert is stuck in `FAILED_NOT_VISIBLE` for more than 30 minutes, it usually needs to be recreated. Bump the cert name suffix in `load-balancer.tf` (for example `-cert-v2` to `-cert-v3`) and re-apply. The `create_before_destroy` lifecycle provisions the new cert before removing the old one, so there is no downtime. + + +### 10.3 VPC Connector CIDR Overlap + +If you run multiple environments (stg and prod) in the same VPC, each one needs a unique `connector_ip_cidr_range` (for example `10.8.0.0/28` for stg and `10.9.0.0/28` for prod). + +### 10.4 Private Service Access (Shared Connection) + +A VPC can hold only one Private Service Access connection to `servicenetworking.googleapis.com`. If stg creates it first, prod's apply will fail unless `update_on_creation_fail = true` is set on the `google_service_networking_connection` resource. The module handles this. + +### 10.5 Pub/Sub Topic Prefix and Image Compatibility + +The `PUBSUB_TOPIC_PREFIX` env var has to match what the container image expects. Different image versions may or may not append a trailing dash to the prefix. If you see "topic does not exist" errors with double dashes (`relayer-mainnet-prod--`), remove the trailing dash from the prefix. If topics are missing entirely (no dash), add it back. 
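The two prefix failure modes in §10.5 can be told apart mechanically by comparing a topic name from the error against the expected single-dash shape. The helper below is a sketch under assumptions: `check_topic_name` is a hypothetical name, and the queue name you pass in is a placeholder for whatever names your image version creates.

```bash
# Classify a topic name against the expected "<prefix>-<queue>" shape.
#   $1 = PUBSUB_TOPIC_PREFIX as configured (with or without a trailing dash)
#   $2 = queue name (placeholder; use the names your image creates)
#   $3 = topic name taken from the error message
check_topic_name() {
  local prefix="$1" queue="$2" topic="$3"
  local base="${prefix%-}"   # the prefix with one trailing dash removed, if present
  if [ "$topic" = "${base}-${queue}" ]; then
    echo "ok"
  elif [ "$topic" = "${base}--${queue}" ]; then
    echo "double-dash: remove the trailing dash from PUBSUB_TOPIC_PREFIX"
  elif [ "$topic" = "${base}${queue}" ]; then
    echo "missing-dash: add the trailing dash to PUBSUB_TOPIC_PREFIX"
  else
    echo "unrecognized"
  fi
}
```

This only classifies names. Whether a given image appends the dash itself still has to be confirmed against the topics that actually exist in the project.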
+ +### 10.6 STORAGE_ENCRYPTION_KEY Format + +The encryption key has to be base64-encoded 32 bytes (44 characters with `=` padding). Generate it with `openssl rand -base64 32`. Hex-encoded keys fail with the error "Invalid key length: expected 32 bytes, got 0". + +### 10.7 Fee-Bump Tuning Under Congestion + +Set the maximum fee through the `MAX_FEE` env var (default `1000000` stroops, which is 0.1 XLM). Under network congestion, raise it to `10000000` (1 XLM). The Channels plugin uses static fees, so it does not dynamically bump on `INSUFFICIENT_FEE`. + +--- + +## 11. Terraform Variables Reference + +Complete listing of all module variables. Required variables must be set in `terraform.tfvars`; optional variables are listed with their module defaults. + +### Required + +| Name | Type | Description | +| --- | --- | --- | +| `project_id` | `string` | GCP project ID | +| `region` | `string` | GCP region (for example `us-east1`) | +| `environment` | `string` | Deployment environment (`prod`, `stg`). 1 to 16 chars. 
| +| `network` | `string` | VPC network name or self_link | +| `subnetwork` | `string` | Subnet name or self_link | +| `domain_name` | `string` | FQDN for the service | +| `container_image` | `string` | Container image URI | +| `relayer_api_key` | `string` | Relayer API key (sensitive) | +| `channels_admin_secret` | `string` | Admin secret (sensitive) | + +### Optional, Core + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `app_name` | `string` | `"relayer-channels"` | Resource name prefix | +| `name_suffix_environment` | `bool` | `true` | Append `-{env}` to names (auto-off for prod) | +| `labels` | `map(string)` | `{}` | Labels for all resources | + +### Optional, Networking + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `connector_machine_type` | `string` | `"e2-micro"` | VPC connector machine type | +| `connector_min_instances` | `number` | `2` | Min connector instances | +| `connector_max_instances` | `number` | `3` | Max connector instances | +| `connector_ip_cidr_range` | `string` | `"10.8.0.0/28"` | CIDR for the VPC connector (/28, must not overlap) | + +### Optional, Container / Cloud Run + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `container_port` | `number` | `8080` | Container port | +| `cpu` | `string` | `"1"` | CPU allocation (`"1"`, `"2"`, `"4"`) | +| `memory` | `string` | `"2Gi"` | Memory allocation | +| `min_instance_count` | `number` | `null` | Min instances. Auto: 2 (prod), 1 (non-prod) | +| `max_instance_count` | `number` | `null` | Max instances. Auto: 10 (prod), 4 (non-prod) | +| `cpu_always_allocated` | `bool` | `null` | Always allocate CPU. 
Auto: true (prod) | +| `health_check_path` | `string` | `"/api/v1/health"` | Probe path | +| `container_environment` | `list(object)` | `[]` | Additional env vars (user overrides win) | + +### Optional, Application + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `stellar_network` | `string` | `"testnet"` | `mainnet` or `testnet` | +| `fund_relayer_id` | `string` | `"channels-fund"` | Fund relayer ID | +| `distributed_mode` | `bool` | `true` | Enable distributed queue processing | +| `queue_backend` | `string` | `"pubsub"` | `pubsub` (recommended) or `redis` | +| `log_level` | `string` | `"warn"` | Application log level | + +### Optional, Secrets + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `webhook_signing_key` | `string` | `""` | Webhook signing key (sensitive). Only set it if you use webhook notifications, otherwise omit it. | +| `storage_encryption_key` | `string` | `""` | Encrypts data at rest in Redis. Must be base64-encoded 32 bytes (sensitive). Strongly recommended for production. | + +### Optional, Redis + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `redis_tier` | `string` | `null` | `BASIC` or `STANDARD_HA`. Auto per environment. | +| `redis_memory_size_gb` | `number` | `null` | Memory in GB. Auto: 5 (prod), 1 (non-prod). | +| `redis_version` | `string` | `"REDIS_7_2"` | Redis version | + +### Optional, Cloudflare + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `enable_cloudflare` | `bool` | `false` | Enable the Cloudflare Workers gateway | +| `cloudflare_zone_id` | `string` | `""` | Required when Cloudflare is enabled | +| `cloudflare_account_id` | `string` | `""` | Required when Cloudflare is enabled | +| `relayer_static_api_key` | `string` | `""` | Static API key injected by the Worker upstream (sensitive). Use the same value as `relayer_api_key`. | +| `key_salt` | `string` | `""` | Salt for hashing user API keys before storing in KV (sensitive). 
Generate with `openssl rand -base64 32`. | +| `gen_ip_rate_hour` | `number` | `2` | Max `/gen` per IP per hour | +| `relay_rpm_per_key` | `number` | `60` | Max relay RPM per key | + +### Optional, Load Balancer + +| Name | Type | Default | Description | +| --- | --- | --- | --- | +| `lb_deletion_protection` | `bool` | `null` | Auto: true (prod), false (non-prod) | +| `lb_log_sample_rate` | `number` | `0` | Request log sampling (0 disables it) | + +### Outputs + +| Name | Description | +| --- | --- | +| `cloud_run_service_name` | Cloud Run service name | +| `cloud_run_service_uri` | Cloud Run service URI (internal) | +| `cloud_run_service_account_email` | Cloud Run service account email | +| `load_balancer_ip` | Global static IP of the HTTPS LB | +| `domain_name` | Service domain name | +| `redis_host` / `redis_port` / `redis_read_endpoint` | Memorystore connection info | +| `pubsub_topics` / `pubsub_subscriptions` | Map of queue names to Pub/Sub resource names | +| `secret_ids` | Map of secret names to Secret Manager IDs | +| `kms_key_ring_name` / `kms_signing_key_name` / `kms_signing_key_id` | Cloud KMS key info | +| `artifact_registry_repository` / `artifact_registry_url` | Artifact Registry info | +| `cloudflare_worker_name` | Worker name (null if disabled) | + +--- + +## 12. Known Issues + +Tracked limitations with current workarounds. These are active constraints, not historical bugs. + +### Memorystore Redis TLS + +Transit encryption is disabled because the relayer binary is not compiled with TLS support for Redis connections. This is acceptable because Memorystore is reachable only through Private Service Access (VPC peering), so traffic never leaves Google's network. + +### Secret Manager References + +Secrets are currently passed as plain environment variables to Cloud Run instead of using `secret_key_ref` Secret Manager references. This is a workaround for a 0-byte issue hit during the initial deployment. 
The plan is to switch back to Secret Manager references for a better security posture. diff --git a/src/navigation/stellar.json b/src/navigation/stellar.json index 867bb3bd..3b2e93d7 100644 --- a/src/navigation/stellar.json +++ b/src/navigation/stellar.json @@ -516,6 +516,11 @@ "type": "page", "name": "Stellar X402 Facilitator Guide", "url": "/relayer/1.5.x/guides/stellar-x402-facilitator-guide" + }, + { + "type": "page", + "name": "GCP Operator Deployment Guide", + "url": "/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide" } ] }, From 03936e4d55b5215b54262ba2ef245d3f9b7de332 Mon Sep 17 00:00:00 2001 From: stevep0z <255929980+stevep0z@users.noreply.github.com> Date: Tue, 23 Jun 2026 17:48:38 -0500 Subject: [PATCH 2/4] chore: add Stellar Channels AWS operator deployment guide Add step-by-step operator deployment guide for running the hosted Stellar Channels service on AWS (ECS Fargate, ElastiCache, SQS, KMS, SSM). Includes architecture diagrams, configuration reference, operational playbook, debugging guide with sync/async path mental model, security model, and gotchas. Update nav to list the AWS guide alongside the existing GCP guide. --- .../stellar-relayer-aws-operator-guide.mdx | 1644 +++++++++++++++++ src/navigation/stellar.json | 5 + 2 files changed, 1649 insertions(+) create mode 100644 content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx diff --git a/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx b/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx new file mode 100644 index 00000000..51934d36 --- /dev/null +++ b/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx @@ -0,0 +1,1644 @@ +--- +title: 'Hosted Stellar Relayer on AWS: Operator Deployment Guide' +--- + +A step-by-step guide for infrastructure teams (such as Blockdaemon or SDF) deploying a hosted Stellar relayer service that mirrors OpenZeppelin's existing production setup. 
+ +**Audience:** infrastructure operators who have run production AWS workloads but are new to OpenZeppelin's relayer stack. + +**Outcome:** a hosted Stellar Channels service in your own AWS account capable of serving the same workload profile OpenZeppelin currently runs (roughly 2M+ transactions per day across roughly 2,500 relayers). + +For the GCP deployment, see the [GCP Operator Deployment Guide](/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide). + +--- + +## 1. Overview + +OpenZeppelin currently runs a hosted Stellar relayer service at `channels.openzeppelin.com` (mainnet) and `channels.openzeppelin.com/testnet` (testnet). The service absorbs the operational complexity of parallel Stellar transaction submission (channel-account pool management, fee bumping, sequence-number arbitration, multi-RPC failover) and exposes a simple HTTP API to downstream callers. + +This guide is for infrastructure teams deploying a hosted relayer service for SDF that provides the same throughput as OpenZeppelin's service. Blockdaemon is the first such operator; this guide is written to be portable to others. + +### What You Will End Up With + +After following this guide, you will have: + +- A production-ready hosted Stellar Channels service running in your own AWS account, exposed at a domain you control (for example, `channels.your-company.com`). +- An ECS Fargate-backed compute tier with autoscaling, fronted by an Application Load Balancer with TLS 1.3. +- ElastiCache Redis (in production: multi-AZ with failover) for state and rate-limit accounting. +- Eight SQS queues and DLQs handling the distributed transaction-processing pipeline. +- Optional Cloudflare Worker fronting the ALB for self-serve API-key issuance (the `/gen` flow), per-user rate limiting, and usage analytics. +- AWS SSM Parameter Store SecureString entries for every secret. No secrets in environment variables, no secrets in container images. +- **Observability:** CloudWatch Logs and CloudWatch Metrics by default. 
Optionally, an Amazon Managed Prometheus workspace that remote-writes the same metric set if you operate your own Grafana or alerting stack. +- **Alerting:** CloudWatch Alarms wired to SNS topics that fan out to PagerDuty (or your on-call channel of choice). The module provisions the alarm resources but leaves `alarm_actions` empty by default so you bind the SNS topic ARNs that route to your existing incident pipeline. +- Optional Lambda functions for fund-relayer balance monitoring and ECS auto-restart on alarm. + +The system handles two transaction-submission modes: + +- **Signed XDR mode:** the caller signs a complete Stellar transaction envelope and submits it; the service only handles fee-bumping and submission. +- **Soroban `func` + `auth` mode:** the caller submits a Soroban host function and authorization entries; the service assembles, simulates, signs with a channel account, fee-bumps, and submits. + +### What This Guide Assumes You Already Have + +- Strong AWS infrastructure background (VPC, ECS, ALB, IAM, Route53, ACM, ElastiCache, SQS). +- Terraform fluency (1.5.0 or later). +- A target AWS account where you can create the full resource set, or an account-pair pattern with Route53 in a separate account (cross-account assume-role is supported). +- A domain in Route53 you control. +- (Optional) A Cloudflare account if you want the `/gen` API-key gateway. + +If you need a relayer for your own development, or for any other lower-throughput use case, see the upstream Stellar Operator Guide (different audience, different deployment shape). + +--- + +## 2. Architecture + +### Cloud Architecture + +```mermaid +flowchart TD + Callers([Public callers]) + + subgraph Edge["Edge (Cloudflare, optional)"] + Worker["Cloudflare Worker
• /gen + /testnet/gen — issues API keys
• KV-backed auth, hashes with KEY_SALT
• per-IP / per-key rate limits
• rewrites Bearer→static, sets x-consumer-key
• usage tracking via Analytics Engine"] + end + + subgraph AWSEdge["AWS Edge"] + ALB["Application Load Balancer
TLS 1.3 · HTTPS-only · HTTP→HTTPS redirect
ingress restricted to Cloudflare IPs"] + end + + subgraph Compute["Compute"] + Tasks["ECS Fargate Service
relayer container · autoscaling 2..N tasks
health: /api/v1/health · optional CW sidecar"] + end + + subgraph State["Data plane"] + Redis[("ElastiCache Redis
multi-AZ failover")] + SQS[("SQS — 8 queues + DLQs")] + SSM[("SSM Parameter Store
SecureString secrets")] + end + + subgraph Observability["Observability"] + CW["CloudWatch alarms
queue depth · DLQ age · ECS health"] + end + + Stellar([Stellar RPC
Soroban + Horizon]) + ECR[(ECR — image source)] + + Callers --> Worker + Worker -->|"Bearer = static-key
x-consumer-key = user-key"| ALB + ALB --> Tasks + Tasks --> Redis + Tasks --> SQS + Tasks --> SSM + Tasks --> Stellar + ECR -.->|image pull| Tasks + SQS -.-> CW + Tasks -.-> CW +``` + +**Module:** the entire stack above is provisioned by the `relayer-channels` Terraform module in `OpenZeppelin/relayer-channels-infra`. Operators consume it either by cloning the repo (standalone mode) or referencing it as an external module from their own Terraform. + +**Components at a Glance:** + +| Component | AWS resource | Purpose | +| --- | --- | --- | +| Edge gateway | Cloudflare Worker + KV namespace (optional) | API-key issuance, per-key rate limiting, usage tracking, static-key injection upstream | +| Load balancer | Application Load Balancer + ACM cert | TLS termination, HTTPS-only, health-checked routing to Fargate | +| Compute | ECS Fargate Service (`launch_type = "FARGATE"`) | Runs the relayer container (and optional metrics sidecar). Autoscaling by CPU. | +| State | ElastiCache Redis 7.1 replication group, in-transit TLS | Relayer state (transaction records, sequence counters), distributed locks, rate-limit buckets | +| Queue | 8 SQS standard queues + 8 DLQs | Distributed transaction processing (request → submit → status check → notification, etc.) 
| +| Secrets | SSM Parameter Store `SecureString` | `API_KEY`, `PLUGIN_ADMIN_SECRET`, `WEBHOOK_SIGNING_KEY`, `STORAGE_ENCRYPTION_KEY` | +| Observability | CloudWatch Logs + Metrics + (optional) Amazon Managed Prometheus | App logs (JSON format), per-queue depth alarms, optional metrics-remote-write | +| Image registry | ECR Public (module-created) or your own ECR | Container image source for the Fargate task | +| Signing | KMS (out-of-module, operator-provisioned per relayer-side signer config) | ED25519 keys for the fund relayer + channel-account signers | +| Optional monitors | Lambda + EventBridge | Balance-check Lambda; ECS restart-on-alarm Lambda | + +### App Architecture (Channels Plugin Runtime) + +```mermaid +flowchart TD + Client([API Client]) + + subgraph Relayer["Relayer API (openzeppelin-relayer)"] + Auth["Bearer auth (API_KEY from SSM)
+ rate-limit middleware
+ route to plugin"] + end + + subgraph Plugin["Channels Plugin Runtime"] + Pipeline["Submission pipeline
1. Validation — auth entries, payload, scheme
2. ChannelPool — acquire a channel relayer
3. Build + Simulate — assemble Soroban tx
4. Sign + FeeBump — channel signs; fund FeeBumps
5. Submit + Wait — POST to RPC, poll status"] + Mgmt["Management API
setChannelAccounts / listChannelAccounts
setFeeLimit / getFeeUsage / getFeeLimit"] + end + + Redis[("Redis
state")] + SQS[("SQS
jobs")] + Accts[("Fund acct
+ channel accts
(KMS-backed)")] + Stellar([Stellar RPC]) + + Client -->|"POST /api/v1/plugins/channels/call
body: { params: { xdr } } OR { params: { func, auth } }"| Auth + Auth --> Pipeline + Auth --> Mgmt + Pipeline <--> Redis + Pipeline <--> SQS + Mgmt <--> Redis + Pipeline -->|sign| Accts + Accts -->|signed envelope| Stellar + Pipeline -->|submit + poll| Stellar +``` + +**Source references:** +- Relayer API: `openzeppelin-relayer` +- Channels Plugin: `relayer-plugin-channels` (see `src/plugin/` for the runtime, `src/client/` for the TypeScript SDK) +- The Docker image deployed to Fargate is built from `openzeppelin-relayer/examples/channels-plugin-example` + +### Transaction Lifecycle + +End-to-end flow for a Soroban `func` + `auth` submission through the hosted service. + +```mermaid +sequenceDiagram + autonumber + actor Caller + participant CF as CF Worker + participant ALB + participant API as Relayer API + participant Plugin as Channels Plugin + participant Redis + participant SQS + participant KMS + participant RPC as Soroban RPC + + Caller->>CF: POST / · Bearer user-key + CF->>CF: hash + KV lookup
+ scope check + CF->>ALB: rewrite Bearer→static-key
set x-consumer-key=user-key + ALB->>API: TLS terminate · forward + API->>Plugin: route /plugins/channels/call + Plugin->>Redis: check fee budget + Plugin->>Redis: persist tx record + Plugin->>SQS: enqueue transaction-request + Plugin-->>Caller: 202 Accepted + tx_id + + rect rgba(200, 220, 255, 0.4) + Note over Plugin,RPC: Async worker pickup (after 202 returns) + Plugin->>Redis: acquire channel account + Plugin->>RPC: build + simulate tx + RPC-->>Plugin: assembled envelope + Plugin->>KMS: sign w/ channel signer + KMS-->>Plugin: signature + Plugin->>KMS: fee-bump w/ fund signer + KMS-->>Plugin: fee-bumped envelope + Plugin->>RPC: submit signed envelope + RPC-->>Plugin: submitted (no hash yet) + Plugin->>SQS: enqueue status-check-stellar + + loop until confirmed or expired + Plugin->>RPC: GET tx by hash + RPC-->>Plugin: pending / confirmed + end + + Plugin->>Redis: update tx record → confirmed + Plugin->>SQS: enqueue notification + end + + Plugin->>Caller: webhook (signed with WEBHOOK_SIGNING_KEY) +``` + +**What Each Stage Costs:** + +| Stage | Latency contributors | +| --- | --- | +| CF Worker auth | KV lookup (~10ms) + sha256 hash | +| ALB to Fargate | TLS termination + intra-VPC hop (~1-5ms) | +| Validation | Redis lookups for fee budget (~1ms each) | +| Channel acquire | Redis distributed lock (~1ms; queue wait if pool exhausted) | +| Build + simulate | Soroban RPC `simulateTransaction` (~50-200ms) | +| Sign + fee-bump | KMS `Sign` × 2 (channel + fund) (~10-50ms each, region-local) | +| Submit | Soroban RPC `sendTransaction` (~10-100ms) | +| Status check | Per-poll RPC call (~10-50ms); ledger settlement adds ~5s base | + +The 202 response is returned synchronously; the rest happens asynchronously via SQS workers. Status is queryable any time via `oz-relayer tx show `. 
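To get a feel for what the async path adds up to, the per-stage ranges in the table can be summed. This is a rough sketch, not a measured figure: `async_budget_ms` is a hypothetical helper, and it assumes a single status poll, no queue wait for a channel, no SQS delivery delay, and the ~5s base settlement time.

```bash
# Rough async-path latency budget in milliseconds, using the table's ranges.
#   $1 = "low" for the lower bounds, anything else for the upper bounds
async_budget_ms() {
  local bound="$1"
  local acquire=1 settle=5000
  local simulate sign submit poll
  if [ "$bound" = "low" ]; then
    simulate=50; sign=10; submit=10; poll=10
  else
    simulate=200; sign=50; submit=100; poll=50
  fi
  # KMS Sign runs twice: once for the channel signer, once for the fund signer.
  echo $(( acquire + simulate + 2 * sign + submit + poll + settle ))
}
```

Under those assumptions the ledger settlement time dominates, so tuning the individual RPC or KMS hops changes the total by only a few percent.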
+ +### Capacity Profile + +OpenZeppelin's reference deployment sustains **~3M transactions per day and growing**, served by **~1,000 relayers** (fund and channel-account entities combined). Two recent windows: + +| Window | Total tx (7d) | Daily avg | Sustained tx/s | Peak day | Peak tx/s | +| --- | --- | --- | --- | --- | --- | +| Apr 28 – May 4 | 19.19M | 2.74M | ~31.7 | May 4 (3.90M) | ~45 | +| **May 5 – May 11** | **20.88M (+8.8% WoW)** | **2.98M** | **~34.5** | **May 8 (3.67M)** | **~42.5** | + +The deployment is **trending up** week over week (+8.8% in the most recent window), and its peak days ran **~23% and ~42% above the 7-day average** in the two windows shown. Plan for headroom: autoscaling minimums should comfortably cover the **peak day**, not the average. + +**Traffic concentration.** In both windows, **~99%+ of all transactions terminate at a small set of high-volume Soroban contracts** registered in `LIMITED_CONTRACTS`. The top contract alone accounts for **73–97% of daily volume** depending on its onchain phase, with the second contributing most of the remainder. Non-limited contracts are below sampling resolution (0.2% or less). This concentration is what `CONTRACT_CAPACITY_RATIO` is sized against. The full env-var tuning is in section 6. 
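The tx/s columns follow directly from the daily totals (86,400 seconds per day). As a minimal sketch for re-deriving them when the table is refreshed, the two helper functions below (names chosen for this example) convert a daily total to tx/s and express a peak day as a percentage above the window's daily average:

```bash
#!/usr/bin/env bash
# Convert daily transaction totals into sustained tx/s and peak headroom.
# Input figures are taken from the capacity table; function names are
# illustrative helpers, not part of any OpenZeppelin tool.

tx_per_sec() {            # $1 = transactions per day
  awk -v d="$1" 'BEGIN { printf "%.1f\n", d / 86400 }'
}

peak_headroom_pct() {     # $1 = peak-day tx, $2 = daily average tx
  awk -v p="$1" -v a="$2" 'BEGIN { printf "%.0f\n", (p / a - 1) * 100 }'
}

echo "May 5-11 sustained tx/s: $(tx_per_sec 2980000)"
echo "May 8 peak tx/s:         $(tx_per_sec 3670000)"
echo "May 8 peak vs average:   +$(peak_headroom_pct 3670000 2980000)%"
echo "May 4 peak vs average:   +$(peak_headroom_pct 3900000 2740000)%"
```

Sizing autoscaling minimums against the peak-day figure rather than the sustained figure is the design choice this section recommends; the helper just makes the gap between the two explicit.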
+ +**The Terraform module defaults are sized for `environment = "prod"` workloads but tuned conservatively.** For reference, here is the actual production configuration OpenZeppelin runs at this scale (sanitized): + +| Resource | Module default (per `relayer-channels-infra`) | OpenZeppelin production (actual resource capacity) | +| --- | --- | --- | +| ECS task CPU | 1024 (1 vCPU) | **8192 (8 vCPU)** | +| ECS task memory | 2048 MiB | **16384 MiB** | +| ECS desired count | 2 | **11** | +| ECS autoscaling min | 2 | **11** | +| ECS autoscaling max | 10 | **25** | +| Container CPU (within task) | 1024 | **6144** (rest reserved for sidecars) | +| Container memory (within task) | 2048 | **9216** | +| Redis node type | `cache.t4g.medium` (non-prod) / `cache.r7g.large` (prod) | `cache.r7g.large` family (multi-AZ failover) | +| Redis pool max size | 500 | **3000** | +| Redis reader pool max size | 1000 | **3000** | +| Max connections (relayer) | 256 | **4000** | +| Rate limit (req/sec) | 100 | **400** | +| Rate limit burst | 300 | **500** | + +The module defaults are operationally fine for a new deployment ramping up; expect to grow into something closer to the production shape as your workload approaches the OpenZeppelin scale. The full env-var tuning is in section 6. + +**Sidecar pattern:** OpenZeppelin runs a `cloudwatch-exporter` sidecar in every task that scrapes `:8081/debug/metrics/scrape` and pushes Prometheus metrics into CloudWatch under namespaces like `RelayerChannelsMainnetTransactions`. The module exposes this as an optional feature via `enable_cloudwatch_exporter` and `cloudwatch_exporter_image`. + + +Traffic figures above are drawn from internal traffic analysis covering Apr 28 – May 11 on the channels-fund mainnet deployment (CloudWatch + per-contract instrumentation). Refresh this section quarterly or whenever a new WoW snapshot is generated. 
+ + +### SQS Queue Topology + +The relayer's distributed processing layer is eight SQS standard queues, each backed by a Dead Letter Queue. Producers, consumers, and DLQ relationships: + +```mermaid +flowchart TD + subgraph Producers["Producers"] + APIReq[API request] + WorkerCb[Worker callback] + Cron[Cron sweep] + Health[Health probe] + end + + subgraph MainQueues["8 SQS main queues — per-queue tuning"] + Q1["transaction-request
vis 300s · max-recv 6"] + Q2["transaction-submission
vis 120s · max-recv 2 ⚠"] + Q3["status-check
vis 300s · max-recv 1000"] + Q4["status-check-evm
vis 300s · max-recv 1000"] + Q5["status-check-stellar
vis 300s · max-recv 1000"] + Q6["notification
vis 180s · max-recv 6"] + Q7["token-swap-request
vis 300s · max-recv 6"] + Q8["relayer-health-check
vis 300s · max-recv 6"] + end + + Workers["ECS Fargate workers
One pool per JobType
Concurrency: BACKGROUND_WORKER_*_CONCURRENCY"] + + DLQs[("8 Dead Letter Queues
7-day retention
Inspect: aws sqs receive-message
Re-drive: aws sqs start-message-move-task")] + + Alarms["CloudWatch Alarms (3 per queue, 24 total)
• <prefix>-<queue>-high-depth (5k or 10k)
• <prefix>-<queue>-dlq-messages (>100)
• <prefix>-<queue>-old-messages (vis × 3)
alarm_actions=[] by default — wire to SNS/PD"] + + Producers --> MainQueues + MainQueues -->|normal consume| Workers + MainQueues -. exceeded max-recv .-> DLQs + Workers -. enqueue follow-up .-> MainQueues + MainQueues -.-> Alarms + DLQs -.-> Alarms +``` + +**Why the Per-Queue Tuning Matters:** + +- `transaction-submission` has `max-recv = 2` because a failed RPC submission should not retry indefinitely (retrying a maybe-submitted tx risks double-spend semantics on the chain). Two attempts, then DLQ for human inspection. +- `status-check-*` queues have `max-recv = 1000` because status polling is *expected* to retry many times before a transaction confirms. Long ledger settlement means many poll attempts. A DLQ entry here means the tx never confirmed despite ~1000 polling rounds. +- Other queues sit at `max-recv = 6`, a reasonable default for retriable transient failures. + +--- + +## 3. Prerequisites + +### Accounts and Access + +- **AWS account** with permissions to create: ECS clusters/services, ECR Public repositories, Application Load Balancers, ACM certificates, Route53 records, ElastiCache replication groups, SQS queues, IAM roles/policies, SSM Parameter Store, CloudWatch Logs + Metrics + Alarms, Lambda functions, EventBridge rules, Amazon Managed Prometheus workspaces. (Cross-account variants are supported via assume-role; see section 5.) +- **Route53 hosted zone** for the domain you want to serve from (for example, `channels.your-company.com`). The zone can live in the same account or in a different one (cross-account assume-role). +- **(Optional) Cloudflare account** with a zone matching your domain. Required if you want the `/gen` API-key flow, per-IP/per-key rate limiting at the edge, and Cloudflare Analytics-Engine-backed usage tracking. +- **GitHub account** with access to the three reference repositories listed below. None of them needs to be forked; you can consume Terraform modules directly via the `source` block. 
+ +### Tooling + +| Tool | Version | Why | +| --- | --- | --- | +| Terraform | ≥ 1.5.0 | Module language constraints | +| AWS provider | < 6.0.0 | Pinned in `versions.tf` | +| Cloudflare provider | ~> 5.0 | Required as a provider even when `enable_cloudflare = false` (Terraform constraint) | +| Docker | recent stable | If building the container image rather than consuming the published one | +| AWS CLI v2 | recent stable | For ECR login, SSM updates, manual debugging | +| Node.js ≥ 18 + pnpm ≥ 10 | recent stable | If you intend to modify the Channels plugin (uncommon; most operators consume the published npm package) | + +### Stellar-Side Prerequisites + +- **Soroban RPC access:** at least two independent providers for mainnet (for example, Stellar Foundation plus a commercial provider). The module does not provision RPC; you configure RPC URLs at the relayer configuration layer. +- **KMS keys for signers:** for production, you will create one AWS KMS key per fund relayer (ED25519 key spec, asymmetric sign). The **fund relayer** is the Stellar account that signs and pays for the **fee-bump envelope** wrapping every channel-account submission: channel signers sign the inner transaction with their own keys, and the fund relayer's fee-bump signature is what commits XLM to confirm the bundle onchain. So every successful submission consumes (a) one channel-signer signature and (b) one fund-relayer signature plus its inclusion fee. Channel-account signers may use the encrypted local keystore pattern (development-only); for production, both should be KMS-backed. See Section 9 for the full security framing. +- **Initial XLM funding:** bootstrapping happens in two explicit steps: + 1. **Fund the fund relayer's Stellar account.** On **mainnet**, this is a manual one-time top-up sent from your treasury or an exchange to the fund relayer's address. On **testnet**, fund it via Friendbot. + 2. 
**Bootstrap channel accounts from that balance.** `oz-channels bootstrap --to N` creates `N` channel accounts and sends `--starting-balance` XLM (default **2 XLM**) to each, drawing from the fund relayer. + + **Sizing the fund-relayer balance:** + + - **Provisioning (one-time):** `2 XLM × N` channel accounts. A 1,000-channel pool requires **at minimum 2,000 XLM** in the fund relayer before `bootstrap` can run. + - **Operating buffer (ongoing):** covers fee bumps for live traffic. At ~34 tx/s sustained (per section 2) and a 100-stroop base fee, a multi-day buffer typically runs in the range of tens to hundreds of XLM depending on congestion-driven fee multipliers. Top up via the balance-monitoring Lambda described in section 7. + +### Reference Repositories + +You will refer to three repositories during deployment: + +| Repo | Role | Visibility | +| --- | --- | --- | +| `OpenZeppelin/relayer-channels-infra` | Primary Terraform modules | Public | +| `OpenZeppelin/openzeppelin-relayer`, `examples/channels-plugin-example` | The example used to build the Docker image | Public | +| `OpenZeppelin/relayer-plugin-channels` | The Channels plugin runtime (TypeScript) | Public | + +--- + +## 4. Environments + +OpenZeppelin's reference deployment uses three environments. We recommend operators mirror this separation: + +| Environment | Stellar network | AWS profile pattern | ECS cluster | Log group | +| --- | --- | --- | --- | --- | +| `prod-mainnet` | Stellar Mainnet | Production AWS account | `relayer-channels-prod-mainnet-cluster` | `/aws/ecs/relayer-channels-prod-mainnet/task` | +| `prod-testnet` | Stellar Testnet | Production AWS account | `relayer-channels-prod-testnet-cluster` | `/aws/ecs/relayer-channels-prod-testnet/task` | +| `stg` | Stellar Testnet | Staging AWS account | `relayer-channels-stg-cluster` | `/aws/ecs/relayer-channels-stg/task` | + +The cluster and log-group naming is auto-derived by the module from `app_name` + `environment`. 
When `environment = "prod"` the resource-name suffix is dropped; for other environments, names are suffixed with `-`. + +### Configuration Shape + +Operators typically maintain a small structured config that maps environment names to their AWS profile, region, ECS cluster, log group, and Stellar Horizon endpoint. This avoids hard-coding values in operational scripts and CI/CD. A reasonable shape: + +```yaml +environments: +  prod-mainnet: +    aws_profile: +    aws_region: us-east-1 +    ecs_cluster: relayer-channels-prod-mainnet-cluster +    log_group: /aws/ecs/relayer-channels-prod-mainnet/task +    horizon: https://horizon.stellar.org +    stellar_network: mainnet +    relayer_id: channels-fund +  prod-testnet: +    aws_profile: +    aws_region: us-east-1 +    ecs_cluster: relayer-channels-prod-testnet-cluster +    log_group: /aws/ecs/relayer-channels-prod-testnet/task +    horizon: https://horizon-testnet.stellar.org +    stellar_network: testnet +    relayer_id: channels-fund +  stg: +    aws_profile: +    aws_region: us-east-1 +    ecs_cluster: relayer-channels-stg-cluster +    log_group: /aws/ecs/relayer-channels-stg/task +    horizon: https://horizon-testnet.stellar.org +    stellar_network: testnet +    relayer_id: channels-fund +``` + +This same shape is consumed by the operator CLIs (`oz-relayer`, `oz-channels`) described in section 7; they read profile config from `~/.config/oz-relayer/config.yaml` and `~/.config/oz-channels/config.yaml`. + +--- + +## 5. Step-by-Step Deployment + +This section walks through the happy-path deployment using the Terraform module in standalone mode. After the first deploy, day-2 operations are described in section 7. + +### Step 5.1: Clone the Repository + +```bash +git clone https://github.com/OpenZeppelin/relayer-channels-infra.git +cd relayer-channels-infra +``` + + +Cloning the repo is useful for exploring the code and contributing. You can also consume the Terraform module directly without cloning, by referencing it via the `source` block from your own Terraform configuration. 
+ + +The repo layout: + +``` +relayer-channels-infra/ +├── main.tf # Root module — instantiates the relayer-channels submodule +├── variables.tf # Root variables +├── outputs.tf # Root outputs +├── versions.tf # Terraform + provider versions +├── terraform.tfvars.example # Annotated example tfvars +└── modules/ + └── relayer-channels/ # The actual module + ├── main.tf + ├── ecs.tf + ├── sqs.tf + ├── redis.tf + ├── cloudflare.tf + ├── dns.tf + ├── lambda.tf + ├── prometheus.tf + ├── worker.mjs # Cloudflare Worker source + ├── relayer_balance.mjs # Balance-check Lambda source + └── restart_ecs_on_alarm.mjs # ECS-restart Lambda source +``` + +### Step 5.2: Configure Terraform Backend + +In `versions.tf`, uncomment and edit the `backend "s3"` block so Terraform state is stored remotely (do not store state on a laptop in production). Example: + +```hcl +terraform { + required_version = ">= 1.5.0" + + backend "s3" { + bucket = "your-org-terraform-state" + key = "relayer-channels/prod-mainnet.tfstate" + region = "us-east-1" + dynamodb_table = "your-org-terraform-locks" + encrypt = true + } +} +``` + +Initialize: + +```bash +terraform init +``` + +### Step 5.3: Create Your tfvars + +Copy the example: + +```bash +cp terraform.tfvars.example terraform.tfvars +``` + +The example file (annotated) covers the full surface. The minimum set required for a first standalone deploy: + +```hcl +aws_region = "us-east-1" +environment = "prod-mainnet" # or "prod-testnet" / "stg" +vpc_id = "vpc-XXXXXXXXXXXXXXXXX" +vpc_cidr = "172.31.0.0/16" +public_subnet_ids = ["subnet-AAA", "subnet-BBB"] # 2+ AZs required + +domain_name = "channels.your-company.com" +route53_zone_name = "your-company.com" # OR route53_zone_id + +# Container image — either OpenZeppelin's published image or your own ECR +container_image = "public.ecr.aws//openzeppelin-relayer-channels:mainnet-1.4.2" # look up via the ECR Public Gallery + +# Secrets — never commit these. 
Set via TF_VAR_ or a secrets-managed CI pipeline +relayer_api_key = "" # required — set via TF_VAR_relayer_api_key +channels_admin_secret = "" # required — set via TF_VAR_channels_admin_secret + +# Stellar +stellar_network = "mainnet" +fund_relayer_id = "channels-fund" +distributed_mode = true # production: true; backed by SQS +log_level = "warn" +``` + +Pass secrets as environment variables to avoid them ever touching the working directory: + +```bash +export TF_VAR_relayer_api_key="$(openssl rand -hex 32)" # ≥ 32 chars; relayer enforces minimum length +export TF_VAR_channels_admin_secret="$(openssl rand -hex 32)" # admin secret for management API +export TF_VAR_webhook_signing_key="$(openssl rand -hex 32)" # if using webhooks +export TF_VAR_storage_encryption_key="$(openssl rand -hex 32)" # for at-rest encryption in Redis +``` + +### Step 5.4: Decide on the Container Image Strategy + +Two options: + +**Option A: Consume OpenZeppelin's Published Image (recommended).** OpenZeppelin publishes pre-built images to ECR Public at `public.ecr.aws//openzeppelin-relayer-channels:` (look up the live alias from the ECR Public Gallery). The image bundles `openzeppelin-relayer` compiled from a pinned `main` revision with the `@openzeppelin/relayer-plugin-channels` package, runs as `nonroot` (UID 65532) on a Wolfi base, and ships with public Stellar RPC endpoints baked in (no secrets, no paid-RPC URLs). It is the same image OpenZeppelin runs in production. + +Tag scheme: + +| Tag pattern | Points at | +| --- | --- | +| `mainnet-` (for example `mainnet-1.4.2`) | Stellar mainnet build, pinned. **Use this in production.** | +| `mainnet-latest` | Most recent mainnet build. Convenient for dev; will move under you. | +| `testnet-` / `testnet-latest` | Stellar testnet equivalents. | + +```hcl +container_image = "public.ecr.aws//openzeppelin-relayer-channels:mainnet-1.4.2" +``` + +Build provenance and SBOM attestations are attached to every push. 
Verify with: + +```bash +docker buildx imagetools inspect \ + public.ecr.aws//openzeppelin-relayer-channels:mainnet-1.4.2 \ + --format '{{ json .Provenance }}' +``` + +The published image is built and pushed by OpenZeppelin's internal CI pipeline (one workflow per network: mainnet and testnet). The pipeline refuses to publish if any private-RPC pattern is detected in the baked `stellar.json`, which is the guardrail that makes the public image safe for downstream operators to consume. You don't need access to that pipeline to deploy; the published image at `public.ecr.aws//openzeppelin-relayer-channels:` is the contract. + +**Option B: Build Your Own Image** from the example and publish to your own ECR. Leave `container_image = ""` in tfvars; the module will create an ECR Public repository for you. Build steps mirror what the OpenZeppelin workflows do: + +```bash +git clone https://github.com/OpenZeppelin/openzeppelin-relayer.git +cd openzeppelin-relayer/examples/channels-plugin-example +# Build the Channels plugin TypeScript wrapper +(cd channel && pnpm install && pnpm run build) +# Build the Docker image +docker build -t my-relayer-channels:latest -f ../../Dockerfile.production . +``` + +You can run this build locally as shown, or wire it into the CI tool of your choice (GitHub Actions, GitLab CI, CircleCI, Buildkite, etc.). The build itself is a standard `docker build`; no OpenZeppelin-specific tooling is required to reproduce it. + +**What the public image baked-in config does NOT include** (you provide at runtime via mounted `config.json` and env vars): + +- Relayer definitions (`relayers[]`) +- Signer keys +- Redis (state and optional queue backend) +- SQS queues (if running in distributed mode) +- IAM roles for KMS / SQS / Secrets Manager / CloudWatch + +To override the baked public-RPC endpoints with paid/private RPCs, mount your own `stellar.json` at `/app/config/networks/stellar.json`. 
The file format mirrors the default: one entry per Stellar network (`mainnet`, `testnet`), each with a `rpc_urls[]` array of `{ url, weight }` objects. The relayer load-balances across the listed URLs by weight and rotates on failures (see the `RPC_*` and `PROVIDER_*` env vars in section 6 for failover tuning). + +After the first `terraform apply` (next step), push to the module-created ECR: + +```bash +aws ecr-public get-login-password --region us-east-1 \ + | docker login --username AWS --password-stdin public.ecr.aws + +# Tag using the ECR URL output by terraform +ECR_URL=$(terraform output -raw ecr_repository_url) +docker tag my-relayer-channels:latest "$ECR_URL:latest" +docker push "$ECR_URL:latest" +``` + +### Step 5.5: Plan and Apply + +```bash +terraform plan -out plan.tfplan +terraform apply plan.tfplan +``` + +Initial apply takes ~15–20 minutes (ElastiCache provisioning is the slowest leg; the ALB and ACM cert validation also take a few minutes each). + +Outputs include: + +| Output | Used for | +| --- | --- | +| `ecs_cluster_name` | Manual ECS Exec, AWS CLI scripting | +| `ecs_service_name` | Service updates, manual restart | +| `ecr_repository_url` | Container image pushes (Option B) | +| `alb_dns_name` | Direct ALB access if Cloudflare is disabled | +| `domain_name` | Public service URL | +| `redis_primary_endpoint`, `redis_reader_endpoint` | Manual Redis CLI access (via bastion) | +| `sqs_queue_urls` | Map of queue names to URLs for direct SQS inspection | +| `prometheus_endpoint` | If AMP enabled, the remote-write endpoint | +| `ssm_parameter_prefix` | Path prefix for SSM secret manipulation | + +### Step 5.6: Verify the Service Is Up + +```bash +# Health check +curl -sS https://channels.your-company.com/api/v1/health +# Expect: 200 OK with body "OK" + +# Readiness — checks Redis, queue, plugin +curl -sS https://channels.your-company.com/api/v1/ready +# Expect: 200 with JSON { "ready": true, "status": "healthy", ... 
} +``` + +If either fails, the same checks are available via CLI or the AWS Console: + +- **ECS service events:** + - CLI: `aws ecs describe-services --cluster --services ` + - Console: **ECS → Clusters → `` → Services → `` → Events**. +- **Task logs:** + - CLI: `aws logs tail "/aws/ecs/-/task" --since 10m --follow` + - Console: **CloudWatch → Log groups → `/aws/ecs/-/task` → Live tail**. +- **Target health (ALB):** + - Console: **EC2 → Target groups → `` → Targets**. Useful when health checks fail but tasks look running. + +The most common first-deploy failures are: secrets not set (relayer panics with `Security error: API_KEY must be at least 32 characters long`), Redis not yet healthy (wait ~5 more minutes), or ACM cert validation lag (verify the Route53 alias was created). + +### Step 5.7: Enable Cloudflare (If Using the `/gen` Flow) + +If you set `enable_cloudflare = true` in step 5.3, the module provisions: + +- A KV namespace named `-api-keys` for storing hashed user API keys +- A Cloudflare Worker named `-gateway` running `worker.mjs` +- A Workers route binding to `${domain_name}/*` so all traffic to your domain transits the Worker +- A 100%-traffic deployment strategy (no Workers-level canary) + +The Worker exposes: + +| Path | Method | Behavior | +| --- | --- | --- | +| `/gen` | GET | Issues a mainnet API key. Salt+hashes with `KEY_SALT`, stores in KV with 365d TTL, returns raw key once | +| `/testnet/gen` | GET | Same for testnet scope | +| `/` (POST) | POST | Proxies to relayer's `/api/v1/plugins/channels/call` (mainnet path) | +| `/testnet` (POST) | POST | Proxies to relayer's `/testnet/api/v1/plugins/channels/call` | +| `/usage/me` | GET | Queries Cloudflare Analytics Engine for the caller's usage | + +The Worker injects authentication headers upstream: the upstream `Bearer` token becomes the static API key (`RELAYER_STATIC_API_KEY`), while the user's original key becomes `x-consumer-key`. 
This is why the ECS module sets `API_KEY_HEADER=x-consumer-key`; the relayer is told to use that header for per-user fee tracking inside the Channels plugin. + +Rate limits (defaults; tunable via tfvars): + +| Variable | Default | Meaning | +| --- | --- | --- | +| `gen_ip_rate_hour` | 2 | Max `/gen` requests per IP per hour (anti-abuse) | +| `relay_rpm_per_key` | 60 | Max relay POSTs per minute per user key | + +The Worker source (`worker.mjs`) is identical between the public module and the internal OpenZeppelin deployment: same KV-backed auth, same header rewrites, same usage tracking via Cloudflare Analytics Engine. The module defaults for the rate limits (`gen_ip_rate_hour=2`, `relay_rpm_per_key=60`) are reasonable conservative starting points; tune to your traffic profile. + +**Worker Auth-Rewrite: What Actually Happens to Headers:** + +This is the non-obvious part of the Worker. Per-caller API keys go in, a single static API key goes out, and the original caller identity is carried in a different header for downstream fee tracking. + +```mermaid +flowchart TD + ClientReq["Client request
Authorization: Bearer <user-key>
POST / (or /testnet)"] + + subgraph Worker["Cloudflare Worker · worker.mjs"] + direction TB + Hash["1. Hash user key
keyHash = sha256(KEY_SALT : user-key)"] + KV{"2. KV lookup
env.API_KEYS.get('key:' + keyHash)"} + Auth401(["401 Unauthorized"]) + Rewrite["3. Rewrite headers
Authorization: Bearer <RELAYER_STATIC_API_KEY>
x-consumer-key: <user-key>"] + Path["4. Rewrite path
POST / → /api/v1/plugins/channels/call
POST /testnet → /testnet/api/v1/plugins/channels/call"] + Track["5. Track usage
Analytics Engine: indexes=keyHash · blobs=path,host · doubles=1"] + end + + Upstream["Upstream — Relayer
• Authorization: Bearer <static-key>
  → authenticates against API_KEY env
• x-consumer-key: <user-key>
  → plugin reads via API_KEY_HEADER
  → drives FEE_LIMIT per-caller tracking"] + + ClientReq --> Hash + Hash --> KV + KV -->|not found / inactive / wrong scope| Auth401 + KV -->|valid| Rewrite + Rewrite --> Path + Path --> Track + Track --> Upstream +``` + +**Operational consequence:** the upstream relayer never sees user-supplied keys directly. A compromised user key only compromises that user's quota; it cannot escalate to relayer-level admin operations because those require the static key (which only the Worker holds). + +### Step 5.8: Bootstrap the Channel-Account Pool + +The module deploys the *infrastructure* but does not provision the *channel accounts* (Stellar entities). For this you use the `oz-channels` CLI from `OpenZeppelin/ops-toolkit`. Install and configure a profile: + +```bash +# Install +git clone https://github.com/OpenZeppelin/ops-toolkit.git +cd ops-toolkit +bun install +cd packages/oz-channels && bun link +cd ../oz-relayer && bun link + +# Create a profile for your environment +oz-channels profile init prod-mainnet +# Prompts for: URL (your channels.your-company.com), API key, plugin ID (channels), +# admin secret, network (mainnet), test account +``` + +Then provision the pool. Start small on testnet to validate, then scale on mainnet: + +```bash +# Preview (no changes) +oz-channels bootstrap --to 200 --dry-run -p prod-mainnet + +# Provision +oz-channels bootstrap --to 200 -p prod-mainnet +``` + +The bootstrap workflow runs three phases: + +1. **Preflight audit** (parallel, configurable concurrency): checks each slot's signer existence, relayer existence, and onchain funding. +2. **Provisioning** (sequential): creates signers and relayers via the relayer's management API; tolerates 409s if records already exist. +3. **Funding** (sequential): submits funding transactions through the fund relayer using a competitive fee from Horizon `/fee_stats`; tolerates `op_already_exists`. 
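Before running the funding phase it helps to confirm the fund relayer holds enough XLM for the slots that still need funding. The sketch below assumes the default `--starting-balance` of 2 XLM, whole-XLM balances, and an already-funded count taken from a prior `--audit` or `--dry-run`; the function name is hypothetical, and the result excludes transaction fees and the operating buffer.

```bash
#!/usr/bin/env bash
# Estimate the XLM the fund relayer must hold before `oz-channels bootstrap --to N`.
# Covers starting balances only; fees and the operating buffer come on top.
# bootstrap_xlm_needed is an illustrative helper, not an oz-channels command.

bootstrap_xlm_needed() {  # $1 = target slot count, $2 = slots already funded, $3 = starting balance in XLM
  local target=$1 funded=${2:-0} balance=${3:-2}
  local missing=$((target - funded))
  (( missing < 0 )) && missing=0   # nothing to fund if the pool already exceeds the target
  echo $((missing * balance))
}

echo "fresh 200-slot pool:   $(bootstrap_xlm_needed 200 0) XLM"
echo "fresh 1,000-slot pool: $(bootstrap_xlm_needed 1000 0) XLM"
echo "top up 150 -> 200:     $(bootstrap_xlm_needed 200 150) XLM"
```

Because funding tolerates `op_already_exists`, re-running bootstrap after a partial failure only needs the balance for the slots still unfunded, which is what the second argument models.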
+ +After all three phases complete, the bootstrap merges the new accounts into the Channels plugin's pool via `setChannelAccounts`. + +**Workflow at a Glance:** + +```mermaid +flowchart TD + Start(["oz-channels bootstrap --to N -p <env>"]) + + P1["Phase 1: Preflight Audit (parallel)
For each slot 1..N:
• signer exists? (relayer API)
• relayer exists? (relayer API)
• on-chain funded? (Horizon)

Concurrency: --concurrency (default 10)
Gap detection across slot sequence"] + + P2["Phase 2: Provisioning (sequential)
For each slot missing signer/relayer:
• Create signer (random keypair)
• Create relayer pointing to signer

Idempotent: 409 Conflict tolerated
Delay between ops: --delay-ms (100ms)"] + + P3["Phase 3: Funding (sequential)
For each unfunded slot:
• GET Horizon /fee_stats for live fee
• Submit funding tx via --funding-relayer
  (default: channels-fund)
• --starting-balance XLM per account
  (default: 2)

Idempotent: op_already_exists tolerated"] + + Final["Final: setChannelAccounts
Merge new IDs into plugin's active pool
via management API.

Pool now ready for func+auth submissions."] + + AuditStop(["--audit stops here"]) + DryStop(["--dry-run stops here"]) + + Start --> P1 + P1 --> AuditStop + P1 --> DryStop + P1 --> P2 + P2 --> P3 + P3 --> Final + + Modes["Mode flags
--dry-run · preview only
--audit · report issues only
--allow-gaps · skip gap-detection
--verbose · per-slot output"] +``` + +**Production sizing reference:** the reference deployment runs ~1,000 relayers total. For a fresh deploy that needs to handle the full reference load, bootstrap several hundred channel accounts. For lower-load deployments, start with 50–100 and scale incrementally. + +Gap-detection guards against accidentally provisioning a sparse pool: + +```bash +# Will error if slots 11–19 don't exist +oz-channels bootstrap --from 20 --to 25 -p prod-mainnet +# Error: Gap detected in slot sequence: 11-19 + +# Override only if intentional +oz-channels bootstrap --from 20 --to 25 --allow-gaps -p prod-mainnet +``` + +### Step 5.9: Verify End-to-End + +Once channels are provisioned and registered, run a smoke test: + +```bash +oz-channels smoke setup -p prod-mainnet # Deploys a smoke contract (testnet only; mainnet uses bundled contract) +oz-channels smoke run -p prod-mainnet # Submits real transactions through the pool +``` + +Tests covered include both `xdr` and `func + auth` modes against the deployed pool. Test failures here indicate misconfiguration between the infra layer (Terraform) and the application layer (relayer config + plugin config). + +--- + +## 6. 
Configuration Reference + +### Module-Managed Container Environment Variables + +These are set inside the ECS task definition by the Terraform module and should not be overridden unless you have a specific reason: + +| Env var | Set to | Source | +| --- | --- | --- | +| `HOST` | `0.0.0.0` | Module | +| `STELLAR_NETWORK` | `var.stellar_network` (`mainnet` or `testnet`) | Module | +| `FUND_RELAYER_ID` | `var.fund_relayer_id` (default `channels-fund`) | Module | +| `API_KEY_HEADER` | `x-consumer-key` | Module: keyed to Cloudflare Worker rewriting | +| `REPOSITORY_STORAGE_TYPE` | `redis` | Module: required for production | +| `RESET_STORAGE_ON_START` | `false` | Module | +| `METRICS_ENABLED` | `true` | Module | +| `METRICS_PORT` | `8081` | Module | +| `LOG_FORMAT` | `json` | Module | +| `LOG_LEVEL` | `var.log_level` (default `warn`) | Module | +| `REDIS_URL` | `redis://:6379` | Module: derived from ElastiCache output | +| `REDIS_READER_URL` | `redis://:6379` | Module: read/write split for ElastiCache | +| `AWS_REGION` | Module-derived | Module | +| `AWS_ACCOUNT_ID` | Module-derived | Module | +| `DISTRIBUTED_MODE` | `var.distributed_mode` (default `true`) | Module | +| `QUEUE_BACKEND` | `sqs` (when distributed) or `memory` | Module | +| `SQS_QUEUE_URL_PREFIX` | `https://sqs..amazonaws.com//-` | Module | + +### Optional Module-Managed Env Vars + +| Env var | Activated by | Purpose | +| --- | --- | --- | +| `ALLOWED_FUND_RELAYER_IDS` | `var.allowed_fund_relayer_ids` non-empty | Per-request fund-relayer override (used by x402 patterns) | + +### Production Reference Values + +For operators targeting OpenZeppelin's reference scale (~3M tx/day, ~1,000 relayers, 11–25 Fargate tasks of 8 vCPU / 16 GB), these are the env-var values OpenZeppelin actually runs in production. Use them as a calibration point; do not blindly copy without sizing your downstream dependencies (Redis, RPC, KMS rate limits) to match. 
+ +```hcl +container_environment = [ + # Worker concurrency — the biggest tuning surface + { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" }, + # Non-Stellar workers parked at 1 (Stellar-only deployment) + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY", value = "1" }, + + # API + plugin concurrency caps + { name = "RELAYER_CONCURRENCY_LIMIT", value = "800" }, + { name = "PLUGIN_MAX_CONCURRENCY", value = "8000" }, # master knob — see below + { name = "MAX_CONNECTIONS", value = "4000" }, + + # Timeouts (production uses longer timeouts than defaults) + { name = "REQUEST_TIMEOUT_SECONDS", value = "60" }, + { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" }, + { name = "PLUGIN_GLOBAL_TIMEOUT_MS", value = "55000" }, + { name = "PLUGIN_POLLING_TIMEOUT_MS", value = "45000" }, + + # Rate limits + { name = "RATE_LIMIT_REQUESTS_PER_SECOND", value = "400" }, + { name = "RATE_LIMIT_BURST", value = "500" }, + + # Fee tracking — 100 XLM (1e9 stroops) per API key per 24h + { name = "FEE_LIMIT", value = "1000000000" }, + { name = "FEE_RESET_PERIOD_SECONDS", value = "86400" }, + + # Redis connection pools — sized for 11-task deployment + { name = "REDIS_POOL_MAX_SIZE", value = "3000" }, + { name = "REDIS_READER_POOL_MAX_SIZE", value = "3000" }, + + # Aggressive cleanup of completed transactions (6 minutes) + { name = "TRANSACTION_EXPIRATION_HOURS", value = "0.1" }, + + # SQS polling tuning + { name = 
"SQS_TRANSACTION_REQUEST_WAIT_TIME_SECONDS", value = "2" }, + { name = "SQS_TRANSACTION_SUBMISSION_WAIT_TIME_SECONDS", value = "2" }, + { name = "SQS_TRANSACTION_REQUEST_POLLER_COUNT", value = "3" }, + { name = "SQS_TRANSACTION_SUBMISSION_POLLER_COUNT", value = "3" }, + + # Contract-level pool isolation (sample contract IDs — substitute your own) + { name = "LIMITED_CONTRACTS", value = "C,C" }, + { name = "CONTRACT_CAPACITY_RATIO", value = "0.6" }, + + # Alternative fund relayer for x402 traffic class (if applicable) + { name = "ALLOWED_FUND_RELAYER_IDS", value = "x402-fund-relayer-id" }, + + { name = "NODE_OPTIONS", value = "--no-deprecation" }, +] +``` + +**`PLUGIN_MAX_CONCURRENCY` is the master knob** for the Channels plugin's worker pool. Per the public-image documentation: it auto-derives most other plugin-internal settings; don't override the derived values. + +### User-Overridable Env Vars + +Anything in `var.container_environment` is merged with the managed list; **user-provided values take precedence**. Common overrides: + +```hcl +container_environment = [ + # Channels plugin-specific + { name = "FEE_LIMIT", value = "100000000" }, # stroops per API key per period + { name = "FEE_RESET_PERIOD_SECONDS", value = "86400" }, # 24h rolling window + { name = "LOCK_TTL_SECONDS", value = "30" }, # channel-account lock TTL (range 3-30) + { name = "MAX_FEE", value = "1000000" }, # max stroops per tx + { name = "LIMITED_CONTRACTS", value = "CABC...,CDEF..." 
}, # contracts with restricted pool access + { name = "CONTRACT_CAPACITY_RATIO", value = "0.2" }, # restrict listed contracts to 20% of pool + + # Worker concurrency overrides (defaults are sane; tune under load) + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "150" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "100" }, + + # RPC failover tuning + { name = "PROVIDER_FAILURE_THRESHOLD", value = "3" }, # consecutive fails before pause + { name = "PROVIDER_PAUSE_DURATION_SECS", value = "60" }, # pause window + { name = "PROVIDER_FAILURE_EXPIRATION_SECS", value = "60" }, # how long failures count + { name = "RPC_TIMEOUT_MS", value = "10000" }, +] +``` + + +**What `LIMITED_CONTRACTS` does.** A small number of Soroban contracts often dominate the channel-account pool's submission queue. In the OpenZeppelin reference deployment, two limited contracts together account for ~99%+ of all transactions, and the top single contract takes 73–97% of daily volume depending on its onchain phase (for example, long-running mining/harvest cycles). Left unmanaged, contracts at this concentration would starve every other contract of channel-account capacity. `LIMITED_CONTRACTS` lists the contract IDs to cap, and `CONTRACT_CAPACITY_RATIO` (between 0 and 1) sets the maximum fraction of the pool those listed contracts may collectively occupy at any one moment. **OpenZeppelin runs `CONTRACT_CAPACITY_RATIO=0.6`** in production; high-traffic contracts can take up to 60% of the pool, leaving 40% reserved for everyone else (which keeps long-tail traffic responsive even under sustained mining-protocol load). 
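As a rough illustration of the cap arithmetic described above, the sketch below splits a pool between the listed contracts and everyone else. The function name `capacity_split` is hypothetical, the 200-channel pool is only an example size, and rounding to the nearest whole channel is an assumption; the plugin's actual rounding behavior is not documented here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: capacity_split POOL_SIZE RATIO
# Prints how many channels the LIMITED_CONTRACTS entries may hold at once
# and how many stay reserved for all other contracts.
# Assumption: the cap rounds to the nearest whole channel.
capacity_split() {
  awk -v pool="$1" -v ratio="$2" 'BEGIN {
    limited = sprintf("%.0f", pool * ratio)   # cap for the listed contracts
    printf "limited=%d reserved=%d\n", limited, pool - limited
  }'
}

# Example: a 200-channel pool with CONTRACT_CAPACITY_RATIO=0.6
capacity_split 200 0.6
```

With a ratio of 0.6, the listed contracts can collectively hold at most 120 of 200 channels at any moment, so 80 remain available to long-tail traffic.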
+ +**How to populate the list.** Identify offenders empirically: watch per-contract channel-account-checkout metrics over hours or days (the `cloudwatch-exporter` sidecar exposes per-contract counters), and add any contract whose share of in-flight submissions consistently exceeds the rest of the population by an order of magnitude. The contracts in OpenZeppelin's list are publicly identifiable, high-throughput Soroban protocols whose normal operation generates millions of submissions per day; your offenders will be specific to your customer mix. The placeholder values shown above (`CABC...`, `CDEF...`) are illustrative; substitute your own. + + +### Module-Managed Secrets (from SSM Parameter Store) + +The module creates SSM `SecureString` parameters and wires them into the ECS task's `secrets` block. They are referenced by ARN, not by value, so secret rotation can be done in-place via `aws ssm put-parameter` without touching Terraform. + +| Container env var | SSM parameter | Required? | +| --- | --- | --- | +| `API_KEY` | `/relayer-api-key` | Yes | +| `PLUGIN_ADMIN_SECRET` | `/channels-admin-secret` | Yes | +| `WEBHOOK_SIGNING_KEY` | `/webhook-signing-key` | Conditional (set when `var.webhook_signing_key` non-empty) | +| `STORAGE_ENCRYPTION_KEY` | `/storage-encryption-key` | Conditional (set when `var.storage_encryption_key` non-empty) | + +The Terraform `lifecycle { ignore_changes = [value] }` on these parameters means: once created, Terraform will not overwrite the SSM value if you rotate it out-of-band via AWS CLI or Console. This is intentional; it lets you rotate secrets without doing a Terraform apply. + +### Plugin Configuration (Channels Plugin Runtime) + +Beyond the env vars above, the Channels plugin reads configuration baked into the Docker image via `config/config.json`. The `examples/channels-plugin-example` directory in `openzeppelin-relayer` is the reference. 
Key sections: + +- `relayers[]`: one entry per relayer ID, including the fund relayer (`channels-fund`) with `concurrent_transactions: true`, and one entry per channel account. The bootstrap workflow creates these via the management API, so the JSON file typically contains only the fund relayer; channels are added dynamically. +- `signers[]`: one signer per relayer. For production, every signer should be `aws_kms` (or `google_cloud_kms` if you're on GCP) with an ED25519 key spec. +- `networks[]`: Stellar network definitions including `rpc_urls`. For production, list two independent providers. +- `notifications[]`: webhook endpoints (signed with `WEBHOOK_SIGNING_KEY`). +- `plugins[]`: the Channels plugin registration; ID is `channels` (matches the API path `/api/v1/plugins/channels/call`). + +The published image ships with an **empty `config.json` stub** (empty `relayers[]`, `signers[]`, and `notifications[]`). Operators must provide their own at runtime by mounting `/app/config/config.json`. Two patterns: + +- **Mount a file:** simplest; commit your `config.json` to a private artifact store and mount via ECS task volume. +- **Render at startup from secrets:** more secure; use an entrypoint wrapper that fetches signer keys from AWS Secrets Manager or Vault and writes `/app/config/config.json` before the relayer starts. + + +**Minimal `config.json` shape:** four top-level arrays. + +- `signers[]`: start with one entry: your fund-relayer's `aws_kms` signer (key ARN, region, key-spec `ECC_NIST_EDWARDS25519`). +- `relayers[]`: one entry for the fund relayer (`id: "channels-fund"`, points at the signer above, `concurrent_transactions: true`). +- `networks[]`: one Stellar network entry with `rpc_urls[]` weights. +- `plugins[]`: one entry registering the Channels plugin (`id: "channels"`, which is what makes the API path `/api/v1/plugins/channels/call` resolve). 
+ +Channel-account signers and relayers are added at runtime by `oz-channels bootstrap` (section 5.8); they don't need to live in `config.json` at image-build time. + + +### Channel-Account Funding Policy + +Per the `oz-channels bootstrap` defaults: 2 XLM per channel at creation. Each channel needs at minimum the Stellar account reserve (1 XLM = 0.5 XLM × 2 entries) to exist onchain; the extra 1 XLM is operational buffer. Tune via `--starting-balance`. + +The fund relayer pays all fee bumps. Its working balance needs to cover sustained traffic × per-bump fee. At ~23 tx/s sustained and 100 stroops base fee, a multi-day buffer is typically tens to hundreds of XLM depending on congestion-driven fee multipliers. + +### Fee-Bump Policy + +Set via the Channels plugin's `MAX_FEE` env var (default `1000000` stroops = 0.1 XLM). This caps the fee any single transaction can spend. Under network congestion, transactions exceeding this cap are rejected by the plugin rather than submitted with a fee too low to confirm. + +For per-fund-relayer overrides (when `ALLOWED_FUND_RELAYER_IDS` is in use), the `relayer-plugin-channels` README documents per-fund-relayer fee overrides including dynamic inclusion fees. + +--- + +## 7. Operational Playbook + +This section describes routine day-2 operations. The ops-toolkit CLIs (`oz-relayer`, `oz-channels`) are the operator-facing interface for most of these. + +### 7.1: Deploys + +The deploy unit is the container image. The Terraform module is "infra-mostly-stable"; you re-apply it only when adding/removing AWS resources or changing module config. + +**Routine deploy** (new container image): + +1. Push the new image to ECR with a versioned tag (for example, `mainnet-1.3.40`). +2. Update `container_image` in tfvars to point at the new tag. +3. 
`terraform apply`: this triggers an ECS service update with `deployment_minimum_healthy_percent = 100` and `deployment_maximum_percent = 200`, so the service stays available throughout (new tasks come up, then old ones go down). + +**ALB health-check semantics during deploy:** the ALB target group uses `path = /api/v1/health` with 5-second deregistration delay, 60-second interval, 30-second timeout. Tasks are added to the target group only after passing health checks; old tasks are drained gracefully. + +**Canary deployments (the pattern OpenZeppelin runs in production):** the `relayer-channels-infra` public module ships a single ECS service. OpenZeppelin's internal production stack runs **two ECS services behind one ALB**: a stable service (`-service-mainnet`) and a canary service (`-service-canary`), both pointing at the same ECS cluster, ElastiCache Redis, and SQS queue prefix. The ALB's HTTPS listener uses `weighted_forward` across two target groups, currently configured `{stable: 100, canary: 0}` with stickiness enabled (600s duration) so a given caller stays on one variant across requests. + +To roll out a new image with canary: + +1. Push the new image to ECR with a versioned tag (for example, `mainnet-1.4.3`). +2. Update the canary service's `container_image` to the new tag and bump its `desired_count` from the "parked" 0 to a small number (for example, 2 tasks) for a ~10% slice. +3. Shift ALB weights from `{stable: 100, canary: 0}` to `{stable: 90, canary: 10}` (or whatever ramp you want). +4. `terraform apply`. +5. Monitor canary-specific metrics for the agreed bake time (the CloudWatch namespace per task is distinct: `RelayerChannelsMainnetCanaryTransactions` vs `RelayerChannelsMainnetTransactions` for stable). +6. Promote: shift weights to `{stable: 0, canary: 100}`, then redeploy stable with the canary's image, then shift weights back to `{stable: 100, canary: 0}` and scale canary back to 0. 
+ +The canary service inherits all the same env vars but with concurrency slightly lower (~150 vs 200 across the worker pools) to limit blast radius if the new image misbehaves. + + +Extending the public `relayer-channels-infra` module to add a canary service is a mechanical addition: copy the existing `ecs_service` block, name it `-canary`, and update the ALB listener's `forward` action to `weighted_forward` with two target groups. The public module doesn't include this today; section 10.10 sketches the key resources and pitfalls. + + +### 7.2: Rollbacks + +To roll back to a previous container image: + +1. Update `container_image` tfvars to the older tag (for example, `mainnet-1.3.39`). +2. `terraform apply`. + +The same blue/green-ish ECS deployment semantics apply: rollback proceeds with `min_healthy = 100`, so the service stays available. + +### 7.3: Scaling Fargate + +The module sets autoscaling bounds via: + +| Variable | Production default | +| --- | --- | +| `desired_count` | 2 | +| `autoscaling_min_capacity` | 2 (defaults to `desired_count`) | +| `autoscaling_max_capacity` | 10 | + +The autoscaling policy details are inherited from the upstream `terraform-aws-modules/ecs/aws//modules/service` module (CPU-based by default). To raise the ceiling under sustained load: + +```hcl +desired_count = 4 +autoscaling_min_capacity = 4 +autoscaling_max_capacity = 20 +``` + +`terraform apply` applies the change without interruption. + +### 7.4: Channel-Pool Management + +The pool grows as your traffic grows. 
Adding to the pool is non-destructive and idempotent: + +```bash +# Existing pool is 1..200; add slots 201..400 +oz-channels bootstrap --from 201 --to 400 -p prod-mainnet + +# List current channels in the plugin's pool +oz-channels channels list -p prod-mainnet + +# Manually adjust (with caution; protected profiles prompt for confirmation) +oz-channels channels add channel-0050 -p prod-mainnet +oz-channels channels remove channel-0050 -p prod-mainnet +``` + +`oz-channels channels set` replaces the entire registered list (destructive); use `add`/`remove` for incremental changes. + +### 7.5: Per-API-Key Fee Budget Management + +The Channels plugin tracks per-API-key fee consumption when `FEE_LIMIT` and `FEE_RESET_PERIOD_SECONDS` are set. To inspect or adjust: + +```bash +oz-channels fee usage -p prod-mainnet # see consumption +oz-channels fee limit -p prod-mainnet # see configured limit +oz-channels fee set-limit 5000000000 -p prod-mainnet # custom limit (stroops) +oz-channels fee delete-limit -p prod-mainnet # remove custom limit +``` + +This is the surface you will expose to per-customer billing reconciliation: when a customer claims they hit limits, this CLI gives you the source of truth. + +### 7.6: Monitoring Queues + +The Terraform module creates three CloudWatch alarms per queue: + +| Alarm | Threshold | Period | +| --- | --- | --- | +| `--high-depth` | 10,000 messages (status-check queues) or 5,000 (others) | 2 consecutive 5-min periods | +| `--dlq-messages` | 100 messages in DLQ | 1 × 5-min period | +| `--old-messages` | `visibility_timeout × 3` (oldest message age) | 1 × 5-min period | + +By default, `alarm_actions = []`; you must wire these alarms to an SNS topic or PagerDuty integration as a post-deploy operator step. The alarm names follow the `-` pattern so a single SNS subscription on those alarm names captures all queue health alerts. + +### 7.7: Monitoring Redis + +ElastiCache emits standard CloudWatch metrics. 
Key signals: + +| Metric | Watch for | +| --- | --- | +| `EngineCPUUtilization` | Spikes above 75% sustained | +| `DatabaseMemoryUsagePercentage` | Climb past 70%; capacity headroom for spikes | +| `ReplicationLag` | > 1s sustained (multi-cluster failover scenarios) | +| `CurrConnections` | Near `maxclients` (default 65000) | + +The module emits Redis slow-log and engine-log to `/aws/elasticache/-redis`. Tail with `aws logs tail`. + +### 7.8: Inspecting Transactions and Relayer State + +The `oz-relayer` CLI is the per-transaction inspection surface: + +```bash +# Transaction details +oz-relayer tx show -r channels-fund -p prod-mainnet --json + +# List recent transactions by status +oz-relayer tx list -r channels-fund --status pending -p prod-mainnet + +# Relayer-level state (balance, sequence) +oz-relayer relayer status channels-fund -p prod-mainnet +oz-relayer relayer balance channels-fund -p prod-mainnet + +# Cancel a pending transaction +oz-relayer tx cancel -r channels-fund -p prod-mainnet +``` + +For programmatic access, every command accepts `--json` for stable, parseable output. + +### 7.9: Post-Restart Checklist + +If you ever restart with `RESET_STORAGE_ON_START=true` (which wipes Redis), you need to redo the following (the service will be up but non-functional until these are done): + +1. **Re-create the signer:** call the `/api/v1/signers` endpoint with your KMS key config +2. **Re-create the fund relayer:** via the relayer API using the new signer ID +3. **Re-run the RPC override:** the PATCH to `/api/v1/networks/stellar:mainnet` with your private providers +4. **Re-bootstrap channels:** `oz-channels bootstrap --to -p ` +5. **Fund the fund relayer:** if the onchain account was recreated, send XLM to the new address + +Normal restarts and redeployments (without `RESET_STORAGE_ON_START=true`) preserve everything in Redis; none of the above is needed. 
+ +### 7.10: Optional Lambda Monitors + +Two opt-in Lambda functions are provided by the module: + +**Balance-check Lambda** (`var.enable_balance_check_lambda = true`): +- EventBridge-scheduled (`balance_check_schedule`, default `rate(5 minutes)`) +- Polls the relayer API for fund/channel balances; emits CloudWatch metrics +- Source: `relayer_balance.mjs` in the module + +**ECS restart-on-alarm Lambda** (`var.enable_restart_on_alarm_lambda = true`): +- Subscribes to CloudWatch alarms (you wire which alarms it listens to) +- On alarm `OK → ALARM` transition, forces an ECS service `update-service --force-new-deployment` to rebuild tasks +- Use sparingly; a flapping alarm can cause restart loops. Source: `restart_ecs_on_alarm.mjs` + +--- + +## 8. Debugging Guide + +When a transaction fails, times out, or behaves unexpectedly, locate the failure in the request lifecycle before pulling logs. + +Every transaction follows two paths. The synchronous path covers everything from the client request through auth, fee-budget check, and SQS enqueue, and returns a `tx_id` as the 202 acknowledgment. The async path covers everything after: channel acquisition, transaction build and simulation, channel signing, fund-account fee-bump, RPC submission, and status polling to confirmation. Match the symptom to the path before opening CloudWatch. + +If the request never returned a `tx_id`, the failure is in the synchronous path. Check ECS service events, ALB target health, and the relayer logs for the inbound request. If the request returned a `tx_id` but the transaction never confirmed, the failure is in the async path. Start with `oz-relayer tx show ` to get the transaction's current state, then trace from there. + +Pool exhaustion, sequence drift, and an RPC throttle can all present as "transactions are failing" from the outside; each lives in a different layer and has a different fix. 
+ +**Failure Taxonomy** + +| Symptom | Layer | First action | +| --- | --- | --- | +| No `tx_id` returned | Synchronous: auth, fee budget, or enqueue | Check ECS service events and ALB target health; tail relayer logs for the inbound request | +| `tx_id` returned, never confirmed | Async: channel acquire, build, sign, submit, or poll | `oz-relayer tx show -r channels-fund --json -p ` | +| `POOL_CAPACITY` errors | Channel pool exhausted | §10.1; then bootstrap more channels | +| `INSUFFICIENT_FEE` / stuck at `submitted` | Fee ceiling below network floor | §10.2; raise `MAX_FEE` | +| `TRY_AGAIN_LATER` in logs | Horizon throttle or per-channel saturation | §10.3; check fund account balance and RPC provider health | +| `provider paused` in logs | RPC failover triggered | §8.4; query each RPC provider's health endpoint | +| Sequence errors / `LOCKED_CONFLICT` | Redis sequence counter drift or lock contention | §8.7; inspect the affected channel's Redis key | +| DLQ accumulation | Repeated worker failures | §10.5; inspect DLQ messages for the root error | + +The debugging workflow correlates data across three sources: the relayer API, CloudWatch logs, and the Stellar Horizon API. + +### 8.1: Pick Your Entry Point + +| You have | Start with | +| --- | --- | +| Transaction ID (UUID) | `oz-relayer tx show -r channels-fund --json -p ` | +| Error message | Search logs for the error pattern | +| Time window | Tail logs for that period | +| Stellar tx hash | Query Horizon, then work backwards to the relayer's tx record | +| "What's failing right now" | Run the error-aggregation workflow (Section 8.5) | + +### 8.2: Transaction-ID to Request-ID Correlation + +Relayer logs are JSON with a `span` field containing correlation identifiers: + +```json +{ + "timestamp": "2026-01-28T16:17:45.595556Z", + "level": "DEBUG", + "fields": { "message": "..." 
}, + "target": "openzeppelin_relayer::domain::transaction::stellar::submit", + "span": { + "job_type": "TransactionRequest", + "relayer_id": "channels-fund", + "request_id": "Some(\"req-1769617060067-9\")", + "tx_id": "f4a33a34-6a0f-491a-8654-821efcea35ec" + } +} +``` + +The workflow: + +1. Get the transaction record: `oz-relayer tx show -r channels-fund --json -p `: gives you `created_at` (for log time-window) and the full state machine history. +2. Filter logs by `tx_id`: + + ```bash + aws logs filter-log-events \ + --log-group-name "/aws/ecs/-/task" \ + --start-time $(date -u -v-30M +%s000) \ + --filter-pattern "" + ``` + +3. Extract `request_id` from any matching log span. +4. Filter logs by `request_id` for the complete cross-task flow (a single request can hop between tasks via SQS). + +**Correlation Flow at a Glance:** + +```mermaid +flowchart TD + A1["Entry: have a tx-id
oz-relayer tx show <tx-id>
-r channels-fund --json -p <env>

Yields: status, created_at, sent_at,
confirmed_at, hash, sequence_number, fee"] + + A2["Entry: have an error / time window
aws logs filter-log-events
--filter-pattern '<error>'
--start-time <epoch-ms>

Yields: JSON log entries"] + + Log["CloudWatch log entry (JSON)
timestamp · level · target (Rust module)
span: { job_type, relayer_id, request_id, tx_id }"] + + Req["Filter by request_id
aws logs filter-log-events
--filter-pattern '<request_id>'

→ full cross-task flow, sorted by timestamp"] + + Stage["Stage map · match span.target
::transaction_request_handler — API → SQS pickup
::stellar::prepare — build + simulate
::transaction_counter_redis — sequence mgmt
::stellar::submit — RPC submission
::status_check_handler — confirmation polling
::notification_handler — webhook delivery"] + + Horizon["Cross-check on-chain (Horizon)
GET /transactions/<hash>
GET /accounts/<source-addr> | jq .sequence
GET /fee_stats | jq '{ledger_capacity_usage, fee_charged, max_fee}'"] + + Root["Common root causes
• RPC degradation — 'provider paused' in logs
• Fee uncompetitive — fee_stats.max_fee > MAX_FEE
• Sequence drift — counter < on-chain sequence
• Pool exhaustion — POOL_CAPACITY
• Lock conflict — LOCKED_CONFLICT
• Horizon rate limit — TRY_AGAIN_LATER
• Tx expired — time_bounds.max < now"] + + A1 --> Log + A2 --> Log + Log -->|extract span.request_id| Req + Req --> Stage + Stage --> Horizon + Horizon --> Root +``` + +### 8.3: Key Log Targets + +When grepping logs, these targets point at specific stages: + +| Log target | Stage | +| --- | --- | +| `openzeppelin_relayer::domain::transaction::stellar::prepare` | Transaction preparation (build + simulate) | +| `openzeppelin_relayer::domain::transaction::stellar::submit` | Transaction submission to Soroban RPC | +| `openzeppelin_relayer::repositories::transaction_counter::transaction_counter_redis` | Sequence counter updates | +| `openzeppelin_relayer::jobs::handlers::transaction_request_handler` | Job pickup from SQS | + +### 8.4: Common Log Patterns to Search + +| Pattern | Indicates | +| --- | --- | +| `provider paused` | RPC failover triggered; one or more providers degraded | +| `error`, `failed`, `timeout` | Generic failure terms | +| `sequence`, `counter` | Sequence-number drift or contention | +| `max_fee.*insufficient` | Fee bump below network minimum during congestion | +| `POOL_CAPACITY` | Channel-account pool exhausted | +| `LOCKED_CONFLICT` | Two workers attempted to acquire the same channel | +| `TRY_AGAIN_LATER` | Horizon-side throttling | + +### 8.5: Error-Aggregation Workflow + +When the question is "what's failing right now": + +1. Query the last hour of logs filtering on error severity: + + ```bash + aws logs filter-log-events \ + --log-group-name "/aws/ecs/-/task" \ + --start-time $(date -u -v-1H +%s000) \ + --filter-pattern '{ $.level = "ERROR" }' + ``` + +2. Categorize errors by stage (fee / sequence / timeout / RPC / signer). +3. Count by transaction status: are these tx that never submitted, submitted but didn't confirm, or confirmed-but-the-caller-saw-an-error? +4. Identify temporal clustering; bursts often correlate with provider events or deploys. + +### 8.6: Stuck-Transaction Workflow + +For a transaction that never confirms: + +1. 
Check tx status: `oz-relayer tx show ... --json`. If `submitted` with a hash, the relayer believes it sent. +2. Check onchain state: `curl https://horizon.stellar.org/transactions/`: does Horizon see it? +3. Compare fee competitiveness: + + ```bash + curl -s https://horizon.stellar.org/fee_stats | jq '{ledger_capacity_usage, fee_charged, max_fee}' + ``` + + If your fee bump is below `max_fee.mode`, the transaction may sit in the mempool until expiry. + +4. Look for `TRY_AGAIN_LATER` or `provider paused` near the submission timestamp. +5. Verify the transaction's `time_bounds.max` hasn't passed. + +### 8.7: Redis Sequence-Counter Inspection + +The relayer maintains per-account sequence counters in Redis under the key pattern: + +``` +relayer:transaction_counter:channels-fund: +``` + +To inspect (requires a bastion or AWS Session Manager into a task with Redis access; ElastiCache is VPC-scoped): + +```bash +# From inside a task or bastion with VPC access +redis-cli -h --tls +> GET relayer:transaction_counter:channels-fund:GCABCDEF... +``` + +If the relayer's counter is behind the onchain sequence (for example, after a restart or Redis snapshot restore), the next submission will fail with a sequence-conflict error. The relayer has self-healing for this via the `RelayerHealthCheck` job type, but acute drift may need a manual restart of the affected task. + +### 8.8: When to Escalate to ECS Exec + +The Fargate tasks have `enable_execute_command = true`. You can shell into a running task: + +```bash +aws ecs execute-command \ + --cluster \ + --task \ + --container \ + --interactive \ + --command "/bin/sh" +``` + +Use sparingly; production tasks should not need this for routine operations. Common legitimate uses: capturing a network trace during a suspected RPC issue, inspecting in-process state when logs don't tell a clear story. + +--- + +## 9. Security Model + +### 9.1: Secrets Handling + +All secrets are stored as `SecureString` in AWS SSM Parameter Store. 
The ECS task references them by ARN in the task definition's `secrets` block; ECS injects them into container environment variables at task start. No secret ever appears in: + +- The container image +- Terraform state (only ARNs) +- ECS task definition JSON +- CloudWatch logs (unless your application logs them (which the relayer does not)) + +The Terraform `lifecycle { ignore_changes = [value] }` on SSM parameters means: once provisioned, you can rotate secrets directly via `aws ssm put-parameter --overwrite` without involving Terraform. + +**Rotation procedure for `API_KEY`:** + +```bash +aws ssm put-parameter \ + --name "/-/relayer-api-key" \ + --value "$(openssl rand -hex 32)" \ + --type SecureString \ + --overwrite +``` + +Then force a task restart so the new value is picked up: + +```bash +aws ecs update-service \ + --cluster \ + --service \ + --force-new-deployment +``` + +### 9.2: Network Isolation + +- **ALB ingress:** when Cloudflare is enabled, the ALB security group is restricted to Cloudflare's published IP ranges (the module pulls these from Cloudflare's API). Public ingress to the ALB directly is blocked. When Cloudflare is disabled, you must explicitly populate `alb_allowed_ipv4_cidrs`; by default, an empty allow-list means `0.0.0.0/0` is allowed. +- **ALB egress:** restricted to the VPC CIDR (only the ECS service can be reached). +- **ECS task egress:** allowed to `0.0.0.0/0` (the task needs to reach Soroban RPC, Horizon, AWS APIs, and any webhook destinations). +- **ECS task ingress:** restricted to the ALB security group on the container port; metrics port (8081) is `self` only (sidecar containers only). +- **Redis:** security group allows ingress on port 6379 from the VPC CIDR only. In-transit encryption is enabled (`transit_encryption_enabled = true`, mode `preferred`). +- **SQS:** queues are private by default. The ECS task IAM role has scoped access (`SendMessage`, `ReceiveMessage`, `DeleteMessage`, etc.) on resources matching `-*`. 
+ +### 9.3: IAM Least-Privilege + +The ECS task IAM role is scoped to: + +- `ssm:GetParameter*` on `arn:aws:ssm:::parameter/-/*` +- `sqs:*` on the relayer's queue ARN pattern +- `logs:CreateLogStream`, `logs:PutLogEvents`, `logs:CreateLogGroup` on the relayer's log group +- `cloudwatch:PutMetricData` (unscoped; CloudWatch metrics namespacing happens application-side) +- `aps:RemoteWrite` on the AMP workspace (when Prometheus is enabled) +- `ssmmessages:*` for ECS Exec + +The ECS execution IAM role additionally has: + +- `ssm:GetParameters` on the same SSM prefix (used to inject secrets at task start) + +No `kms:*` permissions are granted to the task role by this module; KMS access for signers is configured at the relayer-config layer per signer (operator provisions a KMS key per fund relayer or per channel-account signer). + +**OpenZeppelin's production pattern for KMS access:** rather than granting `kms:Sign` and `kms:GetPublicKey` permissions through the ECS task IAM role, the production deployment attaches the task role's principal to the **KMS key's resource policy**. This puts authorization on the key (and thus auditable in the key's CloudTrail data plane) rather than on the role. Either pattern works; the resource-policy approach scales better when you have many keys (one per fund relayer or per channel signer at scale). + +```hcl +# Sketch — attach the ECS task role to a KMS key's resource policy +resource "aws_kms_key_policy" "signer_key" { + key_id = aws_kms_key.signer.id + policy = jsonencode({ + Statement = [ + { + Sid = "AllowRelayerTaskRoleSign" + Effect = "Allow" + Principal = { + AWS = [module.relayer_channels.ecs_task_role_arn] + } + Action = ["kms:Sign", "kms:GetPublicKey", "kms:DescribeKey"] + Resource = "*" + } + ] + }) +} +``` + +### 9.4: TLS Posture + +- **ALB:** TLS 1.3 policy (`ELBSecurityPolicy-TLS13-1-2-2021-06`), HTTPS on 443, HTTP redirects to HTTPS with 301. 
+- **Redis:** in-transit encryption enabled, mode `preferred` (clients may connect over TLS or not; operator can tighten to `required` if all clients support TLS). +- **Cloudflare to ALB:** the **edge certificate** (client-facing TLS) is issued automatically by Cloudflare the moment a DNS record for your domain is created in the zone; there is no manual cert-provisioning step and no Terraform plumbing required for it (Universal SSL handles this). For end-to-end TLS between Cloudflare and your ALB, set the **zone SSL mode** to "Full (strict)" so Cloudflare validates the ALB's ACM cert on the back-half. The SSL-mode setting is independent of edge-cert issuance and is configured either via the Cloudflare dashboard or under Terraform control via the Cloudflare provider (`cloudflare_zone_settings_override`). + +### 9.5: Cloudflare-Side Auth Pattern + +The Worker (`worker.mjs`) does two distinct things to inbound requests: + +1. **User-key validation:** caller-provided `Authorization: Bearer ` is hashed (`SHA-256(KEY_SALT:user-key)`) and looked up in KV. Match required, scope (mainnet/testnet) enforced. +2. **Upstream injection:** the request is rewritten before forwarding to the ALB: `Authorization` becomes `Bearer `, and a new `x-consumer-key: ` header is added. + +This means: +- The upstream relayer always sees the static API key (one secret to manage). +- The relayer's Channels plugin uses `x-consumer-key` for per-caller fee tracking (configured via `API_KEY_HEADER=x-consumer-key`). +- A user key being compromised does not compromise the upstream relayer's auth; only that user's quota. + +**Key salt rotation:** `KEY_SALT` is the salt mixed into the hash. Rotating it invalidates all existing user keys (they hash to a different value). Plan key-salt rotations as a forced re-issue event; communicate ahead, run `/gen` again, and accept downtime for callers who don't re-fetch. 
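To make the salt-rotation consequence concrete, here is a minimal sketch of a salted lookup hash. It assumes the Worker hashes the literal string `KEY_SALT:user-key` (salt, a colon, then the key) and stores the hex digest; the exact encoding inside `worker.mjs` may differ. The function name `hash_user_key` and the sample salt and key values are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: hash_user_key SALT USER_KEY
# Prints the hex SHA-256 digest of "SALT:USER_KEY".
# Assumption: this mirrors the SHA-256(KEY_SALT:user-key) notation above.
hash_user_key() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s:%s' "$1" "$2" | sha256sum | awk '{print $1}'
  else
    # macOS and BSD systems ship shasum instead of sha256sum
    printf '%s:%s' "$1" "$2" | shasum -a 256 | awk '{print $1}'
  fi
}

old_hash=$(hash_user_key "salt-v1" "example-user-key")
new_hash=$(hash_user_key "salt-v2" "example-user-key")

# The same user key hashes to a different value under a new salt,
# so the lookup against entries stored under the old salt misses.
if [ "$old_hash" != "$new_hash" ]; then
  echo "salt rotation changes the lookup hash; existing keys must be re-issued"
fi
```

Because the stored entries are keyed by the old digest, no caller-side change can recover an existing key after rotation, which is why rotation has to be planned as a re-issue event.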
+ +### 9.6: KMS for Stellar Signers + +Channel-account and fund-relayer signers should use AWS KMS for production. Per signer: + +- **KMS key spec:** `ECC_NIST_EDWARDS25519`: the AWS KMS asymmetric-sign Ed25519 curve. This is what Stellar requires. Supported signing algorithms on this KeySpec: `ED25519_SHA_512` and `ED25519_PH_SHA_512` (pre-hashed variant). For comparison, EVM signers use `ECC_SECG_P256K1` (secp256k1). +- **IAM grants on the KMS key:** `kms:Sign` and `kms:GetPublicKey` to the ECS task role's principal. +- **CloudTrail Data Access logging** should be enabled on the key for compliance; every signature is then auditable with caller IAM principal, timestamp, request ID, and outcome. + +For rotation, follow the side-by-side procedure: provision a new KMS key, register a new signer/relayer ID, fund the new onchain account, drain the old, retire. On Stellar the onchain account address is derived from the signer's public key, so rotation always means a new account. + +### 9.7: What Is NOT in This Module + +- KMS keys for signers (operator provisions per signer) +- VPC, subnets, NAT gateway (operator provides; the module attaches to your existing VPC) +- Bastion / Session Manager access to Redis (operator's choice of access pattern) +- WAF rules on the ALB (operator's choice; module provides ingress IP restriction only) +- Multi-region replication of the deployment (single-region only) + +--- + +## 10. Key Gotchas + +### 10.1: Channel-Account Exhaustion (`POOL_CAPACITY`) + +**Symptom:** API responses contain `POOL_CAPACITY` errors; logs show requests waiting on channel acquisition. + +**Root cause:** the channel pool is too small for current concurrent in-flight transaction count. + +**Sizing formula:** + +``` +min_pool = ceil(target_TPS × avg_settlement_seconds × safety_factor) +``` + +For Stellar with ~5s settlement, `safety_factor = 1.5–2.0`. At 23 TPS sustained, `23 × 5 × 1.5 = 173` channels minimum. 
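The sizing formula above can be evaluated with a small shell helper. This is a sketch only: the function name `min_pool` is hypothetical, and the inputs are the example figures from this section rather than measured values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: min_pool TARGET_TPS AVG_SETTLEMENT_SECONDS SAFETY_FACTOR
# Prints ceil(target_TPS * avg_settlement_seconds * safety_factor).
min_pool() {
  awk -v tps="$1" -v settle="$2" -v safety="$3" 'BEGIN {
    raw = tps * settle * safety
    n = int(raw)
    if (n < raw) n++   # round up to a whole channel
    print n
  }'
}

min_pool 23 5 1.5   # 172.5 rounds up to 173
min_pool 23 5 2.0   # 230
```

Running both ends of the safety-factor range gives a floor of 173 and a more conservative target of 230 channels for 23 TPS sustained.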
+ +**Recovery:** `oz-channels bootstrap --from --to ` adds capacity incrementally. + +**Prevention:** alert on pool utilization above 80%, not just on `POOL_CAPACITY` errors. + +### 10.2: Fee-Bump Tuning Under Congestion + +**Symptom:** transactions submitted successfully but never confirm; Horizon `fee_stats` shows `max_fee.mode` above your configured `MAX_FEE`. + +**Root cause:** the Stellar fee market shifted above your static fee ceiling, so submissions are stuck at `INSUFFICIENT_FEE` (or sit unconfirmed until they expire). + + +**Channels Fee Policy: Read This Before Tuning.** + +The Channels plugin uses **static fee values for both limited and non-limited contracts** (the single `MAX_FEE` setting applies to everything). On `INSUFFICIENT_FEE`, the plugin **does not dynamically bump the fee**; it simply resubmits the transaction at the same fee until it confirms or expires. + +This is a deliberate policy. Because channels absorbs the inclusion fee on behalf of callers, automatic fee-bumping on `INSUFFICIENT_FEE` would cause the service's own in-flight transactions to compete against each other on price, dragging the effective fee floor up for the whole pool and turning every congestion spike into a self-inflicted fee-escalation spiral. Static fee + resubmit keeps the price floor an operator-controlled knob rather than a market-driven one, particularly important for a service that is free at the API boundary. + + +**Recovery:** raise `MAX_FEE` in the container env vars and re-deploy. Range to consider: `1,000,000` (0.1 XLM, default) up to `10,000,000` (1 XLM) for sustained congestion. The change applies uniformly across limited and non-limited contracts; there is no per-class fee override. + +**Prevention:** + +- Alert on transaction confirmation lag exceeding your SLA; sustained lag is the leading indicator that the market has overtaken `MAX_FEE`. +- Periodically diff Horizon `/fee_stats.max_fee.mode` against your configured `MAX_FEE`. 
If the market mode sits above your ceiling for more than a few minutes, in-flight submissions are likely stuck. +- Treat `MAX_FEE` as a control-plane setting, not a per-transaction parameter. Re-evaluate it during congestion events and after any sustained mainnet-wide fee shifts. + +### 10.3: `TRY_AGAIN_LATER` from RPC (Provider Throttle and Per-Channel Saturation) + +**Symptom:** `TRY_AGAIN_LATER` errors from RPC, or `provider paused` log entries. + +**Root cause:** two distinct origins, only one of which is a provider-side rate limit. + +1. **Provider throttling:** your RPC provider's per-key rate limit is being hit. The Terraform module does not configure RPC URLs (that lives in the relayer config inside the Docker image), so the recovery lever is at the relayer-config layer, not the infra layer. +2. **Per-channel saturation:** Stellar RPC also returns `TRY_AGAIN_LATER` when a single channel account (one relayer) has more in-flight transactions than the network or provider will accept on its behalf. This has been observed both during channel-account creation/scaling and during steady-state high-throughput broadcasting. + + +**Plugin auto-mitigation.** On `TRY_AGAIN_LATER`, the Channels plugin pulls a different *idle* channel from the pool (stack-based selection) and resubmits the transaction on that one. In steady state this manifests as a brief retry, not a user-visible failure; but if the entire pool is saturated and no idle channel is available, the error surfaces to the caller. + + +**Recovery:** + +- *Provider throttling:* add a secondary provider via `custom_rpc_urls` per-relayer or `rpc_urls` per-network. Use weights to load-balance. +- *Pool saturation:* bootstrap more channel accounts via `oz-channels bootstrap --to `. The plugin spreads load across the wider pool automatically. + +**Prevention:** + +- Always run at least two independent RPC providers for mainnet. Negotiate rate limits at peak load against your projected throughput. 
+- Size the channel-account pool with headroom; if peak in-flight transaction count routinely exceeds ~70% of pool size, you are a burst away from saturation and the per-channel path will start surfacing to callers. + +### 10.4: Redis Sizing Under Burst + +**Symptom:** `EngineCPUUtilization` or `DatabaseMemoryUsagePercentage` near 100%; queue depths climbing. + +**Root cause:** the `cache.r7g.large` default may be undersized for sustained 23 TPS with the full ~1000-relayer pool. + +**Recovery:** scale up: `redis_node_type = "cache.r7g.xlarge"` or larger; `terraform apply`. ElastiCache supports online resize but expect a brief failover during the operation if `redis_num_cache_clusters > 1`. + +**Prevention:** alert when memory usage exceeds 70%; baseline CPU during expected peak before sizing. + +### 10.5: SQS DLQ Accumulation + +**Symptom:** the `--dlq-messages` CloudWatch alarm fires. + +**Root cause:** a class of messages is being received more than `max_receive_count` times (6 for most queues; 2 for `transaction-submission`; 1000 for status-check variants). The 2-receive limit on `transaction-submission` is intentional; submission failures should not retry many times. + +**Recovery:** +- Inspect DLQ messages: `aws sqs receive-message --queue-url --max-number-of-messages 10`. +- The most common DLQ pattern is `transaction-submission` failures from a sustained RPC outage. Investigate root cause; if it's transient, you can re-drive from DLQ back to the main queue: + ```bash + aws sqs start-message-move-task --source-arn --destination-arn + ``` +- For persistent failures, delete the messages and root-cause the underlying issue. + +**Prevention:** wire the DLQ alarm to a high-priority pager; DLQ growth is rarely transient. + +### 10.6: Cloudflare KV Rate Limits + +**Symptom:** users reporting `/gen` failures or `Forbidden Access` errors at low rates. + +**Root cause:** the Worker enforces `gen_ip_rate_hour` (default 2) on the `/gen` endpoint. 
A user behind a shared NAT may share an IP with other API-key generators. + +**Recovery:** raise `gen_ip_rate_hour` cautiously. Higher values reduce the friction for legitimate users but reduce anti-abuse effectiveness. + +**Prevention:** if you front the service with multiple IP egress points (for example, a corporate VPN), set IP-aware rate limits in Cloudflare's WAF rather than in `worker.mjs`. + +### 10.7: `API_KEY_HEADER` Mismatch + +**Symptom:** users report `Unauthorized` from the relayer even though Cloudflare accepts their key. + +**Root cause:** if you have overridden `API_KEY_HEADER` to something other than `x-consumer-key`, the Cloudflare Worker (which still sends `x-consumer-key`) and the relayer's plugin (looking for whatever `API_KEY_HEADER` says) are mismatched. + +**Recovery:** either leave `API_KEY_HEADER` at the module default (`x-consumer-key`) or update the Worker code in lockstep. + +### 10.8: Bootstrap Gap Detection + +**Symptom:** `oz-channels bootstrap --from 20 --to 25` errors with "Gap detected in slot sequence: 11-19". + +**Root cause:** the bootstrap guards against accidentally provisioning a sparse pool, which can complicate ops. + +**Recovery:** if the gap is intentional, pass `--allow-gaps`. Otherwise fill the gap first. + +### 10.9: Cloudflare Worker Deployment Is 100% Strategy + +**Symptom:** none, but be aware. + +**Root cause:** the Terraform module deploys the Worker via `cloudflare_workers_deployment` with `strategy = "percentage"` and `versions = [{ percentage = 100, version_id = ... }]`. There is no canary at the Cloudflare-Worker level. + +**Recovery:** if you want gradual Worker rollout, deploy two `cloudflare_worker_version` resources and adjust the `percentage` field over time. Doing this from Terraform alone is awkward; typically operators do this via Wrangler CLI for Workers and treat the Terraform-managed version as the production reference. 
+ +### 10.10: Canary Deployment at the ECS Layer + +**The public `relayer-channels-infra` Terraform module is a single ECS service with autoscaling.** OpenZeppelin's internal production stack adds a second ECS service for canary traffic; that pattern is documented in section 7.1 of this guide. Recap: + +- Two ECS services in one cluster: `-service-mainnet` (stable) and `-service-canary`. Both share Redis, SQS, secrets, and IAM. +- One ALB with HTTPS listener using `weighted_forward` across two target groups. Default split `{stable: 100, canary: 0}`; canary parked at `desired_count = 0`. +- Stickiness enabled (600s) so a caller stays on one variant per session. +- Promote/rollback by adjusting target-group weights and image tags. + +**Building this on top of the public module:** copy the existing `ecs_service` resource and parameterize the variant name (for example, add a `variant` input that defaults to `mainnet` and accepts `canary`); declare a second target group bound to the canary service; and change the ALB listener's `forward` action to `weighted_forward` across the two target groups, with stickiness enabled. Park the canary at `desired_count = 0` initially; raise it when promoting. Until the public module ships canary natively, this is operator-side composition. + +**Pitfalls to plan around:** + +- **Shared Redis means state mixes.** A bad canary that corrupts state in Redis affects all callers. Keep canary images conservative; only promote *fully validated* image candidates to canary, not first-pass builds. +- **Stickiness blinds you to small-percentage canary issues.** If your canary takes 5% of traffic but 95% of users get stuck to the stable variant, you only learn from 5% of caller behavior. Decide canary bake-times accordingly; at 10% canary, plan a 6-hour minimum bake; at 25%, a 2-hour minimum. 
+- **Auto-restart Lambda and canary is dangerous.** If you wire the optional ECS-restart-on-alarm Lambda (section 7.10) to canary alarms, a flapping canary will force restarts that mask the underlying issue. Either disable the Lambda for canary alarms or set very conservative alarm thresholds.
+
+---
+
+## 11. Appendix
+
+### Reference Repositories
+
+| Repo | Purpose |
+| --- | --- |
+| `OpenZeppelin/relayer-channels-infra` | Terraform modules: the deployment unit |
+| `OpenZeppelin/openzeppelin-relayer`, `examples/channels-plugin-example` | Source of the Docker image that runs in Fargate |
+| `OpenZeppelin/relayer-plugin-channels` | The Channels plugin runtime (TypeScript) |
+
+### Reference: SQS Queue Tuning Summary
+
+| Queue | Visibility timeout | Max receives | Purpose |
+| --- | --- | --- | --- |
+| `transaction-request` | 300s | 6 | Initial transaction requests |
+| `transaction-submission` | 120s | 2 | Submission to Stellar RPC |
+| `status-check` | 300s | 1000 | Generic status polling |
+| `status-check-evm` | 300s | 1000 | EVM status polling (unused for Stellar-only deployments) |
+| `status-check-stellar` | 300s | 1000 | Stellar status polling |
+| `notification` | 180s | 6 | Webhook delivery |
+| `token-swap-request` | 300s | 6 | Token swap processing (Solana-specific; unused for Stellar) |
+| `relayer-health-check` | 300s | 6 | Periodic relayer-state probes |
+
+The `max_receive_count = 1000` on status-check queues reflects that status polling is expected to retry many times before a transaction confirms; the `max_receive_count = 2` on submission queues reflects that submission failures should not retry indefinitely. These values are baked into `sqs.tf` and not exposed as Terraform variables; change them by modifying the module.
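To get a feel for what the queue settings imply, the sketch below multiplies each queue's visibility timeout by its maximum receive count. It assumes the simplest case: every failed attempt holds the message for the full visibility timeout and workers never shorten or extend it. Under that assumption the product approximates how long a message can keep retrying before it moves to the DLQ. The relayer may adjust visibility per message, so treat these figures as rough orientation rather than measured behavior. `QUEUES` and `retry_window_seconds` are hypothetical names.

```python
# (visibility timeout in seconds, max receives), copied from the table above.
QUEUES = {
    "transaction-request": (300, 6),
    "transaction-submission": (120, 2),
    "status-check": (300, 1000),
    "status-check-evm": (300, 1000),
    "status-check-stellar": (300, 1000),
    "notification": (180, 6),
    "token-swap-request": (300, 6),
    "relayer-health-check": (300, 6),
}


def retry_window_seconds(visibility_timeout_s: int, max_receives: int) -> int:
    """Approximate time a message can spend retrying before the DLQ.

    Assumption: each receive holds the message for the full visibility
    timeout and the message is never deleted.
    """
    return visibility_timeout_s * max_receives


for name, (timeout_s, max_receives) in QUEUES.items():
    window = retry_window_seconds(timeout_s, max_receives)
    print(f"{name}: {window}s (~{window / 3600:.1f}h)")
```

Under this assumption `transaction-submission` gives up after about 4 minutes, while the status-check queues can keep polling for days, which matches the intent described above.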
+
+### Reference: Module Outputs
+
+| Output | Description |
+| --- | --- |
+| `ecs_cluster_name` | Name of the ECS cluster |
+| `ecs_cluster_arn` | ARN of the ECS cluster |
+| `ecs_service_name` | Name of the ECS service |
+| `ecr_repository_name` | Name of the ECR repository (if created by the module) |
+| `ecr_repository_url` | URL of the ECR repository |
+| `alb_dns_name` | DNS name of the Application Load Balancer |
+| `domain_name` | The configured domain name for the service |
+| `acm_certificate_arn` | ARN of the ACM certificate in use |
+| `redis_primary_endpoint` | Primary endpoint for Redis writes |
+| `redis_reader_endpoint` | Reader endpoint for Redis reads |
+| `sqs_queue_urls` | Map of queue name to URL for all 8 SQS queues |
+| `prometheus_workspace_id` | ID of the Amazon Managed Prometheus workspace (if enabled) |
+| `prometheus_endpoint` | Remote-write endpoint for the Prometheus workspace |
+| `ssm_parameter_prefix` | SSM Parameter Store path prefix used by the module |
+| `cloudflare_worker_name` | Name of the Cloudflare Worker (if Cloudflare is enabled) |
+
+### Reference: Environment-Based Defaults
+
+Variables whose defaults differ by the `environment` input variable value:
+
+| Variable | `prod` | All other environments |
+| --- | --- | --- |
+| `desired_count` | 2 | 1 |
+| `autoscaling_max_capacity` | 10 | 4 |
+| `redis_node_type` | `cache.r7g.large` | `cache.t4g.medium` |
+| `redis_num_cache_clusters` | 2 | 1 |
+| `alb_deletion_protection` | `true` | `false` |
+| `log_retention_days` | 30 | 7 |
+| `task_log_retention_days` | 365 | 7 |
+| Resource name suffix | _(none)_ | `-` |
+
+### Reference: Conditional Resource Creation
+
+Resources provisioned only when specific inputs are set:
+
+| Condition | Resources created or skipped |
+| --- | --- |
+| `container_image = ""` | Creates an ECR repository; otherwise uses the supplied image URI |
+| `acm_certificate_arn = ""` | Requests a new ACM certificate via DNS validation; otherwise uses the provided ARN |
+| `enable_cloudflare = true` | Deploys a Cloudflare Worker and wires the ALB behind Cloudflare |
+| `enable_cloudflare = false` | ALB is exposed directly without Cloudflare proxy |
+| `enable_balance_check_lambda = true` | Deploys the fund-relayer balance check Lambda and its CloudWatch alarm |
+| `enable_restart_on_alarm_lambda = true` | Deploys the ECS restart Lambda triggered by CloudWatch alarms |
+| `enable_cloudwatch_exporter = true` | Deploys the CloudWatch metrics exporter sidecar |
+| `enable_prometheus = true` | Creates an Amazon Managed Prometheus workspace and configures the exporter |
+| `webhook_signing_key != ""` | Stores the key in SSM and enables webhook HMAC signature verification |
+| `storage_encryption_key != ""` | Stores the key in SSM and enables Redis data encryption at rest |
+| `alb_access_logs_bucket != ""` | Enables ALB access logging to the specified S3 bucket |
+
+---
+
+## Support and Feedback
+
+For questions on this guide, deployment issues, or improvements to the reference repositories, contact OpenZeppelin engineering through your established channel.
Public-facing community channels: + +- OpenZeppelin Forum +- Issues on the relevant reference repository (Terraform module: `relayer-channels-infra`; plugin: `relayer-plugin-channels`; relayer core: `openzeppelin-relayer`) diff --git a/src/navigation/stellar.json b/src/navigation/stellar.json index 3b2e93d7..8b64502c 100644 --- a/src/navigation/stellar.json +++ b/src/navigation/stellar.json @@ -517,6 +517,11 @@ "name": "Stellar X402 Facilitator Guide", "url": "/relayer/1.5.x/guides/stellar-x402-facilitator-guide" }, + { + "type": "page", + "name": "AWS Operator Deployment Guide", + "url": "/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide" + }, { "type": "page", "name": "GCP Operator Deployment Guide", From 49abd2ea0acbd2fb0d6ee56edba52fde1800c95e Mon Sep 17 00:00:00 2001 From: stevep0z <255929980+stevep0z@users.noreply.github.com> Date: Fri, 26 Jun 2026 19:30:53 -0500 Subject: [PATCH 3/4] chore: add root-level Stellar operator guides and update navigation --- .../stellar-relayer-aws-operator-guide.mdx | 48 +- .../stellar-relayer-gcp-operator-guide.mdx | 898 ++++----- .../stellar-relayer-aws-operator-guide.mdx | 1680 +++++++++++++++++ .../stellar-relayer-gcp-operator-guide.mdx | 1004 ++++++++++ src/navigation/stellar.json | 4 +- 5 files changed, 3111 insertions(+), 523 deletions(-) create mode 100644 content/relayer/guides/stellar-relayer-aws-operator-guide.mdx create mode 100644 content/relayer/guides/stellar-relayer-gcp-operator-guide.mdx diff --git a/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx b/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx index 51934d36..568488e7 100644 --- a/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx +++ b/content/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide.mdx @@ -672,17 +672,32 @@ flowchart TD ### Step 5.8: Bootstrap the Channel-Account Pool -The module deploys the *infrastructure* but does not provision the *channel accounts* (Stellar entities). 
For this you use the `oz-channels` CLI from `OpenZeppelin/ops-toolkit`. Install and configure a profile: +The module deploys the *infrastructure* but does not provision the *channel accounts* (Stellar entities). For this you use the `oz-channels` CLI included in the `cli/` directory of this repo. + +Install the CLI: ```bash -# Install -git clone https://github.com/OpenZeppelin/ops-toolkit.git -cd ops-toolkit +# From the root of this repo +cd cli bun install +bun run build + +# Link the CLIs globally cd packages/oz-channels && bun link cd ../oz-relayer && bun link -# Create a profile for your environment +# Verify +oz-channels --help +oz-relayer --help +``` + + +Requires the [Bun](https://bun.sh) runtime (Node.js 22+ compatible). + + +Set up a profile for your environment: + +```bash oz-channels profile init prod-mainnet # Prompts for: URL (your channels.your-company.com), API key, plugin ID (channels), # admin secret, network (mainnet), test account @@ -735,6 +750,27 @@ flowchart TD **Production sizing reference:** the reference deployment runs ~1,000 relayers total. For a fresh deploy that needs to handle the full reference load, bootstrap several hundred channel accounts. For lower-load deployments, start with 50–100 and scale incrementally. +#### Scaling Beyond ~100 Channels + +When scaling the pool aggressively (e.g. 100 → 1000 channels), `oz-channels bootstrap` will start failing with `TRY_AGAIN_LATER` or `tx_bad_seq` errors from Horizon. This happens because every `createAccount` operation uses the fund relayer (`channels-fund`) as the transaction source, serializing all submissions on a single sequence number. Under high concurrency, Horizon rejects the overlapping submissions. + +Use `scripts/fund-new-channels.ts` instead; it routes the transaction source through an existing funded channel account (e.g. `channel-0001`) while keeping the fund relayer as the operation source (so the treasury still pays). 
It also batches up to 100 `createAccount` ops per transaction, so a 100→1000 scale-up fits in ~9 submissions. + +```bash +npx tsx scripts/fund-new-channels.ts \ + --env mainnet \ + --api-key \ + --source-relayer channel-0001 \ + --fund-relayer channels-fund \ + --from 101 --to 1000 \ + --starting-balance 2 \ + --report fund-report.json +``` + +The script is idempotent; it preflights every slot via the relayer API and Horizon, skipping any account already funded onchain. Safe to re-run. + +#### Gap Detection + Gap-detection guards against accidentally provisioning a sparse pool: ```bash @@ -940,7 +976,7 @@ For per-fund-relayer overrides (when `ALLOWED_FUND_RELAYER_IDS` is in use), the ## 7. Operational Playbook -This section describes routine day-2 operations. The ops-toolkit CLIs (`oz-relayer`, `oz-channels`) are the operator-facing interface for most of these. +This section describes routine day-2 operations. The `oz-relayer` and `oz-channels` CLIs (in the `cli/` directory of this repo) are the operator-facing interface for most of these. ### 7.1: Deploys diff --git a/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx b/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx index 76ee00e0..0d99784a 100644 --- a/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx +++ b/content/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide.mdx @@ -2,58 +2,17 @@ title: 'Hosted Stellar Relayer on GCP: Operator Deployment Guide' --- -A step-by-step guide for infrastructure teams running a hosted Stellar relayer service on Google Cloud Platform. +This guide covers deploying and operating the Stellar Relayer Channels service on GCP. The infrastructure runs on Cloud Run backed by Memorystore Redis, Pub/Sub for distributed job processing, and Cloud KMS for transaction signing, with optional Cloudflare Workers for API-key management and per-user rate limiting. 
-**Who this is for:** infrastructure operators who have run production GCP workloads but are new to OpenZeppelin's relayer stack. +Work through the deployment steps in order; each step produces configuration or keys that later steps depend on. For the AWS deployment, see the [AWS Operator Deployment Guide](/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide). -**What you get:** a hosted Stellar Channels service in your own GCP project, sized to serve the same workload OpenZeppelin runs today (roughly 2M+ transactions per day across about 2,500 relayers). - -## 1. Overview - -OpenZeppelin runs a hosted Stellar relayer service at `channels.openzeppelin.com` (mainnet) and `channels.openzeppelin.com/testnet` (testnet). The service takes on the hard parts of submitting Stellar transactions in parallel: managing a pool of channel accounts, fee bumping, arbitrating sequence numbers, and failing over between RPC providers. Downstream callers just talk to a simple HTTP API. - -This guide shows you how to run that same service in your own GCP project. - -### What You End Up With - -By the end of this guide you will have: - -- A production-ready hosted Stellar Channels service in your own GCP project, served from a domain you control (for example, `channels.your-company.com`). -- A Cloud Run compute tier with autoscaling, sitting behind an External HTTPS Load Balancer with a Google-managed SSL certificate. -- Memorystore Redis for state and deferred-job scheduling. In production this runs as STANDARD_HA with automatic failover. -- Eight Pub/Sub topics and subscriptions that handle the distributed transaction-processing pipeline (when `queue_backend = "pubsub"`). -- An optional Cloudflare Worker in front of the load balancer for self-serve API-key issuance (the `/gen` flow), per-user rate limiting, and usage analytics. -- A Secret Manager entry for every secret. Secrets are injected as environment variables when the container starts. 
-- Cloud KMS for ED25519 transaction signing. The module provisions a keyring and an asymmetric signing key. -- An Artifact Registry remote repository configured to proxy the public ECR image, giving Cloud Run a GCP-native pull path. -- Optional Cloud Functions for fund-relayer balance monitoring. - -The service handles two transaction-submission modes: - -- **Signed XDR mode:** the caller signs a complete Stellar transaction envelope and submits it. The service only fee-bumps and submits. -- **Soroban `func` + `auth` mode:** the caller submits a Soroban host function plus authorization entries. The service assembles the transaction, simulates it, signs with a channel account, fee-bumps, and submits. - -### What This Guide Assumes You Already Have - -- A strong GCP background: VPC, Cloud Run, IAM, Cloud DNS, Memorystore, Pub/Sub. -- Terraform fluency (1.5.0 or later). -- A target GCP project where you can create the full resource set. -- A domain you control. DNS can live in Route53, Cloud DNS, or another provider. -- Optionally, a Cloudflare account if you want the `/gen` API-key gateway. - -## 1.5 How Channels Works on Stellar - -Every Stellar transaction has a source account with a monotonically increasing sequence number. Only one transaction per source account can be in-flight at a time. This is the constraint that limits parallel throughput on Stellar. - -The Channels service works around it with a pool of dedicated source accounts: the channel accounts. Each in-flight transaction acquires one channel account from the pool, uses its sequence number, and releases it after confirmation. The pool size determines how many transactions can run in parallel. - -The fund account is a separate Stellar account that holds the XLM balance. When the service submits a transaction, it wraps the channel-signed envelope in a fee-bump transaction, a Stellar primitive that lets a second account pay the network fee. Both accounts are backed by Cloud KMS ED25519 keys. 
+--- -The pool size you provision in Step 5.10 is your throughput ceiling. See §10.1 for the sizing formula before you bootstrap. +## 1. Architecture -## 2. Architecture +The service connects several GCP-managed components into a single transaction processing pipeline. Understanding this layout helps with capacity planning and narrows the search space when diagnosing failures. Most operational issues map to one specific layer. -### Cloud Architecture +### 1.1. Cloud Architecture ```mermaid flowchart TD @@ -95,25 +54,19 @@ flowchart TD GAR -.->|image pull| CloudRun ``` -The whole stack above is provisioned by the `gcp` Terraform module in `OpenZeppelin/relayer-channels-infra`. You consume it either by cloning the repo or by referencing it as an external module from your own Terraform. - -### Components - | Component | GCP Service | Purpose | | --- | --- | --- | | Edge gateway | Cloudflare Worker + KV (optional) | API-key issuance, rate limiting, usage tracking | -| Load balancer | External HTTPS LB + Google-managed cert | TLS termination, HTTPS-only, health-checked routing | -| Compute | Cloud Run v2 Service | Runs the relayer container with autoscaling | +| Load balancer | External HTTPS LB + Google-managed cert | TLS termination, health-checked routing | +| Compute | Cloud Run v2 | Runs the relayer container with autoscaling | | State | Memorystore Redis 7.2 | Transaction records, sequence counters, distributed locks | -| Queue | 8 Pub/Sub topics + 8 subscriptions | Distributed transaction processing pipeline | +| Queue | 8 Pub/Sub topics + subscriptions | Distributed transaction processing | | Secrets | Secret Manager | API keys, admin secrets, encryption keys | | Signing | Cloud KMS (EC_SIGN_ED25519) | Transaction signing for fund + channel accounts | | Image registry | Artifact Registry (remote repo) | Proxies ECR Public image for Cloud Run | -| Observability | Cloud Logging + Cloud Monitoring | Application logs, metrics | | Networking | VPC + VPC Connector + 
Private Service Access | Private connectivity to Memorystore | -| Optional monitors | Cloud Functions + Cloud Scheduler | Balance-check function | -### App Architecture (Channels Plugin Runtime) +### 1.2. App Architecture (Channels Plugin Runtime) ```mermaid flowchart TD @@ -144,7 +97,7 @@ flowchart TD Pipeline -->|submit + poll| Stellar ``` -### Transaction Lifecycle +### 1.3. Transaction Lifecycle ```mermaid sequenceDiagram @@ -160,8 +113,8 @@ sequenceDiagram participant RPC as Soroban RPC Caller->>CF: POST / · Bearer user-key - CF->>CF: hash + KV lookup
+ scope check - CF->>LB: rewrite Bearer→static-key
set x-consumer-key=user-key + CF->>CF: hash + KV lookup + scope check + CF->>LB: rewrite Bearer→static-key, set x-consumer-key LB->>API: TLS terminate · forward API->>Plugin: route /plugins/channels/call Plugin->>Redis: check fee budget @@ -170,7 +123,7 @@ sequenceDiagram Plugin-->>Caller: 202 Accepted + tx_id rect rgba(200, 220, 255, 0.4) - Note over Plugin,RPC: Async worker pickup (after 202 returns) + Note over Plugin,RPC: Async worker pickup Plugin->>Redis: acquire channel account Plugin->>RPC: build + simulate tx RPC-->>Plugin: assembled envelope @@ -179,7 +132,6 @@ sequenceDiagram Plugin->>KMS: fee-bump w/ fund signer KMS-->>Plugin: fee-bumped envelope Plugin->>RPC: submit signed envelope - RPC-->>Plugin: submitted (no hash yet) Plugin->>PS: publish status-check-stellar loop until confirmed or expired @@ -191,9 +143,9 @@ sequenceDiagram end ``` -### Pub/Sub Queue Topology +### 1.4. How Pub/Sub Queues Work -The relayer's distributed processing layer uses eight Pub/Sub topics with pull subscriptions. The Pub/Sub backend handles retries through Redis sorted sets (a store-and-run-when-due pattern), so there are no dead-letter topics. +Eight topics with pull subscriptions handle the transaction pipeline. Pub/Sub has no native delayed delivery, so deferred jobs (retries with backoff) sit in Redis sorted sets until due, then get published to the topic. The topic only ever carries ready-to-process jobs; no dead-letter topics needed. ```mermaid flowchart TD @@ -223,123 +175,124 @@ flowchart TD DeferredQ -. publish when due .-> Topics ``` -**Deferred job pattern:** Pub/Sub has no native delayed delivery, so deferred jobs (retries with backoff) are stored in Redis sorted sets keyed by their due time. A due-sweep worker runs every 1 to 5 seconds per queue type, claims due jobs from Redis, and publishes them to the topic. The topic only ever carries jobs that are already due. +### 1.5. 
How Channels Works on Stellar + +Every Stellar transaction has a source account with a monotonically increasing sequence number. Only one transaction per source account can be in-flight at a time; this is the constraint that caps parallel throughput on Stellar. + +The Channels service works around it with a pool of dedicated source accounts (channel accounts). Each in-flight transaction acquires one channel account from the pool, uses its sequence number, and releases it after confirmation. A separate fund account holds the XLM balance. When submitting, the service wraps the channel-signed envelope in a fee-bump transaction, a Stellar primitive that lets a second account pay the network fee. Both accounts are backed by Cloud KMS ED25519 keys. -### Capacity Profile +The pool size you provision in [section 4.10](#410-bootstrap-channels) is your throughput ceiling. See [section 12.1](#121-channel-pool-exhaustion) for the sizing formula. -The reference deployment OpenZeppelin runs handles a growing load of about 3M transactions per day, served by roughly 1,000 relayers (fund and channel-account entities combined). The module defaults are sized conservatively for new deployments. Expect to grow into something closer to the production shape as your workload scales. +### 1.6. Resource Sizing + +Module defaults work for getting started. Operators are advised to bump them as traffic grows. | Resource | Module default (prod) | Current GCP deployment | | --- | --- | --- | -| CPU | 1 vCPU | **4 vCPU** | -| Memory | 2 Gi | **8 Gi** | -| Min instances | 2 | **3** | -| Max instances | 10 | **20** | +| CPU | 1 vCPU | 4 vCPU | +| Memory | 2 Gi | 8 Gi | +| Min instances | 2 | 3 | +| Max instances | 10 | 20 | | Redis tier | STANDARD_HA | STANDARD_HA | | Redis memory | 5 GB | 5 GB | -The module defaults work fine for a new deployment that is ramping up. The GCP deployment was raised above defaults to handle concurrent transaction stress testing. Tune further as your workload grows. 
+The module auto-adjusts sizing by environment (`prod` vs everything else): + +| Setting | prod | other | +|---------|------|-------| +| Min instances | 2 | 1 | +| Max instances | 10 | 4 | +| CPU always allocated | yes | no | +| Redis tier | STANDARD_HA | BASIC | +| Redis memory | 5 GB | 1 GB | +| LB deletion protection | on | off | +| Log retention | 30 days | 7 days | --- -## 3. Prerequisites +## 2. Prerequisites -GCP access, tooling, and Stellar-side accounts must be in place before you run `terraform apply`. +Gather everything in this section before running `terraform apply`. Missing any item will block either the initial deployment or the post-deploy bootstrap steps. -### Accounts and Access +### 2.1. Accounts and Access -- A **GCP project** with billing enabled and permission to create Cloud Run services, Memorystore instances, Pub/Sub topics and subscriptions, Secret Manager secrets, Cloud KMS keyrings and keys, Compute Engine load balancers, VPC connectors, Artifact Registry repositories, and IAM role bindings. -- A **service account** for Terraform with these roles: - - `roles/editor` for general resource creation - - `roles/resourcemanager.projectIamAdmin` to grant IAM roles to service accounts - - `roles/compute.networkAdmin` for VPC peering used by Private Service Access - - `roles/cloudkms.admin` to create KMS keyrings and keys - - `roles/pubsub.admin` to create topics and subscriptions and set IAM policies - - `roles/secretmanager.admin` to create secrets and set IAM policies - - `roles/run.admin` to manage Cloud Run services - - `roles/artifactregistry.admin` to create repositories and set IAM policies -- A **domain** you control, with access to create DNS records (Route53, Cloud DNS, or another provider). -- Optionally, a **Cloudflare account** with a zone matching your domain, if you want the `/gen` API-key gateway. 
+- **GCP project** with billing enabled and permission to create Cloud Run services, Memorystore instances, Pub/Sub topics and subscriptions, Secret Manager secrets, Cloud KMS keyrings and keys, Compute Engine load balancers, VPC connectors, Artifact Registry repositories, and IAM role bindings. +- **Service account** for Terraform with these roles: `editor`, `resourcemanager.projectIamAdmin`, `compute.networkAdmin`, `cloudkms.admin`, `pubsub.admin`, `secretmanager.admin`, `run.admin`, `artifactregistry.admin` +- **Domain** with DNS access (Route53, Cloud DNS, or other) +- (Optional) **Cloudflare account** for the `/gen` API-key flow -### Tooling +### 2.2. Tooling | Tool | Version | Why | | --- | --- | --- | | Terraform | 1.5.0 or later | Module language constraints | | Google provider | 5.0 or later, below 7.0 | Pinned in `versions.tf` | -| Cloudflare provider | ~> 5.0 | Required even when `enable_cloudflare = false` (a Terraform constraint) | +| Cloudflare provider | ~> 5.0 | Required even when `enable_cloudflare = false` | | gcloud CLI | recent stable | Auth, Artifact Registry, debugging | | Node.js 18+ and pnpm 10+ | recent stable | Only if you modify the Channels plugin | -### Stellar-Side Prerequisites +### 2.3. Stellar-Side Prerequisites -- **Soroban RPC access:** for mainnet, use at least two independent private providers from different infrastructure operators (QuickNode and Ankr are the providers OpenZeppelin uses). "Independent" means different node operators, not different API wrappers on the same underlying node. The public image ships with a public RPC endpoint by default; override it with private providers after deployment (see Step 5.8). -- **Initial XLM funding:** each Stellar account requires a minimum base reserve of 1 XLM. For 200 channel accounts plus the fund account, budget at least 250 XLM before transaction fees. Fund the fund relayer's Stellar account first — `oz-channels bootstrap` draws channel account balances from it. 
+- **Soroban RPC access:** at least two independent private providers from different operators are recommended for mainnet. The public image ships with the default public RPC; you override it after deployment (see [section 4.8](#48-override-rpc-endpoints)). +- **XLM** to fund the relayer's Stellar account and bootstrap channel accounts. Budget at least 250 XLM for 200 channel accounts plus the fund account. -### Reference Repositories +### 2.4. Repos You'll Reference -| Repo | Role | Visibility | -| --- | --- | --- | -| `OpenZeppelin/relayer-channels-infra` | Terraform modules and operator CLIs (`oz-relayer`, `oz-channels`) | Public | -| `OpenZeppelin/openzeppelin-relayer` | The relayer application | Public | -| `OpenZeppelin/relayer-plugin-channels` | The Channels plugin runtime (TypeScript) | Public | +| Repo | What it is | +| --- | --- | +| `OpenZeppelin/relayer-channels-infra` | This repo: Terraform modules + operator CLIs | +| `OpenZeppelin/openzeppelin-relayer` | The relayer application | +| `OpenZeppelin/relayer-plugin-channels` | Channels plugin (TypeScript) | --- -## 4. Environments +## 3. Environments -We recommend running separate environments with isolated state: +Run stg and prod as separate Terraform workspaces with isolated state: -| Environment | Stellar network | GCP project pattern | Cloud Run service | Pub/Sub prefix | +| Env | Network | Working directory | Pub/Sub prefix | VPC connector CIDR | | --- | --- | --- | --- | --- | -| `prod` | Stellar Mainnet | Production project | `relayer-channels-service` | `relayer-mainnet-prod-` | -| `stg` | Stellar Testnet | Same or separate project | `relayer-channels-stg-service` | `relayer-testnet-stg-` | - -The module derives service naming from `app_name` plus `environment`. When `environment = "prod"`, the resource-name suffix is dropped. For other environments, names are suffixed with `-`.
+| `stg` | testnet | `examples/gcp/` | `relayer-testnet-stg-` | `10.8.0.0/28` | +| `prod` | mainnet | `examples/gcp-prod/` | `relayer-mainnet-prod-` | `10.9.0.0/28` | -Each environment gets its own: - -- Terraform state (use separate GCS backend prefixes). -- Terraform working directory (`examples/gcp/` for stg, `examples/gcp-prod/` for prod). -- VPC connector CIDR range (for example `10.8.0.0/28` for stg and `10.9.0.0/28` for prod if they share a VPC). -- Secret Manager secrets, KMS keys, and Pub/Sub topics. -- Cloudflare Worker, if enabled, with distinct names like `relayer-channels-stg-gcp-gateway`. +Use different CIDRs if both environments share a VPC. Resource names auto-suffix with `-` except for `prod`. --- -## 5. Step-by-Step Deployment +## 4. Deployment -Full provisioning sequence from authentication through end-to-end verification. Steps 5.1–5.4 set up credentials and configuration; 5.5–5.6 set up the container image and apply infrastructure; 5.7–5.11 wire up DNS, RPC endpoints, signers, and channel accounts. +Work through the steps below in order on a fresh deployment. Each step produces output or configuration that later steps depend on. -### Step 5.1: Set Up Authentication +### 4.1. Authenticate ```bash export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json" ``` -If your GCP org blocks `gcloud auth application-default login`, use a service account key file instead (IAM & Admin > Service Accounts > Keys > Create new key > JSON). +If your org blocks `gcloud auth application-default login`, create a service account key in IAM & Admin > Service Accounts > Keys. -### Step 5.2: Get the Module +### 4.2. Get the Module -**Option A, reference as an external module (recommended):** +Reference it directly from GitHub: ```hcl module "relayer_channels" { source = "git::https://github.com/OpenZeppelin/relayer-channels-infra.git//modules/gcp?ref=main" - # ... variables + # ... 
} ``` -**Option B, clone the repo:** +Or clone and use the examples: ```bash git clone https://github.com/OpenZeppelin/relayer-channels-infra.git -cd relayer-channels-infra/examples/gcp # or examples/gcp-prod +cd relayer-channels-infra/examples/gcp # stg +cd relayer-channels-infra/examples/gcp-prod # prod ``` -### Step 5.3: Configure the Terraform Backend +### 4.3. Configure the Terraform Backend -In `versions.tf`, configure remote state. Do not keep state on a laptop in production. +In `versions.tf`, configure remote state: ```hcl terraform { @@ -350,32 +303,26 @@ terraform { } ``` -Initialize: - -```bash -terraform init -``` - -### Step 5.4: Create Your tfvars +### 4.4. Create Your Tfvars ```bash cp terraform.tfvars.example terraform.tfvars ``` -Minimum required configuration: +Minimum config: ```hcl project_id = "my-gcp-project" region = "us-east1" -environment = "prod" # or "stg" +environment = "prod" network = "default" subnetwork = "default" domain_name = "channels.your-company.com" -container_image = "us-east1-docker.pkg.dev/my-project/ecr-public/w5h5k2p1/openzeppelin-relayer-channels:mainnet-latest" -stellar_network = "mainnet" # or "testnet" +stellar_network = "mainnet" queue_backend = "pubsub" +container_image = "us-east1-docker.pkg.dev/my-project/ecr-public/w5h5k2p1/openzeppelin-relayer-channels:mainnet-latest" -# Secrets, never commit these +# Secrets — never commit these relayer_api_key = "" # set via TF_VAR_relayer_api_key channels_admin_secret = "" # set via TF_VAR_channels_admin_secret storage_encryption_key = "" # set via TF_VAR_storage_encryption_key @@ -387,66 +334,66 @@ Generate secrets: export TF_VAR_relayer_api_key="$(uuidgen | tr '[:upper:]' '[:lower:]')" export TF_VAR_channels_admin_secret="$(openssl rand -base64 32)" export TF_VAR_webhook_signing_key="$(openssl rand -hex 32)" -export TF_VAR_storage_encryption_key="$(openssl rand -base64 32)" # must be base64-encoded 32 bytes +export TF_VAR_storage_encryption_key="$(openssl rand -base64 32)" 
# must be base64, not hex ``` -### Step 5.5: Set Up Artifact Registry +### 4.5. Set Up Artifact Registry -Cloud Run cannot pull directly from ECR Public. Configure an Artifact Registry remote repository to proxy it: +Cloud Run can't pull from ECR Public directly. Set up a remote repo to proxy it: 1. GCP Console > **Artifact Registry** > **Create Repository** 2. Format: **Docker**, Mode: **Remote**, Source: **Custom**, URL: `https://public.ecr.aws` -3. Name it `ecr-public`, choose your region +3. Name it `ecr-public`, pick your region -Then reference the proxied image in your `container_image` tfvar (as shown in Step 5.4). +Then reference it in `container_image` in your tfvars (as shown in [section 4.4](#44-create-your-tfvars)). -Tag scheme: `mainnet-` (pinned, recommended for prod), `mainnet-latest` (tracks latest), `testnet-`, `testnet-latest`. +Tag scheme: `mainnet-` (pinned, use in prod), `mainnet-latest` (moves), `testnet-`, `testnet-latest`. -The public image ships with a public Soroban RPC endpoint that rate-limits under production load. Override it with private providers after deployment in Step 5.8. +The public image ships with `mainnet.sorobanrpc.com` as the default RPC. Override it with private providers after deployment (see [section 4.8](#48-override-rpc-endpoints)). -### Step 5.6: Plan and Apply +### 4.6. Deploy ```bash +terraform init terraform plan -out plan.tfplan terraform apply plan.tfplan ``` -The initial apply takes 10 to 15 minutes. Memorystore provisioning is the slowest leg. Private Service Access peering and SSL cert provisioning also take a few minutes. 
-**Key outputs:** +Key outputs: | Output | Used for | | --- | --- | -| `cloud_run_service_name` | Service management, `gcloud run` commands | -| `cloud_run_service_uri` | Direct Cloud Run access (bypasses the LB) | | `load_balancer_ip` | DNS record creation | -| `redis_host` | Manual Redis inspection (from a VM in the VPC) | -| `pubsub_topics` | Map of queue names to Pub/Sub topic names | -| `kms_signing_key_id` | Full KMS key ID for signer creation | -| `artifact_registry_url` | Artifact Registry URL | +| `cloud_run_service_name` | Service management | +| `kms_signing_key_id` | Signer creation | +| `artifact_registry_url` | Image pull path | -### Step 5.7: Set Up DNS and SSL +### 4.7. DNS and SSL -The Google-managed SSL certificate needs DNS to point at the load balancer IP before it can provision. +The Google-managed cert needs DNS pointing at the LB IP before it provisions. **Without Cloudflare:** - -1. Create an A record: `channels.your-company.com` to ``. -2. Wait 15 to 60 minutes for the certificate to provision (check status in GCP Console > Network Services > Load Balancing > certificate tab). +1. Create an A record: `channels.your-company.com` → `` +2. Wait 15–60 min for cert to go ACTIVE **With Cloudflare:** +1. Create Cloudflare A record → LB IP (proxy OFF, grey cloud) +2. Create Route53 A record → LB IP +3. Wait for cert to go ACTIVE +4. Change Route53 to CNAME → `channels.your-company.com.cdn.cloudflare.net` +5. Turn Cloudflare proxy ON (orange cloud) -1. Create a Cloudflare A record: `channels.your-company.com` to `` (proxy OFF initially, grey cloud). -2. Create a Route53 A record: `channels.your-company.com` to ``. -3. Wait for the Google-managed cert to become ACTIVE. -4. Switch Route53 to a CNAME: `channels.your-company.com` to `channels.your-company.com.cdn.cloudflare.net`. -5. Turn the Cloudflare proxy ON (orange cloud). + +If the cert stays `FAILED_NOT_VISIBLE` for 30+ min, bump the cert name suffix in `load-balancer.tf` (e.g. 
`-cert-v2` → `-cert-v3`) and re-apply. `create_before_destroy` swaps it without downtime. + -### Step 5.8: Override RPC Endpoints +### 4.8. Override RPC Endpoints -The public image ships with a public Soroban RPC endpoint that rate-limits under production load. After the service is healthy, override it with private providers. This is a one-time call — the config persists in Redis across restarts. +The public image uses the free public Soroban RPC, which rate-limits under load. After the service is healthy, override it with your private providers. This is a **one-time call** (the config persists in Redis). ```bash curl -s \ @@ -472,12 +419,10 @@ curl -s -H "Authorization: Bearer " \ Use at least two independent providers from different operators. The relayer load-balances by weight and rotates on failure. -Re-run this PATCH only if you restart with `RESET_STORAGE_ON_START=true`, which wipes Redis including the network config. Normal restarts and redeployments preserve it. +You only need to re-run this after a `RESET_STORAGE_ON_START=true` restart, which wipes Redis. Normal restarts preserve it. -### Step 5.9: Create the Fund-Relayer Signer - -Create a Cloud KMS signer using the provided script: +### 4.9. Create the Signer ```bash ENV=mainnet API_KEY="$TF_VAR_relayer_api_key" \ @@ -485,9 +430,7 @@ GCP_SA_KEY_FILE="$HOME/path/to/sa-key.json" \ ./scripts/gcp-kms-signer.sh ``` -This calls the relayer API with `"type": "google_cloud_kms"` and creates a signer backed by the Cloud KMS key that Terraform provisioned. - -Then create the fund relayer: +Then create the fund relayer via the relayer API: ```bash curl -s -X POST https://channels.your-company.com/api/v1/relayers \ @@ -504,66 +447,64 @@ curl -s -X POST https://channels.your-company.com/api/v1/relayers \ }' ``` -### Step 5.10: Bootstrap the Channel-Account Pool +### 4.10. Bootstrap Channels -Size the pool before bootstrapping. Formula: `min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor)`. 
Stellar settlement averages 5 to 7 seconds; use 1.5x as a safety factor. At 23 TPS sustained that gives 173 channels minimum (see §10.1 for detail). For a new deployment with no existing traffic, 50 to 100 channels is a reasonable starting point. Use `--dry-run` to preview what will be created before committing. +Size the pool before bootstrapping. Formula: `min_pool = ceil(target_TPS × avg_settlement_seconds × 1.5)`. Stellar settlement averages 5–7 seconds. At 23 TPS sustained that gives 173 channels minimum. Use `--dry-run` to preview before committing. -Install the `oz-channels` CLI from the `cli/` directory in this repo: +Install the CLI from `cli/` in this repo: ```bash -# From the root of relayer-channels-infra -cd cli -bun install -bun run build - -# Link the CLIs globally +cd cli && bun install && bun run build cd packages/oz-channels && bun link cd ../oz-relayer && bun link - -# Verify -oz-channels --help -oz-relayer --help ``` -Requires the [Bun](https://bun.sh) runtime (Node.js 22+ compatible). - -Create a profile and bootstrap: +Set up a profile and bootstrap: ```bash oz-channels profile init prod-mainnet -# Prompts for: URL, API key, plugin ID (channels), admin secret, network - -# Preview -oz-channels bootstrap --to 200 --dry-run -p prod-mainnet -# Provision -oz-channels bootstrap --to 200 -p prod-mainnet +oz-channels bootstrap --to 200 --dry-run -p prod-mainnet # preview +oz-channels bootstrap --to 200 -p prod-mainnet # provision ``` -### Step 5.11: Verify End-to-End +#### 4.10.1. Scaling Beyond ~100 Channels + +When scaling the pool aggressively (e.g. 100 → 1000 channels), `oz-channels bootstrap` will start failing with `TRY_AGAIN_LATER` or `tx_bad_seq` errors from Horizon. This happens because every `createAccount` operation uses the fund relayer (`channels-fund`) as the transaction source, serializing all submissions on a single sequence number. Under high concurrency, Horizon rejects the overlapping submissions. 
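When picking a target pool size for a scale-up like this, it can help to work the arithmetic through first. The sketch below is illustrative only: it applies the sizing formula from section 4.10 and counts how many batched submissions a scale-up needs, assuming the 100-operation batch size that this section attributes to `scripts/fund-new-channels.ts`. The function names are hypothetical and are not part of any CLI or script in the repo.

```python
import math


def min_pool_size(target_tps: float, settlement_seconds: float,
                  safety_factor: float = 1.5) -> int:
    """Sizing formula from section 4.10: ceil(TPS x settlement seconds x safety factor)."""
    return math.ceil(target_tps * settlement_seconds * safety_factor)


def batched_submissions(first_slot: int, last_slot: int,
                        ops_per_tx: int = 100) -> int:
    """Transactions needed to create slots first_slot..last_slot (inclusive).

    ops_per_tx defaults to 100, the batch size assumed above.
    """
    new_accounts = last_slot - first_slot + 1
    return math.ceil(new_accounts / ops_per_tx)


if __name__ == "__main__":
    # 23 TPS at the low and high ends of the 5-7 second settlement range.
    print(min_pool_size(23, 5))
    print(min_pool_size(23, 7))
    # Growing the pool from 100 to 1000 channels means creating slots 101..1000.
    print(batched_submissions(101, 1000))
```

The 173-channel figure quoted in section 4.10 corresponds to the 5-second end of the settlement range; sizing for 7 seconds gives a noticeably larger pool, so the settlement time you assume matters as much as the TPS target.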
+ +Use `scripts/fund-new-channels.ts` instead. It routes the transaction source through an existing funded channel account (e.g. `channel-0001`) while keeping the fund relayer as the operation source (so the treasury still pays). It also batches up to 100 `createAccount` ops per transaction, so a 100→1000 scale-up fits in ~9 submissions. ```bash -# Health check -curl -sS https://channels.your-company.com/api/v1/health +npx tsx scripts/fund-new-channels.ts \ + --env mainnet \ + --api-key \ + --source-relayer channel-0001 \ + --fund-relayer channels-fund \ + --from 101 --to 1000 \ + --starting-balance 2 \ + --report fund-report.json +``` -# Generate an API key (if Cloudflare is enabled) -curl -X POST https://channels.your-company.com/gen +The script is idempotent: it preflights every slot via the relayer API and Horizon, skipping any account already funded onchain. It is safe to re-run. -# Smoke test +### 4.11. Verify + +```bash +curl -sS https://channels.your-company.com/api/v1/health oz-channels smoke run -p prod-mainnet ``` -A healthy service returns `{"status":"ok"}` on the health check. The smoke test submits a test transaction end-to-end and polls for confirmation — success prints a confirmed transaction ID. If the smoke test times out without confirmation, check channel pool size (`oz-channels channels list -p prod-mainnet`) and fund account balance (`oz-relayer relayer balance channels-fund -p prod-mainnet`) before debugging further. +A healthy service returns `{"status":"ok"}`. The smoke test submits a test transaction end-to-end and polls for confirmation; success prints a confirmed transaction ID. If it times out, check channel pool size and fund account balance before debugging further. --- -## 6. Configuration Reference +## 5. Configuration Reference -Reference for all environment variables and secrets the module manages automatically. See §11 for the full Terraform variable listing.
+Most environment variables are managed by the Terraform module and should not be overridden without a specific reason. The tables below document what the module sets automatically and which values operators should tune for production scale. -### Module-Managed Container Environment Variables +### 5.1. Module-Managed Container Environment Variables The Terraform module sets these. Do not override them unless you have a specific reason. @@ -572,123 +513,110 @@ The Terraform module sets these. Do not override them unless you have a specific | `HOST` | `0.0.0.0` | Module | | `STELLAR_NETWORK` | `var.stellar_network` | Module | | `FUND_RELAYER_ID` | `var.fund_relayer_id` | Module | -| `API_KEY_HEADER` | `x-consumer-key` | Module, keyed to the Cloudflare Worker rewrite | +| `API_KEY_HEADER` | `x-consumer-key` | Module | | `REPOSITORY_STORAGE_TYPE` | `redis` | Module | | `RESET_STORAGE_ON_START` | `false` | Module | | `METRICS_ENABLED` | `true` | Module | | `METRICS_PORT` | `8081` | Module | | `LOG_FORMAT` | `json` | Module | | `LOG_LEVEL` | `var.log_level` | Module | -| `REDIS_URL` | `redis://:` | Module, derived from Memorystore | -| `REDIS_READER_URL` | `redis://:` | Module, falls back to primary on BASIC tier | +| `REDIS_URL` | `redis://:` | Module | +| `REDIS_READER_URL` | `redis://:` | Module | | `GCP_PROJECT_ID` | `var.project_id` | Module | | `GCP_REGION` | `var.region` | Module | | `DISTRIBUTED_MODE` | `var.distributed_mode` | Module | -| `QUEUE_BACKEND` | `var.queue_backend` (when distributed) | Module | -| `PUBSUB_TOPIC_PREFIX` | Auto-derived: `relayer-{network}-{environment}` | Module | +| `QUEUE_BACKEND` | `var.queue_backend` | Module | +| `PUBSUB_TOPIC_PREFIX` | `relayer-{network}-{environment}` | Module | | `PUBSUB_PROJECT_ID` | `var.project_id` | Module | -### Module-Managed Secrets (from Secret Manager) +### 5.2. Module-Managed Secrets | Container env var | Secret Manager ID | Required? 
| Notes | | --- | --- | --- | --- | -| `API_KEY` | `{app_name}-relayer-api-key` | Yes | Authenticates all API requests to the relayer | -| `PLUGIN_ADMIN_SECRET` | `{app_name}-channels-admin-secret` | Yes | Required for channel management operations | -| `WEBHOOK_SIGNING_KEY` | `{app_name}-webhook-signing-key` | Optional | Only created when `webhook_signing_key` is set in tfvars. Required if you use webhook notifications, otherwise omit it. | -| `STORAGE_ENCRYPTION_KEY` | `{app_name}-storage-encryption-key` | Optional | Only created when `storage_encryption_key` is set in tfvars. Encrypts sensitive data at rest in Redis. Strongly recommended for production. Must be base64-encoded 32 bytes (`openssl rand -base64 32`). | +| `API_KEY` | `{app_name}-relayer-api-key` | Yes | Authenticates all API requests | +| `PLUGIN_ADMIN_SECRET` | `{app_name}-channels-admin-secret` | Yes | Required for channel management | +| `WEBHOOK_SIGNING_KEY` | `{app_name}-webhook-signing-key` | Optional | Only set if using webhook notifications | +| `STORAGE_ENCRYPTION_KEY` | `{app_name}-storage-encryption-key` | Optional | Encrypts data at rest in Redis. Strongly recommended for prod. Must be base64-encoded 32 bytes. | -The `lifecycle { ignore_changes = [secret_data] }` on secret versions means that once a secret is created, Terraform will not overwrite the value if you rotate it through `gcloud` or the Console. - -**Rotation procedure:** +Rotation procedure: ```bash -# Update the secret echo -n "new-value" | gcloud secrets versions add \ relayer-channels-relayer-api-key --data-file=- \ --project=your-project -# Force Cloud Run to pick up the new value gcloud run services update relayer-channels-service \ --region=us-east1 --project=your-project \ --update-labels="redeploy=$(date +%s)" ``` -### Production Reference Values +### 5.3. 
Production Reference Values -If you are targeting OpenZeppelin's reference scale (about 2M+ tx/day), these are the env-var values to tune: +If you are targeting OpenZeppelin's reference scale (~2M+ tx/day), these are the env vars to tune: ```hcl container_environment = [ - # Worker concurrency - { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY", value = "200" }, - { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "200" }, - { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" }, - { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY", value = "1" }, - { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY", value = "1" }, - { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY", value = "1" }, - { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY", value = "1" }, - { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY", value = "1" }, - - # API + plugin concurrency - { name = "RELAYER_CONCURRENCY_LIMIT", value = "800" }, - { name = "PLUGIN_MAX_CONCURRENCY", value = "8000" }, - { name = "MAX_CONNECTIONS", value = "4000" }, - - # Timeouts - { name = "REQUEST_TIMEOUT_SECONDS", value = "60" }, - { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" }, - { name = "PLUGIN_GLOBAL_TIMEOUT_MS", value = "55000" }, - { name = "PLUGIN_POLLING_TIMEOUT_MS", value = "45000" }, - - # Rate limits - { name = "RATE_LIMIT_REQUESTS_PER_SECOND", value = "400" }, - - # Redis pools - { name = "REDIS_POOL_MAX_SIZE", value = "3000" }, - { name = "REDIS_READER_POOL_MAX_SIZE", value = "3000" }, - - # Transaction cleanup - { name = "TRANSACTION_EXPIRATION_HOURS", value = "0.1" }, - - # Contract-level pool isolation - { name = "LIMITED_CONTRACTS", value = "C,C" }, - { name = "CONTRACT_CAPACITY_RATIO", value = "0.6" }, + { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = 
"200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY", value = "1" }, + { name = "RELAYER_CONCURRENCY_LIMIT", value = "800" }, + { name = "PLUGIN_MAX_CONCURRENCY", value = "8000" }, + { name = "MAX_CONNECTIONS", value = "4000" }, + { name = "REQUEST_TIMEOUT_SECONDS", value = "60" }, + { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" }, + { name = "PLUGIN_GLOBAL_TIMEOUT_MS", value = "55000" }, + { name = "PLUGIN_POLLING_TIMEOUT_MS", value = "45000" }, + { name = "RATE_LIMIT_REQUESTS_PER_SECOND", value = "400" }, + { name = "REDIS_POOL_MAX_SIZE", value = "3000" }, + { name = "REDIS_READER_POOL_MAX_SIZE", value = "3000" }, + { name = "TRANSACTION_EXPIRATION_HOURS", value = "0.1" }, + { name = "LIMITED_CONTRACTS", value = "C,C" }, + { name = "CONTRACT_CAPACITY_RATIO", value = "0.6" }, ] ``` -### Environment-Based Defaults +--- -| Setting | Production | Non-production | -| --- | --- | --- | -| Min Cloud Run instances | 2 | 1 | -| Max Cloud Run instances | 10 | 4 | -| CPU always allocated | Yes | No | -| Redis tier | STANDARD_HA (failover) | BASIC | -| Redis memory | 5 GB | 1 GB | -| LB deletion protection | Enabled | Disabled | -| Log retention | 30 days | 7 days | +## 6. Cloudflare (Optional) ---- +When enabled, a Cloudflare Worker handles API-key issuance (`/gen`), per-key rate limiting, and proxies requests to the LB with static-key injection. -## 7. 
Operational Playbook +```hcl +enable_cloudflare = true +cloudflare_api_token = "your-token" +cloudflare_zone_id = "your-zone-id" +cloudflare_account_id = "your-account-id" +relayer_static_api_key = "same-as-your-relayer_api_key" +key_salt = "" +cf_analytics_api_token = "your-token" +``` + +`relayer_static_api_key` should match your `relayer_api_key`; the Worker swaps every user's Bearer token for this key upstream. `key_salt` is used to hash user keys before storing in KV. + +### 6.1. Without Cloudflare + +The `/gen` endpoint is not available; there's no self-service API-key generation. Callers authenticate directly with the `relayer_api_key` you configured. If you need per-user keys or rate limiting without Cloudflare, build that into your own API gateway layer in front of the load balancer. -Day-2 operations: routine deploys, rollbacks, scaling, channel-pool management, and observability. For initial provisioning, see §5. +--- -### 7.1 Deploys +## 7. Operations -Routine deploy (new container image): +Routine operations follow the same `terraform apply` workflow as the initial deployment. Stellar-specific operations (managing the channel pool, inspecting transactions) use the CLIs in `cli/`. -1. Build and push the new image to Artifact Registry (or update the remote repo tag). -2. Update `container_image` in tfvars to the new tag. -3. Run `terraform apply`. Cloud Run creates a new revision and routes traffic to it. +### 7.1. Deploys -### 7.2 Rollbacks +To deploy a new version, update `container_image` in your tfvars and run `terraform apply`. Cloud Run creates a new revision and shifts traffic over automatically with no downtime. -Set `container_image` back to the previous tag and run `terraform apply`. Cloud Run keeps previous revisions available for instant rollback. +### 7.2. Rollbacks -### 7.3 Scaling +To roll back, set `container_image` to the previous version tag in your tfvars and run `terraform apply`. -Adjust in tfvars: +### 7.3. 
Scaling ```hcl cpu = "4" @@ -697,43 +625,18 @@ min_instance_count = 3 max_instance_count = 20 ``` -Running `terraform apply` applies the change without interruption. +Run `terraform apply` to pick up the new limits. Cloud Run handles the transition without downtime. -### 7.4 Channel-Pool Management +### 7.4. Channel Pool ```bash -# Add slots 201..400 -oz-channels bootstrap --from 201 --to 400 -p prod-mainnet - -# List current channels +oz-channels bootstrap --from 201 --to 400 -p prod-mainnet # grow the pool oz-channels channels list -p prod-mainnet - -# Add or remove individual channels oz-channels channels add channel-0050 -p prod-mainnet oz-channels channels remove channel-0050 -p prod-mainnet ``` -### 7.5 Monitoring Pub/Sub - -Check queue health in **GCP Console > Pub/Sub > Subscriptions > Metrics tab**: - -| Metric | Watch for | -| --- | --- | -| `num_undelivered_messages` | A growing backlog means processing is falling behind | -| `oldest_unacked_message_age` | Above 60s sustained means workers may be stuck | -| Pull/Ack operations | Healthy when messages are consumed as fast as they arrive | - -### 7.6 Monitoring Redis - -Check in **GCP Console > Memorystore > Instance > Monitoring tab**: - -| Metric | Watch for | -| --- | --- | -| CPU utilization | Spikes above 75% sustained | -| Memory usage | Climbing past 70% | -| Connected clients | Approaching the connection limit | - -### 7.7 Inspecting Transactions +### 7.5. Transactions ```bash oz-relayer tx show -r channels-fund -p prod-mainnet --json @@ -741,22 +644,22 @@ oz-relayer tx list -r channels-fund --status pending -p prod-mainnet oz-relayer relayer balance channels-fund -p prod-mainnet ``` -### 7.8 Observability +--- -The relayer emits structured JSON logs and Prometheus-format metrics. On GCP, these map to Cloud Logging and Cloud Monitoring. +## 8. Observability -#### Cloud Logging +The service emits structured JSON logs to Cloud Logging, Cloud Run request metrics, and Pub/Sub queue metrics. 
Set up the log-based metrics and alerting policies below before putting the service under production load. -Cloud Run streams `stdout` and `stderr` to Cloud Logging automatically. With `LOG_FORMAT=json`, the relayer produces structured entries with fields like `level`, `target`, `span.tx_id`, `span.relayer_id`, and `span.request_id`. +### 8.1. Logs -Viewing logs: +Cloud Run streams structured JSON logs to Cloud Logging. ```bash -# Recent errors +# Errors in the last hour gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' \ --project=your-project --limit=20 --freshness=1h --format='value(textPayload)' -# Filter by transaction ID +# Filter by tx ID gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:""' \ --project=your-project --limit=20 --freshness=1h @@ -765,89 +668,68 @@ gcloud logging tail 'resource.type="cloud_run_revision" AND resource.labels.serv --project=your-project ``` -In the Console: Cloud Logging > Logs Explorer, then filter by `resource.type="cloud_run_revision"` and `resource.labels.service_name=""`. - -#### Cloud Monitoring Built-In Metrics +### 8.2. Cloud Run Metrics -Cloud Run and Pub/Sub emit metrics to Cloud Monitoring automatically, with no agent required. +Console > Cloud Run > Service > Metrics: -Cloud Run metrics (GCP Console > Cloud Run > Service > Metrics tab): - -| Metric | What it tells you | +| Metric | Signal | | --- | --- | -| `run.googleapis.com/container/cpu/utilization` | CPU usage per instance. Sustained above 80% means scale up. | -| `run.googleapis.com/container/memory/utilization` | Memory usage. Sustained above 70% risks OOM. | -| `run.googleapis.com/request_count` | Request throughput by response code. Watch for 5xx spikes. | -| `run.googleapis.com/request_latencies` | p50/p95/p99 latency. Watch for degradation. | -| `run.googleapis.com/container/instance_count` | Active instances. Confirms autoscaling behavior. 
| -| `run.googleapis.com/container/startup_latencies` | Cold-start time. High values affect first-request latency. | +| `container/cpu/utilization` | >80% sustained → scale up | +| `container/memory/utilization` | >70% → risk of OOM | +| `request_count` by status | 5xx spikes | +| `request_latencies` | p95/p99 degradation | +| `container/instance_count` | autoscaling behavior | + +### 8.3. Pub/Sub Metrics -Pub/Sub metrics (GCP Console > Pub/Sub > Subscription > Metrics tab): +Console > Pub/Sub > Subscription > Metrics: -| Metric | What it tells you | +| Metric | Signal | | --- | --- | -| `pubsub.googleapis.com/subscription/num_undelivered_messages` | Queue depth. A growing backlog means processing is falling behind. | -| `pubsub.googleapis.com/subscription/oldest_unacked_message_age` | How long the oldest message has waited. Above 60s sustained means workers may be stuck. | -| `pubsub.googleapis.com/subscription/pull_message_operation_count` | Pull throughput. Confirms workers are active. | -| `pubsub.googleapis.com/subscription/ack_message_operation_count` | Ack throughput. Confirms messages are being processed. | +| `num_undelivered_messages` | growing backlog → falling behind | +| `oldest_unacked_message_age` | >60s → workers stuck | +| `pull_message_operation_count` | confirms workers are active | -Memorystore metrics (GCP Console > Memorystore > Instance > Monitoring tab): +### 8.4. Memorystore Metrics -| Metric | What it tells you | +Console > Memorystore > Instance > Monitoring: + +| Metric | Signal | | --- | --- | -| `redis.googleapis.com/stats/cpu_utilization` | Redis CPU. Spikes above 75% sustained need attention. | -| `redis.googleapis.com/stats/memory/usage_ratio` | Memory usage. Climbing past 70% means you should plan capacity. | -| `redis.googleapis.com/stats/connected_clients` | Connection count. Watch for approaching limits. | -| `redis.googleapis.com/stats/commands_processed` | Command throughput. Correlates with transaction volume. 
| +| CPU utilization | >75% sustained | +| Memory usage ratio | >70% | +| Connected clients | near limit | -#### Log-Based Metrics +### 8.5. Log-Based Metrics -Create custom metrics from log patterns in **Cloud Logging > Log-based Metrics > Create Metric**: +Create in Cloud Logging > Log-based Metrics > Create Metric: | Metric name | Filter | Purpose | | --- | --- | --- | -| `relayer/errors` | `resource.type="cloud_run_revision" AND severity>=ERROR` | Total error rate | -| `relayer/pool_capacity` | `textPayload:"POOL_CAPACITY"` | Channel pool exhaustion events | +| `relayer/errors` | `severity>=ERROR` | Total error rate | +| `relayer/pool_capacity` | `textPayload:"POOL_CAPACITY"` | Pool exhaustion events | | `relayer/provider_paused` | `textPayload:"provider paused"` | RPC failover events | -| `relayer/tx_confirmed` | `textPayload:"confirmed"` | Transaction confirmation rate | - -Or through gcloud: - -```bash -gcloud logging metrics create relayer-errors \ - --project=your-project \ - --description="Relayer error count" \ - --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' -``` -#### Alerting +### 8.6. 
Alerting -Create alert policies in **Cloud Monitoring > Alerting > Create Policy**: - -| Alert | Metric | Condition | Severity | -| --- | --- | --- | --- | -| High error rate | `relayer/errors` (log-based) | More than 50 errors in 5 min | Critical | -| Cloud Run high CPU | `container/cpu/utilization` | Above 80% for 10 min | Warning | -| Cloud Run high memory | `container/memory/utilization` | Above 70% for 10 min | Warning | -| Pub/Sub backlog growing | `subscription/num_undelivered_messages` | Above 5000 for 10 min | Warning | -| Pub/Sub old messages | `subscription/oldest_unacked_message_age` | Above 300s for 5 min | Critical | -| Pool exhaustion | `relayer/pool_capacity` (log-based) | Above 0 in 5 min | Critical | +Key alert policies to set up in Cloud Monitoring > Alerting: -Configure notification channels (email, Slack, PagerDuty) in **Cloud Monitoring > Alerting > Notification Channels**. - -#### Prometheus Metrics - -The relayer exposes Prometheus-format metrics on port `8081` at `/debug/metrics/scrape` (enabled by `METRICS_ENABLED=true`). When `enable_prometheus = true`, the Cloud Run service account has `monitoring.metricWriter` permissions for Google Cloud Managed Prometheus. +| Alert | Condition | Severity | +| --- | --- | --- | +| High error rate | >50 errors in 5 min | Critical | +| Cloud Run high CPU | >80% for 10 min | Warning | +| Cloud Run high memory | >70% for 10 min | Warning | +| Pub/Sub backlog | >5000 messages for 10 min | Warning | +| Pub/Sub old messages | >300s for 5 min | Critical | +| Pool exhaustion | `POOL_CAPACITY` log > 0 in 5 min | Critical | -To scrape these metrics: +### 8.7. Prometheus -- Use Google Cloud Managed Prometheus with a sidecar collector. -- Run a self-hosted Prometheus instance that scrapes the Cloud Run service. -- Rely on the built-in Cloud Run metrics above for most operational needs. +The relayer exposes metrics at `:8081/debug/metrics/scrape`. 
Scrape with Google Cloud Managed Prometheus or your own Prometheus instance. -### 7.9 Stellar-Side Monitoring +### 8.8. Stellar-Side Monitoring -GCP metrics reflect service health. These signals reflect Stellar network health; monitor both. +GCP metrics reflect service health. These check the Stellar network side; monitor both. **Fund account balance:** @@ -855,17 +737,17 @@ GCP metrics reflect service health. These signals reflect Stellar network health oz-relayer relayer balance channels-fund -p prod-mainnet ``` -Alert when the balance drops below 50 XLM. A depleted fund account fails all fee-bumps silently — transactions submit but cannot be paid for. +Alert when balance drops below 50 XLM. A depleted fund account fails all fee-bumps silently. -**Ledger close time:** Stellar closes a ledger roughly every 5 seconds under normal conditions. Sustained close times above 10 seconds indicate network stress; settlement latency will exceed the assumptions used in your channel pool sizing. Query Horizon to check: +**Ledger close time:** Stellar closes a ledger roughly every 5 seconds normally. Sustained close times above 10 seconds indicate network stress and inflate settlement latency beyond your pool sizing assumptions. ```bash curl -sS "https://horizon.stellar.org/ledgers?order=desc&limit=5" | jq '._embedded.records[] | {sequence, closed_at}' ``` -**`TRY_AGAIN_LATER` in logs:** Horizon is rejecting transactions due to fee competition. This is a Stellar congestion event, not a service failure. Raise `MAX_FEE` (see §10.7). If `TRY_AGAIN_LATER` appears alongside `provider paused`, check RPC provider health first — an unresponsive provider can force retries against a congested fallback. +**`TRY_AGAIN_LATER` in logs:** Horizon is rejecting transactions due to fee competition. Raise `MAX_FEE` (see [section 12.7](#127-fee-bump-tuning-under-congestion)). If it appears alongside `provider paused`, check RPC provider health first. 
-**RPC provider health:** Confirm both endpoints are reachable: +**RPC provider health:** ```bash curl -sS -X POST \ @@ -875,155 +757,144 @@ curl -sS -X POST \ --- -## 8. Debugging Guide +## 9. Debugging -### How to Think About Errors +Almost every failure belongs to a specific layer. Identify the layer first, then pull the logs for that component. -Almost every failure in this system belongs to one of several layers, and the fastest way to debug is to decide which layer owns the symptom before you start reading logs. A request travels from the edge (Cloudflare) to the load balancer, to Cloud Run, into the Channels plugin. The plugin then talks to Redis, Pub/Sub, Cloud KMS, and the Stellar RPC. A 5xx returned at the edge is a different problem from a transaction that was accepted, queued, signed, and then rejected by Horizon. +A request that never returns a `tx_id` failed in the synchronous path (edge, LB, auth, fee budget, enqueue). A request that returned a `tx_id` but never confirmed failed in the async path (channel acquisition, build/simulate, sign, fee-bump, submit, status poll). -So when something breaks, work in this order: +Pool exhaustion, sequence drift, and an RPC throttle all look like "transactions are failing" from the outside; each lives in a different layer and has a different fix. -1. **Where did it fail?** A request that never returns a `tx_id` failed before or during the synchronous path (edge, LB, auth, fee budget, enqueue). A request that returned a `tx_id` but never confirmed failed in the async path (channel acquisition, build/simulate, sign, fee-bump, submit, status poll). -2. **What layer owns that step?** Match it to a component: auth and rate limits live at the edge and the relayer API, sequence and channel contention live in Redis and the plugin, signing lives in KMS, and the final accept or reject comes from the RPC and Horizon. -3. 
**Pull the logs for that layer** using the entry points below, then match against the common patterns. - -The point of this ordering is to avoid reading the wrong logs. Pool exhaustion, sequence drift, and an RPC throttle all look like "transactions are failing" from the outside, but each one lives in a different layer and has a different fix. - -### Entry Points - -| You have | Start with | +| You have | Do this | | --- | --- | | Transaction ID | `oz-relayer tx show -r channels-fund --json -p ` | -| Error message | Search Cloud Logging for the error pattern | -| Time window | `gcloud logging read` with `--freshness` | -| Stellar tx hash | Query Horizon, then work backwards to the relayer's tx record | -| "What's failing right now" | Filter logs by `severity>=ERROR` | +| Error message | Search Cloud Logging: `textPayload:""` | +| "What's broken right now" | `gcloud logging read ... AND severity>=ERROR` | +| Stellar tx hash | Check Horizon, then find the relayer tx record | -### Common Log Patterns +Common log patterns: -| Pattern | What it means | +| Pattern | Means | | --- | --- | -| `provider paused` | RPC failover triggered | -| `sequence`, `counter` | Sequence-number drift or contention | -| `POOL_CAPACITY` | Channel-account pool exhausted | -| `LOCKED_CONFLICT` | Two workers tried to acquire the same channel | -| `TRY_AGAIN_LATER` | Horizon-side throttling | +| `provider paused` | RPC failover kicked in | +| `POOL_CAPACITY` | Channel pool exhausted; bootstrap more | +| `LOCKED_CONFLICT` | Two workers grabbed the same channel | +| `TRY_AGAIN_LATER` | Horizon throttling | -### Redis Inspection +### 9.1. Redis Inspection Connect from a VM in the same VPC: ```bash -redis-cli -h -p +redis-cli -h -p 6379 KEYS *tx:* GET "oz-relayer:relayer:channels-fund:tx:" ``` --- -## 9. Security Model +## 10. Security -Covers secrets handling, network isolation, IAM role assignments, TLS posture, and KMS key management. 
Review before modifying IAM bindings or network ingress settings. +This section documents the security posture of the deployed infrastructure. Review it before go-live and consult it when rotating credentials or adjusting network ingress rules. -### 9.1 Secrets Handling +### 10.1. Secrets -All secrets are stored in Secret Manager. They are currently passed as plain environment variables to Cloud Run. See Known Issues for the plan to switch to `secret_key_ref` references. +All secrets are stored in Secret Manager, passed as env vars to Cloud Run. See Known Issues for the plan to switch to `secret_key_ref` references. -### 9.2 Network Isolation +### 10.2. Network Isolation -- **Cloud Run ingress:** restricted to internal plus load balancer traffic (`INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER` in production, `INGRESS_TRAFFIC_ALL` for testing). -- **Cloud Run egress:** a VPC Connector with `PRIVATE_RANGES_ONLY`. Private traffic goes through the VPC (to Memorystore), and public traffic (Stellar RPC, KMS API) goes direct. -- **Memorystore:** reachable only through Private Service Access (VPC peering). No public IP. -- **Pub/Sub:** IAM-scoped. Only the Cloud Run service account has publisher and subscriber access to the relayer's topics. +- **Cloud Run ingress:** `INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER` in prod; `INGRESS_TRAFFIC_ALL` for testing. +- **Cloud Run egress:** VPC Connector with `PRIVATE_RANGES_ONLY`. Private traffic goes through the VPC (to Memorystore); public traffic (Stellar RPC, KMS API) goes direct. +- **Memorystore:** Private Service Access only, no public IP. +- **Pub/Sub:** IAM-scoped per topic/subscription. -### 9.3 IAM Least-Privilege +### 10.3. 
IAM -The Cloud Run service account (`{app_name}-run`) has: +The Cloud Run SA (`{app_name}-run`) gets: -| Role | Scope | Purpose | -| --- | --- | --- | -| `secretmanager.secretAccessor` | Per-secret | Read secrets at startup | -| `monitoring.metricWriter` | Project | Write custom metrics | -| `logging.logWriter` | Project | Write application logs | -| `monitoring.viewer` | Project | Read Pub/Sub backlog depth | -| `cloudkms.signerVerifier` | Per-key | Sign transactions | -| `cloudkms.publicKeyViewer` | Per-key | Read the public key | -| `pubsub.publisher` | Per-topic | Publish job messages | -| `pubsub.subscriber` | Per-subscription | Pull and ack messages | -| `artifactregistry.reader` | Per-repository | Pull container images | - -### 9.4 TLS Posture - -- **Load balancer:** Google-managed SSL certificate, HTTPS on 443, HTTP redirects to HTTPS. -- **Memorystore:** transit encryption is disabled, since Private Service Access provides network-level isolation. Enable it if your compliance requirements call for it and the relayer binary supports TLS (see Known Issues). +| Role | Scope | +| --- | --- | +| `secretmanager.secretAccessor` | per-secret | +| `monitoring.metricWriter` | project | +| `logging.logWriter` | project | +| `monitoring.viewer` | project | +| `cloudkms.signerVerifier` | per-key | +| `cloudkms.publicKeyViewer` | per-key | +| `pubsub.publisher` | per-topic | +| `pubsub.subscriber` | per-subscription | +| `artifactregistry.reader` | per-repo | + +### 10.4. TLS + +- **Load balancer:** Google-managed SSL cert, HTTPS on 443, HTTP redirects to HTTPS. +- **Memorystore:** transit encryption is disabled (see Known Issues). Private Service Access provides network-level isolation. - **Cloudflare to LB:** set the Cloudflare zone SSL mode to "Full" for end-to-end TLS. -### 9.5 Cloud KMS for Stellar Signers +### 10.5. Cloud KMS -- **Key algorithm:** `EC_SIGN_ED25519` (the Stellar-compatible ED25519 curve). -- **Protection level:** `SOFTWARE`. 
HSM is also supported but adds latency. -- **IAM:** the Cloud Run SA has `signerVerifier` and `publicKeyViewer` on the key. -- **Rotation:** provision a new key, register a new signer and relayer, fund the new on-chain account, drain the old one, then retire it. +`EC_SIGN_ED25519`, SOFTWARE protection. Rotation: provision a new key, register a new signer and relayer, fund the new onchain account, drain the old one, retire it. --- -## 10. Key Gotchas +## 11. Post-Restart Checklist -Operational sharp edges encountered in production deployments. Each item describes a failure mode, its cause, and the fix. +If you ever restart with `RESET_STORAGE_ON_START=true` (which wipes Redis), you need to redo the following (the service will be up but non-functional until these are done): -### 10.1 Channel-Account Exhaustion (`POOL_CAPACITY`) +1. **Re-create the signer:** `./scripts/gcp-kms-signer.sh` ([section 4.9](#49-create-the-signer)) +2. **Re-create the fund relayer:** via the relayer API using the new signer ID +3. **Re-run the RPC override:** the PATCH to `/api/v1/networks/stellar:mainnet` ([section 4.8](#48-override-rpc-endpoints)) +4. **Re-bootstrap channels:** `oz-channels bootstrap --to -p ` ([section 4.10](#410-bootstrap-channels)) +5. **Fund the fund relayer:** if the onchain account was recreated, send XLM to the new address -Sizing formula: +Normal restarts and redeployments (without `RESET_STORAGE_ON_START=true`) preserve everything in Redis; none of the above is needed. -``` -min_pool = ceil(target_TPS x avg_settlement_seconds x safety_factor) -``` +--- -At about 23 TPS sustained, with roughly 5s Stellar settlement and a 1.5x safety factor: `23 x 5 x 1.5 = 173` channels minimum. +## 12. Gotchas -Recovery: `oz-channels bootstrap --from --to `. +Common deployment and operational pitfalls, with fixes. Check here first when something does not behave as expected. -### 10.2 SSL Certificate Provisioning +### 12.1. 
Channel Pool Exhaustion -Google-managed certificates need DNS to point at the LB IP before they provision. With Cloudflare enabled, you have to temporarily point DNS straight at the LB IP (bypassing the Cloudflare proxy), wait for the cert to become ACTIVE, then switch to the Cloudflare CNAME. +`min_pool = ceil(TPS × settlement_seconds × 1.5)`. At 23 TPS with 5s settlement: 173 channels minimum. Fix: `oz-channels bootstrap --from --to `. - -If the cert is stuck in `FAILED_NOT_VISIBLE` for more than 30 minutes, it usually needs to be recreated. Bump the cert name suffix in `load-balancer.tf` (for example `-cert-v2` to `-cert-v3`) and re-apply. The `create_before_destroy` lifecycle provisions the new cert before removing the old one, so there is no downtime. - +### 12.2. SSL Cert Provisioning + +Google needs DNS pointing at the LB IP before it issues the cert. With Cloudflare, turn proxy off first, wait for ACTIVE, then proxy back on. If the cert stays `FAILED_NOT_VISIBLE` for 30+ min, bump the cert name suffix in `load-balancer.tf` and re-apply (`create_before_destroy` swaps it without downtime). -### 10.3 VPC Connector CIDR Overlap +### 12.3. VPC Connector CIDR Overlap -If you run multiple environments (stg and prod) in the same VPC, each one needs a unique `connector_ip_cidr_range` (for example `10.8.0.0/28` for stg and `10.9.0.0/28` for prod). +Each environment in the same VPC needs a different `/28` CIDR range (e.g. `10.8.0.0/28` for stg, `10.9.0.0/28` for prod). -### 10.4 Private Service Access (Shared Connection) +### 12.4. Private Service Access Shared Connection -A VPC can hold only one Private Service Access connection to `servicenetworking.googleapis.com`. If stg creates it first, prod's apply will fail unless `update_on_creation_fail = true` is set on the `google_service_networking_connection` resource. The module handles this. +A VPC can hold only one Private Service Access connection to `servicenetworking.googleapis.com`. 
If stg creates it first, prod's apply will fail unless `update_on_creation_fail = true` is set on the connection resource. The module handles this. -### 10.5 Pub/Sub Topic Prefix and Image Compatibility +### 12.5. Pub/Sub Topic Prefix -The `PUBSUB_TOPIC_PREFIX` env var has to match what the container image expects. Different image versions may or may not append a trailing dash to the prefix. If you see "topic does not exist" errors with double dashes (`relayer-mainnet-prod--`), remove the trailing dash from the prefix. If topics are missing entirely (no dash), add it back. +`PUBSUB_TOPIC_PREFIX` must match what the image expects. Double-dash errors (`relayer-mainnet-prod--`) mean the prefix has a trailing dash the image doesn't expect. Adjust via `container_environment` if needed. -### 10.6 STORAGE_ENCRYPTION_KEY Format +### 12.6. Encryption Key Format -The encryption key has to be base64-encoded 32 bytes (44 characters with `=` padding). Generate it with `openssl rand -base64 32`. Hex-encoded keys fail silently with "Invalid key length: expected 32 bytes, got 0". +`storage_encryption_key` must be base64-encoded 32 bytes (`openssl rand -base64 32`). Hex keys fail silently with "Invalid key length: expected 32 bytes, got 0". -### 10.7 Fee-Bump Tuning Under Congestion +### 12.7. Fee-Bump Tuning Under Congestion -Set this through the `MAX_FEE` env var (default `1000000` stroops, which is 0.1 XLM). Under network congestion, raise it to `10000000` (1 XLM). The Channels plugin uses static fees, so it does not dynamically bump on `INSUFFICIENT_FEE`. +`MAX_FEE` defaults to 1M stroops (0.1 XLM). Raise to 10M during network congestion. The plugin uses static fees with no automatic bumping on `INSUFFICIENT_FEE`. --- -## 11. Terraform Variables Reference +## 13. Variables -Complete listing of all module variables. Required variables must be set in `terraform.tfvars`; optional variables document their module defaults here. +Full variable reference for the Terraform module. 
Required variables must be set in your tfvars file; optional variables have defaults that the module adjusts automatically based on the environment value. -### Required +### 13.1. Required | Name | Type | Description | -| --- | --- | --- | +|------|------|-------------| | `project_id` | `string` | GCP project ID | -| `region` | `string` | GCP region (for example `us-east1`) | -| `environment` | `string` | Deployment environment (`prod`, `stg`). 1 to 16 chars. | +| `region` | `string` | GCP region | +| `environment` | `string` | `prod`, `stg`, etc. (1–16 chars) | | `network` | `string` | VPC network name or self_link | | `subnetwork` | `string` | Subnet name or self_link | | `domain_name` | `string` | FQDN for the service | @@ -1031,106 +902,103 @@ Complete listing of all module variables. Required variables must be set in `ter | `relayer_api_key` | `string` | Relayer API key (sensitive) | | `channels_admin_secret` | `string` | Admin secret (sensitive) | -### Optional, Core +### 13.2. Optional: Core | Name | Type | Default | Description | -| --- | --- | --- | --- | +|------|------|---------|-------------| | `app_name` | `string` | `"relayer-channels"` | Resource name prefix | | `name_suffix_environment` | `bool` | `true` | Append `-{env}` to names (auto-off for prod) | | `labels` | `map(string)` | `{}` | Labels for all resources | -### Optional, Networking +### 13.3. Optional: Networking | Name | Type | Default | Description | -| --- | --- | --- | --- | +|------|------|---------|-------------| | `connector_machine_type` | `string` | `"e2-micro"` | VPC connector machine type | | `connector_min_instances` | `number` | `2` | Min connector instances | | `connector_max_instances` | `number` | `3` | Max connector instances | | `connector_ip_cidr_range` | `string` | `"10.8.0.0/28"` | CIDR for the VPC connector (/28, must not overlap) | -### Optional, Container / Cloud Run +### 13.4. 
Optional: Container / Cloud Run | Name | Type | Default | Description | -| --- | --- | --- | --- | -| `container_port` | `number` | `8080` | Container port | -| `cpu` | `string` | `"1"` | CPU allocation (`"1"`, `"2"`, `"4"`) | -| `memory` | `string` | `"2Gi"` | Memory allocation | -| `min_instance_count` | `number` | `null` | Min instances. Auto: 2 (prod), 1 (non-prod) | -| `max_instance_count` | `number` | `null` | Max instances. Auto: 10 (prod), 4 (non-prod) | -| `cpu_always_allocated` | `bool` | `null` | Always allocate CPU. Auto: true (prod) | +|------|------|---------|-------------| +| `container_port` | `number` | `8080` | Listen port | +| `cpu` | `string` | `"1"` | CPU (`"1"`, `"2"`, `"4"`) | +| `memory` | `string` | `"2Gi"` | Memory | +| `min_instance_count` | `number` | `null` | Auto: 2 prod, 1 other | +| `max_instance_count` | `number` | `null` | Auto: 10 prod, 4 other | +| `cpu_always_allocated` | `bool` | `null` | Auto: true prod | | `health_check_path` | `string` | `"/api/v1/health"` | Probe path | | `container_environment` | `list(object)` | `[]` | Additional env vars (user overrides win) | -### Optional, Application +### 13.5. Optional: Application | Name | Type | Default | Description | -| --- | --- | --- | --- | +|------|------|---------|-------------| | `stellar_network` | `string` | `"testnet"` | `mainnet` or `testnet` | | `fund_relayer_id` | `string` | `"channels-fund"` | Fund relayer ID | | `distributed_mode` | `bool` | `true` | Enable distributed queue processing | | `queue_backend` | `string` | `"pubsub"` | `pubsub` (recommended) or `redis` | -| `log_level` | `string` | `"warn"` | Application log level | +| `log_level` | `string` | `"warn"` | App log level | -### Optional, Secrets +### 13.6. Optional: Secrets | Name | Type | Default | Description | -| --- | --- | --- | --- | -| `webhook_signing_key` | `string` | `""` | Webhook signing key (sensitive). Only set it if you use webhook notifications, otherwise omit it. 
| -| `storage_encryption_key` | `string` | `""` | Encrypts data at rest in Redis. Must be base64-encoded 32 bytes (sensitive). Strongly recommended for production. | +|------|------|---------|-------------| +| `webhook_signing_key` | `string` | `""` | Only set if using webhooks | +| `storage_encryption_key` | `string` | `""` | Base64-encoded 32 bytes. Recommended for prod. | -### Optional, Redis +### 13.7. Optional: Redis | Name | Type | Default | Description | -| --- | --- | --- | --- | -| `redis_tier` | `string` | `null` | `BASIC` or `STANDARD_HA`. Auto per environment. | -| `redis_memory_size_gb` | `number` | `null` | Memory in GB. Auto: 5 (prod), 1 (non-prod). | +|------|------|---------|-------------| +| `redis_tier` | `string` | `null` | `BASIC` or `STANDARD_HA` (auto per env) | +| `redis_memory_size_gb` | `number` | `null` | Auto: 5 prod, 1 other | | `redis_version` | `string` | `"REDIS_7_2"` | Redis version | -### Optional, Cloudflare +### 13.8. Optional: Cloudflare | Name | Type | Default | Description | -| --- | --- | --- | --- | -| `enable_cloudflare` | `bool` | `false` | Enable the Cloudflare Workers gateway | +|------|------|---------|-------------| +| `enable_cloudflare` | `bool` | `false` | Enable Workers gateway | | `cloudflare_zone_id` | `string` | `""` | Required when Cloudflare is enabled | | `cloudflare_account_id` | `string` | `""` | Required when Cloudflare is enabled | -| `relayer_static_api_key` | `string` | `""` | Static API key injected by the Worker upstream (sensitive). Use the same value as `relayer_api_key`. | -| `key_salt` | `string` | `""` | Salt for hashing user API keys before storing in KV (sensitive). Generate with `openssl rand -base64 32`. 
| +| `relayer_static_api_key` | `string` | `""` | Static key injected by the Worker (sensitive) | +| `key_salt` | `string` | `""` | Salt for hashing user keys in KV (sensitive) | | `gen_ip_rate_hour` | `number` | `2` | Max `/gen` per IP per hour | | `relay_rpm_per_key` | `number` | `60` | Max relay RPM per key | -### Optional, Load Balancer +### 13.9. Optional: Load Balancer | Name | Type | Default | Description | -| --- | --- | --- | --- | -| `lb_deletion_protection` | `bool` | `null` | Auto: true (prod), false (non-prod) | -| `lb_log_sample_rate` | `number` | `0` | Request log sampling (0 disables it) | +|------|------|---------|-------------| +| `lb_deletion_protection` | `bool` | `null` | Auto: true prod | +| `lb_log_sample_rate` | `number` | `0` | Request log sampling (0 disables) | + +See `variables.tf` for the full list including Cloud Functions and additional networking options. -### Outputs +--- + +## 14. Outputs + +The module exposes these outputs for use in downstream Terraform modules or post-deployment scripts. 
| Name | Description | -| --- | --- | -| `cloud_run_service_name` | Cloud Run service name | -| `cloud_run_service_uri` | Cloud Run service URI (internal) | -| `cloud_run_service_account_email` | Cloud Run service account email | -| `load_balancer_ip` | Global static IP of the HTTPS LB | -| `domain_name` | Service domain name | -| `redis_host` / `redis_port` / `redis_read_endpoint` | Memorystore connection info | -| `pubsub_topics` / `pubsub_subscriptions` | Map of queue names to Pub/Sub resource names | -| `secret_ids` | Map of secret names to Secret Manager IDs | +|------|-------------| +| `cloud_run_service_name` / `cloud_run_service_uri` | Service name and URL | +| `load_balancer_ip` | Static IP for DNS | +| `redis_host` / `redis_port` / `redis_read_endpoint` | Memorystore connection | +| `pubsub_topics` / `pubsub_subscriptions` | Queue resource names | | `kms_key_ring_name` / `kms_signing_key_name` / `kms_signing_key_id` | Cloud KMS key info | | `artifact_registry_repository` / `artifact_registry_url` | Artifact Registry info | +| `secret_ids` | Secret Manager IDs | | `cloudflare_worker_name` | Worker name (null if disabled) | --- -## 12. Known Issues - -Tracked limitations with current workarounds. These are active constraints, not historical bugs. - -### Memorystore Redis TLS - -Transit encryption is disabled because the relayer binary is not compiled with TLS support for Redis connections. This is acceptable because Memorystore is reachable only through Private Service Access (VPC peering), so traffic never leaves Google's network. +## 15. Known Issues -### Secret Manager References +**Redis TLS disabled:** the relayer binary doesn't support TLS for Redis connections. Memorystore is only reachable via Private Service Access (VPC peering), so traffic stays within Google's network. -Secrets are currently passed as plain environment variables to Cloud Run instead of using `secret_key_ref` Secret Manager references. 
This is a workaround for a 0-byte issue hit during the initial deployment. The plan is to switch back to Secret Manager references for a better security posture. +**Secrets as plain env vars:** secrets are passed as Cloud Run env vars rather than Secret Manager `secret_key_ref` references. This is a workaround for a deployment issue. Plan to switch to proper secret references. diff --git a/content/relayer/guides/stellar-relayer-aws-operator-guide.mdx b/content/relayer/guides/stellar-relayer-aws-operator-guide.mdx new file mode 100644 index 00000000..5d9724f5 --- /dev/null +++ b/content/relayer/guides/stellar-relayer-aws-operator-guide.mdx @@ -0,0 +1,1680 @@ +--- +title: 'Hosted Stellar Relayer on AWS: Operator Deployment Guide' +--- + +A step-by-step guide for infrastructure teams (such as Blockdaemon or SDF) deploying a hosted Stellar relayer service that mirrors OpenZeppelin's existing production setup. + +**Audience:** infrastructure operators who have run production AWS workloads but are new to OpenZeppelin's relayer stack. + +**Outcome:** a hosted Stellar Channels service in your own AWS account capable of serving the same workload profile OpenZeppelin currently runs (roughly 2M+ transactions per day across roughly 2,500 relayers). + +For the GCP deployment, see the [GCP Operator Deployment Guide](/relayer/guides/stellar-relayer-gcp-operator-guide). + +--- + +## 1. Overview + +OpenZeppelin currently runs a hosted Stellar relayer service at `channels.openzeppelin.com` (mainnet) and `channels.openzeppelin.com/testnet` (testnet). The service absorbs the operational complexity of parallel Stellar transaction submission (channel-account pool management, fee bumping, sequence-number arbitration, multi-RPC failover) and exposes a simple HTTP API to downstream callers. + +This guide is for infrastructure teams deploying a hosted relayer service for SDF providing the same throughput as OpenZeppelin. 
Blockdaemon is the first such operator; this guide is written to be portable to others. + +### What You Will End Up With + +After following this guide, you will have: + +- A production-ready hosted Stellar Channels service running in your own AWS account, exposed at a domain you control (for example, `channels.your-company.com`). +- An ECS Fargate-backed compute tier with autoscaling, fronted by an Application Load Balancer with TLS 1.3. +- ElastiCache Redis (in production: multi-AZ with failover) for state and rate-limit accounting. +- Eight SQS queues and DLQs handling the distributed transaction-processing pipeline. +- Optional Cloudflare Worker fronting the ALB for self-serve API-key issuance (the `/gen` flow), per-user rate limiting, and usage analytics. +- AWS SSM Parameter Store SecureString entries for every secret. No secrets in environment variables, no secrets in container images. +- **Observability:** CloudWatch Logs and CloudWatch Metrics by default. Optionally, an Amazon Managed Prometheus workspace that remote-writes the same metric set if you operate your own Grafana or alerting stack. +- **Alerting:** CloudWatch Alarms wired to SNS topics that fan out to PagerDuty (or your on-call channel of choice). The module provisions the alarm resources but leaves `alarm_actions` empty by default so you bind the SNS topic ARNs that route to your existing incident pipeline. +- Optional Lambda functions for fund-relayer balance monitoring and ECS auto-restart on alarm. + +The system handles two transaction-submission modes: + +- **Signed XDR mode:** the caller signs a complete Stellar transaction envelope and submits it; the service only handles fee-bumping and submission. +- **Soroban `func` + `auth` mode:** the caller submits a Soroban host function and authorization entries; the service assembles, simulates, signs with a channel account, fee-bumps, and submits. 
+ +### What This Guide Assumes You Already Have + +- Strong AWS infrastructure background (VPC, ECS, ALB, IAM, Route53, ACM, ElastiCache, SQS). +- Terraform fluency (1.5.0 or later). +- A target AWS account where you can create the full resource set, or an account-pair pattern with Route53 in a separate account (cross-account assume-role is supported). +- A domain in Route53 you control. +- (Optional) A Cloudflare account if you want the `/gen` API-key gateway. + +If you are deploying for your own development, or for any other use case with lower throughput, see the upstream Stellar Operator Guide (different audience, different deployment shape). + +--- + +## 2. Architecture + +### Cloud Architecture + +```mermaid +flowchart TD + Callers([Public callers]) + + subgraph Edge["Edge (Cloudflare, optional)"] + Worker["Cloudflare Worker
• /gen + /testnet/gen — issues API keys
• KV-backed auth, hashes with KEY_SALT
• per-IP / per-key rate limits
• rewrites Bearer→static, sets x-consumer-key
• usage tracking via Analytics Engine"] + end + + subgraph AWSEdge["AWS Edge"] + ALB["Application Load Balancer
TLS 1.3 · HTTPS-only · HTTP→HTTPS redirect
ingress restricted to Cloudflare IPs"] + end + + subgraph Compute["Compute"] + Tasks["ECS Fargate Service
relayer container · autoscaling 2..N tasks
health: /api/v1/health · optional CW sidecar"] + end + + subgraph State["Data plane"] + Redis[("ElastiCache Redis
multi-AZ failover")] + SQS[("SQS — 8 queues + DLQs")] + SSM[("SSM Parameter Store
SecureString secrets")] + end + + subgraph Observability["Observability"] + CW["CloudWatch alarms
queue depth · DLQ age · ECS health"] + end + + Stellar([Stellar RPC
Soroban + Horizon]) + ECR[(ECR — image source)] + + Callers --> Worker + Worker -->|"Bearer = static-key
x-consumer-key = user-key"| ALB + ALB --> Tasks + Tasks --> Redis + Tasks --> SQS + Tasks --> SSM + Tasks --> Stellar + ECR -.->|image pull| Tasks + SQS -.-> CW + Tasks -.-> CW +``` + +**Module:** the entire stack above is provisioned by the `relayer-channels` Terraform module in `OpenZeppelin/relayer-channels-infra`. Operators consume it either by cloning the repo (standalone mode) or referencing it as an external module from their own Terraform. + +**Components at a Glance:** + +| Component | AWS resource | Purpose | +| --- | --- | --- | +| Edge gateway | Cloudflare Worker + KV namespace (optional) | API-key issuance, per-key rate limiting, usage tracking, static-key injection upstream | +| Load balancer | Application Load Balancer + ACM cert | TLS termination, HTTPS-only, health-checked routing to Fargate | +| Compute | ECS Fargate Service (`launch_type = "FARGATE"`) | Runs the relayer container (and optional metrics sidecar). Autoscaling by CPU. | +| State | ElastiCache Redis 7.1 replication group, in-transit TLS | Relayer state (transaction records, sequence counters), distributed locks, rate-limit buckets | +| Queue | 8 SQS standard queues + 8 DLQs | Distributed transaction processing (request → submit → status check → notification, etc.) 
| +| Secrets | SSM Parameter Store `SecureString` | `API_KEY`, `PLUGIN_ADMIN_SECRET`, `WEBHOOK_SIGNING_KEY`, `STORAGE_ENCRYPTION_KEY` | +| Observability | CloudWatch Logs + Metrics + (optional) Amazon Managed Prometheus | App logs (JSON format), per-queue depth alarms, optional metrics-remote-write | +| Image registry | ECR Public (module-created) or your own ECR | Container image source for the Fargate task | +| Signing | KMS (out-of-module, operator-provisioned per relayer-side signer config) | ED25519 keys for the fund relayer + channel-account signers | +| Optional monitors | Lambda + EventBridge | Balance-check Lambda; ECS restart-on-alarm Lambda | + +### App Architecture (Channels Plugin Runtime) + +```mermaid +flowchart TD + Client([API Client]) + + subgraph Relayer["Relayer API (openzeppelin-relayer)"] + Auth["Bearer auth (API_KEY from SSM)
+ rate-limit middleware
+ route to plugin"] + end + + subgraph Plugin["Channels Plugin Runtime"] + Pipeline["Submission pipeline
1. Validation — auth entries, payload, scheme
2. ChannelPool — acquire a channel relayer
3. Build + Simulate — assemble Soroban tx
4. Sign + FeeBump — channel signs; fund FeeBumps
5. Submit + Wait — POST to RPC, poll status"] + Mgmt["Management API
setChannelAccounts / listChannelAccounts
setFeeLimit / getFeeUsage / getFeeLimit"] + end + + Redis[("Redis
state")] + SQS[("SQS
jobs")] + Accts[("Fund acct
+ channel accts
(KMS-backed)")] + Stellar([Stellar RPC]) + + Client -->|"POST /api/v1/plugins/channels/call
body: { params: { xdr } } OR { params: { func, auth } }"| Auth + Auth --> Pipeline + Auth --> Mgmt + Pipeline <--> Redis + Pipeline <--> SQS + Mgmt <--> Redis + Pipeline -->|sign| Accts + Accts -->|signed envelope| Stellar + Pipeline -->|submit + poll| Stellar +``` + +**Source references:** +- Relayer API: `openzeppelin-relayer` +- Channels Plugin: `relayer-plugin-channels` (see `src/plugin/` for the runtime, `src/client/` for the TypeScript SDK) +- The Docker image deployed to Fargate is built from `openzeppelin-relayer/examples/channels-plugin-example` + +### Transaction Lifecycle + +End-to-end flow for a Soroban `func` + `auth` submission through the hosted service. + +```mermaid +sequenceDiagram + autonumber + actor Caller + participant CF as CF Worker + participant ALB + participant API as Relayer API + participant Plugin as Channels Plugin + participant Redis + participant SQS + participant KMS + participant RPC as Soroban RPC + + Caller->>CF: POST / · Bearer user-key + CF->>CF: hash + KV lookup
+ scope check + CF->>ALB: rewrite Bearer→static-key
set x-consumer-key=user-key + ALB->>API: TLS terminate · forward + API->>Plugin: route /plugins/channels/call + Plugin->>Redis: check fee budget + Plugin->>Redis: persist tx record + Plugin->>SQS: enqueue transaction-request + Plugin-->>Caller: 202 Accepted + tx_id + + rect rgba(200, 220, 255, 0.4) + Note over Plugin,RPC: Async worker pickup (after 202 returns) + Plugin->>Redis: acquire channel account + Plugin->>RPC: build + simulate tx + RPC-->>Plugin: assembled envelope + Plugin->>KMS: sign w/ channel signer + KMS-->>Plugin: signature + Plugin->>KMS: fee-bump w/ fund signer + KMS-->>Plugin: fee-bumped envelope + Plugin->>RPC: submit signed envelope + RPC-->>Plugin: submitted (no hash yet) + Plugin->>SQS: enqueue status-check-stellar + + loop until confirmed or expired + Plugin->>RPC: GET tx by hash + RPC-->>Plugin: pending / confirmed + end + + Plugin->>Redis: update tx record → confirmed + Plugin->>SQS: enqueue notification + end + + Plugin->>Caller: webhook (signed with WEBHOOK_SIGNING_KEY) +``` + +**What Each Stage Costs:** + +| Stage | Latency contributors | +| --- | --- | +| CF Worker auth | KV lookup (~10ms) + sha256 hash | +| ALB to Fargate | TLS termination + intra-VPC hop (~1-5ms) | +| Validation | Redis lookups for fee budget (~1ms each) | +| Channel acquire | Redis distributed lock (~1ms; queue wait if pool exhausted) | +| Build + simulate | Soroban RPC `simulateTransaction` (~50-200ms) | +| Sign + fee-bump | KMS `Sign` × 2 (channel + fund) (~10-50ms each, region-local) | +| Submit | Soroban RPC `sendTransaction` (~10-100ms) | +| Status check | Per-poll RPC call (~10-50ms); ledger settlement adds ~5s base | + +The 202 response is returned synchronously; the rest happens asynchronously via SQS workers. Status is queryable any time via `oz-relayer tx show `. 
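
The per-stage figures in the table above can be combined into a rough budget for the asynchronous path. This is a sketch only: it assumes the stages run back to back, that exactly one status poll is needed, and that ledger settlement costs the ~5s base figure. The helper name `sum_ms` is hypothetical.

```bash
# Sum a list of millisecond values.
sum_ms() {
  local total=0 v
  for v in "$@"; do
    total=$(( total + v ))
  done
  echo "$total"
}

# Stage order: simulate, sign (channel), sign (fund), submit, one poll, settlement.
best_case=$(sum_ms 50 10 10 10 10 5000)
worst_case=$(sum_ms 200 50 50 100 50 5000)

echo "async path: ${best_case}-${worst_case} ms"
```

Under these assumptions, ledger settlement dominates the total, so RPC and KMS latency matter mostly for worker throughput rather than for caller-visible confirmation time.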
+ +### Capacity Profile + +OpenZeppelin's reference deployment sustains **~3M transactions per day and growing**, served by **~1,000 relayers** (fund and channel-account entities combined). Two recent windows: + +| Window | Total tx (7d) | Daily avg | Sustained tx/s | Peak day | Peak tx/s | +| --- | --- | --- | --- | --- | --- | +| Apr 28 – May 4 | 19.19M | 2.74M | ~31.7 | May 4 (3.90M) | ~45 | +| **May 5 – May 11** | **20.88M (+8.8% WoW)** | **2.98M** | **~34.5** | **May 8 (3.67M)** | **~42.5** | + +The deployment is **trending up** week over week (+8.8% in the most recent window) and absorbs daily peaks **roughly 23–42% above the 7-day average** in the two windows shown (May 8 and May 4 respectively). Plan for headroom; autoscaling minimums should comfortably cover the **peak day**, not the average. + +**Traffic concentration.** In both windows, **more than 99% of all transactions terminate at a small set of high-volume Soroban contracts** registered in `LIMITED_CONTRACTS`. The top contract alone accounts for **73–97% of daily volume** depending on its onchain phase, with the second contributing most of the remainder. Non-limited contracts are below sampling resolution (0.2% or less). This concentration is what the contract-capacity-ratio knob is sized against. The full env-var tuning is in section 6. 
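
The per-second columns in the table are daily totals divided by 86,400 seconds. A minimal sketch of that conversion follows, with an optional headroom percentage for sizing. The function name `per_second` and the 30% headroom value are assumptions for illustration, not module settings.

```bash
# Convert a daily transaction count into an average per-second rate.
# Optional second argument: headroom percentage to add on top (default 0).
per_second() {
  awk -v daily="$1" -v pct="${2:-0}" \
    'BEGIN { printf "%.1f\n", daily / 86400 * (1 + pct / 100) }'
}

per_second 2980000      # daily average, May 5 to May 11 window
per_second 3670000      # peak day in that window (May 8)
per_second 3670000 30   # peak day plus a hypothetical 30% headroom
```

Because these are day-level averages, bursts within a day will run higher than the computed rate, which is one more reason to size autoscaling minimums against the peak day.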
+ +**The Terraform module defaults are sized for `environment = "prod"` workloads but tuned conservatively.** For reference, here is the actual production configuration OpenZeppelin runs at this scale (sanitized): + +| Resource | Module default (per `relayer-channels-infra`) | OpenZeppelin production (actual resource capacity) | +| --- | --- | --- | +| ECS task CPU | 1024 (1 vCPU) | **8192 (8 vCPU)** | +| ECS task memory | 2048 MiB | **16384 MiB** | +| ECS desired count | 2 | **11** | +| ECS autoscaling min | 2 | **11** | +| ECS autoscaling max | 10 | **25** | +| Container CPU (within task) | 1024 | **6144** (rest reserved for sidecars) | +| Container memory (within task) | 2048 | **9216** | +| Redis node type | `cache.t4g.medium` (non-prod) / `cache.r7g.large` (prod) | `cache.r7g.large` family (multi-AZ failover) | +| Redis pool max size | 500 | **3000** | +| Redis reader pool max size | 1000 | **3000** | +| Max connections (relayer) | 256 | **4000** | +| Rate limit (req/sec) | 100 | **400** | +| Rate limit burst | 300 | **500** | + +The module defaults are operationally fine for a new deployment ramping up; expect to grow into something closer to the production shape as your workload approaches the OpenZeppelin scale. The full env-var tuning is in section 6. + +**Sidecar pattern:** OpenZeppelin runs a `cloudwatch-exporter` sidecar in every task that scrapes `:8081/debug/metrics/scrape` and pushes Prometheus metrics into CloudWatch under namespaces like `RelayerChannelsMainnetTransactions`. The module exposes this as an optional feature via `enable_cloudwatch_exporter` and `cloudwatch_exporter_image`. + + +Traffic figures above are drawn from internal traffic analysis covering Apr 28 – May 11 on the channels-fund mainnet deployment (CloudWatch + per-contract instrumentation). Refresh this section quarterly or whenever a new WoW snapshot is generated. 
+ + +### SQS Queue Topology + +The relayer's distributed processing layer is eight SQS standard queues, each backed by a Dead Letter Queue. Producers, consumers, and DLQ relationships: + +```mermaid +flowchart TD + subgraph Producers["Producers"] + APIReq[API request] + WorkerCb[Worker callback] + Cron[Cron sweep] + Health[Health probe] + end + + subgraph MainQueues["8 SQS main queues — per-queue tuning"] + Q1["transaction-request
vis 300s · max-recv 6"] + Q2["transaction-submission
vis 120s · max-recv 2 ⚠"] + Q3["status-check
vis 300s · max-recv 1000"] + Q4["status-check-evm
vis 300s · max-recv 1000"] + Q5["status-check-stellar
vis 300s · max-recv 1000"] + Q6["notification
vis 180s · max-recv 6"] + Q7["token-swap-request
vis 300s · max-recv 6"] + Q8["relayer-health-check
vis 300s · max-recv 6"] + end + + Workers["ECS Fargate workers
One pool per JobType
Concurrency: BACKGROUND_WORKER_*_CONCURRENCY"] + + DLQs[("8 Dead Letter Queues
7-day retention
Inspect: aws sqs receive-message
Re-drive: aws sqs start-message-move-task")] + + Alarms["CloudWatch Alarms (3 per queue, 24 total)
• <prefix>-<queue>-high-depth (5k or 10k)
• <prefix>-<queue>-dlq-messages (>100)
• <prefix>-<queue>-old-messages (vis × 3)
alarm_actions=[] by default — wire to SNS/PD"] + + Producers --> MainQueues + MainQueues -->|normal consume| Workers + MainQueues -. exceeded max-recv .-> DLQs + Workers -. enqueue follow-up .-> MainQueues + MainQueues -.-> Alarms + DLQs -.-> Alarms +``` + +**Why the Per-Queue Tuning Matters:** + +- `transaction-submission` has `max-recv = 2` because a failed RPC submission should not retry indefinitely (retrying a maybe-submitted tx risks double-spend semantics on the chain). Two attempts, then DLQ for human inspection. +- `status-check-*` queues have `max-recv = 1000` because status polling is *expected* to retry many times before a transaction confirms. Long ledger settlement means many poll attempts. A DLQ entry here means the tx never confirmed despite ~1000 polling rounds. +- Other queues sit at `max-recv = 6`, a reasonable default for retriable transient failures. + +--- + +## 3. Prerequisites + +### Accounts and Access + +- **AWS account** with permissions to create: ECS clusters/services, ECR Public repositories, Application Load Balancers, ACM certificates, Route53 records, ElastiCache replication groups, SQS queues, IAM roles/policies, SSM Parameter Store, CloudWatch Logs + Metrics + Alarms, Lambda functions, EventBridge rules, Amazon Managed Prometheus workspaces. (Cross-account variants supported via assume-role; see section 5.) +- **Route53 hosted zone** for the domain you want to serve from (for example, `channels.your-company.com`). The zone can live in the same account or in a different one (cross-account assume-role). +- **(Optional) Cloudflare account** with a zone matching your domain. Required if you want the `/gen` API-key flow, per-IP/per-key rate limiting at the edge, and Cloudflare Analytics-Engine-backed usage tracking. +- **GitHub account** with access to the three reference repositories listed below. You do not need to fork any of them; you can consume Terraform modules directly via the `source` block. 
+ +### Tooling + +| Tool | Version | Why | +| --- | --- | --- | +| Terraform | ≥ 1.5.0 | Module language constraints | +| AWS provider | < 6.0.0 | Pinned in `versions.tf` | +| Cloudflare provider | ~> 5.0 | Required as a provider even when `enable_cloudflare = false` (Terraform constraint) | +| Docker | recent stable | If building the container image rather than consuming the published one | +| AWS CLI v2 | recent stable | For ECR login, SSM updates, manual debugging | +| Node.js ≥ 18 + pnpm ≥ 10 | recent stable | If you intend to modify the Channels plugin (uncommon; most operators consume the published npm package) | + +### Stellar-Side Prerequisites + +- **Soroban RPC access:** at least two independent providers for mainnet (for example, Stellar Foundation plus a commercial provider). The module does not provision RPC; you configure RPC URLs at the relayer configuration layer. +- **KMS keys for signers:** for production, you will create one AWS KMS key per fund relayer (ED25519 key spec, asymmetric sign). The **fund relayer** is the Stellar account that signs and pays for the **fee-bump envelope** wrapping every channel-account submission: channel signers sign the inner transaction with their own keys, and the fund relayer's fee-bump signature is what commits XLM to confirm the bundle onchain. So every successful submission consumes (a) one channel-signer signature and (b) one fund-relayer signature plus its inclusion fee. Channel-account signers may use the encrypted local keystore pattern (development-only); for production, both should be KMS-backed. See Section 9 for the full security framing. +- **Initial XLM funding:** bootstrapping happens in two explicit steps: + 1. **Fund the fund relayer's Stellar account.** On **mainnet**, this is a manual one-time top-up sent from your treasury or an exchange to the fund relayer's address. On **testnet**, fund it via Friendbot. + 2. 
**Bootstrap channel accounts from that balance.** `oz-channels bootstrap --to N` creates `N` channel accounts and sends `--starting-balance` XLM (default **2 XLM**) to each, drawing from the fund relayer. + + **Sizing the fund-relayer balance:** + + - **Provisioning (one-time):** `2 XLM × N` channel accounts. A 1,000-channel pool requires **at minimum 2,000 XLM** in the fund relayer before `bootstrap` can run. + - **Operating buffer (ongoing):** covers fee bumps for live traffic. At ~34 tx/s sustained (per section 2) and a 100-stroop base fee, a multi-day buffer typically runs in the range of tens to hundreds of XLM depending on congestion-driven fee multipliers. Top up via the balance-monitoring Lambda described in section 7. + +### Reference Repositories + +You will refer to three repositories during deployment: + +| Repo | Role | Visibility | +| --- | --- | --- | +| `OpenZeppelin/relayer-channels-infra` | Primary Terraform modules | Public | +| `OpenZeppelin/openzeppelin-relayer`, `examples/channels-plugin-example` | The example used to build the Docker image | Public | +| `OpenZeppelin/relayer-plugin-channels` | The Channels plugin runtime (TypeScript) | Public | + +--- + +## 4. Environments + +OpenZeppelin's reference deployment uses three environments. We recommend operators mirror this separation: + +| Environment | Stellar network | AWS profile pattern | ECS cluster | Log group | +| --- | --- | --- | --- | --- | +| `prod-mainnet` | Stellar Mainnet | Production AWS account | `relayer-channels-prod-mainnet-cluster` | `/aws/ecs/relayer-channels-prod-mainnet/task` | +| `prod-testnet` | Stellar Testnet | Production AWS account | `relayer-channels-prod-testnet-cluster` | `/aws/ecs/relayer-channels-prod-testnet/task` | +| `stg` | Stellar Testnet | Staging AWS account | `relayer-channels-stg-cluster` | `/aws/ecs/relayer-channels-stg/task` | + +The cluster and log-group naming is auto-derived by the module from `app_name` + `environment`. 
When `environment = "prod"` the resource-name suffix is dropped; for other environments, names are suffixed with `-`. + +### Configuration Shape + +Operators typically maintain a small structured config that maps environment names to their AWS profile, region, ECS cluster, log group, and Stellar Horizon endpoint. This avoids hard-coding values in operational scripts and CI/CD. A reasonable shape: + +```yaml +environments: +prod-mainnet: +aws_profile: +aws_region: us-east-1 +ecs_cluster: relayer-channels-prod-mainnet-cluster +log_group: /aws/ecs/relayer-channels-prod-mainnet/task +horizon: https://horizon.stellar.org +stellar_network: mainnet +relayer_id: channels-fund +prod-testnet: +aws_profile: +aws_region: us-east-1 +ecs_cluster: relayer-channels-prod-testnet-cluster +log_group: /aws/ecs/relayer-channels-prod-testnet/task +horizon: https://horizon-testnet.stellar.org +stellar_network: testnet +relayer_id: channels-fund +stg: +aws_profile: +aws_region: us-east-1 +ecs_cluster: relayer-channels-stg-cluster +log_group: /aws/ecs/relayer-channels-stg/task +horizon: https://horizon-testnet.stellar.org +stellar_network: testnet +relayer_id: channels-fund +``` + +This same shape is consumed by the operator CLIs (`oz-relayer`, `oz-channels`) described in section 7; they read profile config from `~/.config/oz-relayer/config.yaml` and `~/.config/oz-channels/config.yaml`. + +--- + +## 5. Step-by-Step Deployment + +This section walks the happy-path deployment using the Terraform module in standalone mode. After the first deploy, day-2 operations are described in section 7. + +### Step 5.1: Clone the Repository + +```bash +git clone https://github.com/OpenZeppelin/relayer-channels-infra.git +cd relayer-channels-infra +``` + + +Cloning the repo is useful for exploring the code and contributing. You can also consume the Terraform module directly without cloning, by referencing it via the `source` block from your own Terraform configuration. 
+ + +The repo layout: + +``` +relayer-channels-infra/ +├── main.tf # Root module — instantiates the relayer-channels submodule +├── variables.tf # Root variables +├── outputs.tf # Root outputs +├── versions.tf # Terraform + provider versions +├── terraform.tfvars.example # Annotated example tfvars +└── modules/ + └── relayer-channels/ # The actual module + ├── main.tf + ├── ecs.tf + ├── sqs.tf + ├── redis.tf + ├── cloudflare.tf + ├── dns.tf + ├── lambda.tf + ├── prometheus.tf + ├── worker.mjs # Cloudflare Worker source + ├── relayer_balance.mjs # Balance-check Lambda source + └── restart_ecs_on_alarm.mjs # ECS-restart Lambda source +``` + +### Step 5.2: Configure Terraform Backend + +In `versions.tf`, uncomment and edit the `backend "s3"` block so Terraform state is stored remotely (do not store state on a laptop in production). Example: + +```hcl +terraform { + required_version = ">= 1.5.0" + + backend "s3" { + bucket = "your-org-terraform-state" + key = "relayer-channels/prod-mainnet.tfstate" + region = "us-east-1" + dynamodb_table = "your-org-terraform-locks" + encrypt = true + } +} +``` + +Initialize: + +```bash +terraform init +``` + +### Step 5.3: Create Your tfvars + +Copy the example: + +```bash +cp terraform.tfvars.example terraform.tfvars +``` + +The example file (annotated) covers the full surface. The minimum set required for a first standalone deploy: + +```hcl +aws_region = "us-east-1" +environment = "prod-mainnet" # or "prod-testnet" / "stg" +vpc_id = "vpc-XXXXXXXXXXXXXXXXX" +vpc_cidr = "172.31.0.0/16" +public_subnet_ids = ["subnet-AAA", "subnet-BBB"] # 2+ AZs required + +domain_name = "channels.your-company.com" +route53_zone_name = "your-company.com" # OR route53_zone_id + +# Container image — either OpenZeppelin's published image or your own ECR +container_image = "public.ecr.aws//openzeppelin-relayer-channels:mainnet-1.4.2" # look up via the ECR Public Gallery + +# Secrets — never commit these. 
Set via TF_VAR_ or a secrets-managed CI pipeline +relayer_api_key = "" # required — set via TF_VAR_relayer_api_key +channels_admin_secret = "" # required — set via TF_VAR_channels_admin_secret + +# Stellar +stellar_network = "mainnet" +fund_relayer_id = "channels-fund" +distributed_mode = true # production: true; backed by SQS +log_level = "warn" +``` + +Pass secrets as environment variables to avoid them ever touching the working directory: + +```bash +export TF_VAR_relayer_api_key="$(openssl rand -hex 32)" # ≥ 32 chars; relayer enforces minimum length +export TF_VAR_channels_admin_secret="$(openssl rand -hex 32)" # admin secret for management API +export TF_VAR_webhook_signing_key="$(openssl rand -hex 32)" # if using webhooks +export TF_VAR_storage_encryption_key="$(openssl rand -hex 32)" # for at-rest encryption in Redis +``` + +### Step 5.4: Decide on the Container Image Strategy + +Two options: + +**Option A: Consume OpenZeppelin's Published Image (recommended).** OpenZeppelin publishes pre-built images to ECR Public at `public.ecr.aws//openzeppelin-relayer-channels:` (look up the live alias from the ECR Public Gallery). The image bundles `openzeppelin-relayer` compiled from a pinned `main` revision with the `@openzeppelin/relayer-plugin-channels` package, runs as `nonroot` (UID 65532) on a Wolfi base, and ships with public Stellar RPC endpoints baked in (no secrets, no paid-RPC URLs). It is the same image OpenZeppelin runs in production. + +Tag scheme: + +| Tag pattern | Points at | +| --- | --- | +| `mainnet-` (for example `mainnet-1.4.2`) | Stellar mainnet build, pinned. **Use this in production.** | +| `mainnet-latest` | Most recent mainnet build. Convenient for dev; will move under you. | +| `testnet-` / `testnet-latest` | Stellar testnet equivalents. | + +```hcl +container_image = "public.ecr.aws//openzeppelin-relayer-channels:mainnet-1.4.2" +``` + +Build provenance and SBOM attestations are attached to every push. 
Verify with: + +```bash +docker buildx imagetools inspect \ + public.ecr.aws//openzeppelin-relayer-channels:mainnet-1.4.2 \ + --format '{{ json .Provenance }}' +``` + +The published image is built and pushed by OpenZeppelin's internal CI pipeline (one workflow per network: mainnet and testnet). The pipeline refuses to publish if any private-RPC pattern is detected in the baked `stellar.json`, which is the guardrail that makes the public image safe for downstream operators to consume. You don't need access to that pipeline to deploy; the published image at `public.ecr.aws//openzeppelin-relayer-channels:` is the contract. + +**Option B: Build Your Own Image** from the example and publish to your own ECR. Leave `container_image = ""` in tfvars; the module will create an ECR Public repository for you. Build steps mirror what the OpenZeppelin workflows do: + +```bash +git clone https://github.com/OpenZeppelin/openzeppelin-relayer.git +cd openzeppelin-relayer/examples/channels-plugin-example +# Build the Channels plugin TypeScript wrapper +(cd channel && pnpm install && pnpm run build) +# Build the Docker image +docker build -t my-relayer-channels:latest -f ../../Dockerfile.production . +``` + +You can run this build locally as shown, or wire it into the CI tool of your choice (GitHub Actions, GitLab CI, CircleCI, Buildkite, etc.). The build itself is a standard `docker build`; no OpenZeppelin-specific tooling is required to reproduce it. + +**What the public image baked-in config does NOT include** (you provide at runtime via mounted `config.json` and env vars): + +- Relayer definitions (`relayers[]`) +- Signer keys +- Redis (state and optional queue backend) +- SQS queues (if running in distributed mode) +- IAM roles for KMS / SQS / Secrets Manager / CloudWatch + +To override the baked public-RPC endpoints with paid/private RPCs, mount your own `stellar.json` at `/app/config/networks/stellar.json`. 
The file format mirrors the default: one entry per Stellar network (`mainnet`, `testnet`), each with a `rpc_urls[]` array of `{ url, weight }` objects. The relayer load-balances across the listed URLs by weight and rotates on failures (see the `RPC_*` and `PROVIDER_*` env vars in section 6 for failover tuning). + +After the first `terraform apply` (next step), push to the module-created ECR: + +```bash +aws ecr-public get-login-password --region us-east-1 \ + | docker login --username AWS --password-stdin public.ecr.aws + +# Tag using the ECR URL output by terraform +ECR_URL=$(terraform output -raw ecr_repository_url) +docker tag my-relayer-channels:latest "$ECR_URL:latest" +docker push "$ECR_URL:latest" +``` + +### Step 5.5: Plan and Apply + +```bash +terraform plan -out plan.tfplan +terraform apply plan.tfplan +``` + +Initial apply takes ~15–20 minutes (ElastiCache provisioning is the slowest leg; the ALB and ACM cert validation also take a few minutes each). + +Outputs include: + +| Output | Used for | +| --- | --- | +| `ecs_cluster_name` | Manual ECS Exec, AWS CLI scripting | +| `ecs_service_name` | Service updates, manual restart | +| `ecr_repository_url` | Container image pushes (Option B) | +| `alb_dns_name` | Direct ALB access if Cloudflare is disabled | +| `domain_name` | Public service URL | +| `redis_primary_endpoint`, `redis_reader_endpoint` | Manual Redis CLI access (via bastion) | +| `sqs_queue_urls` | Map of queue names to URLs for direct SQS inspection | +| `prometheus_endpoint` | If AMP enabled, the remote-write endpoint | +| `ssm_parameter_prefix` | Path prefix for SSM secret manipulation | + +### Step 5.6: Verify the Service Is Up + +```bash +# Health check +curl -sS https://channels.your-company.com/api/v1/health +# Expect: 200 OK with body "OK" + +# Readiness — checks Redis, queue, plugin +curl -sS https://channels.your-company.com/api/v1/ready +# Expect: 200 with JSON { "ready": true, "status": "healthy", ... 
} +``` + +If either fails, the same checks are available via CLI or the AWS Console: + +- **ECS service events:** + - CLI: `aws ecs describe-services --cluster --services ` + - Console: **ECS → Clusters → `` → Services → `` → Events**. +- **Task logs:** + - CLI: `aws logs tail "/aws/ecs/-/task" --since 10m --follow` + - Console: **CloudWatch → Log groups → `/aws/ecs/-/task` → Live tail**. +- **Target health (ALB):** + - Console: **EC2 → Target groups → `` → Targets**. Useful when health checks fail but tasks look running. + +The most common first-deploy failures are: secrets not set (relayer panics with `Security error: API_KEY must be at least 32 characters long`), Redis not yet healthy (wait ~5 more minutes), or ACM cert validation lag (verify the Route53 alias was created). + +### Step 5.7: Enable Cloudflare (If Using the `/gen` Flow) + +If you set `enable_cloudflare = true` in step 5.3, the module provisions: + +- A KV namespace named `-api-keys` for storing hashed user API keys +- A Cloudflare Worker named `-gateway` running `worker.mjs` +- A Workers route binding to `${domain_name}/*` so all traffic to your domain transits the Worker +- A 100%-traffic deployment strategy (no Workers-level canary) + +The Worker exposes: + +| Path | Method | Behavior | +| --- | --- | --- | +| `/gen` | GET | Issues a mainnet API key. Salt+hashes with `KEY_SALT`, stores in KV with 365d TTL, returns raw key once | +| `/testnet/gen` | GET | Same for testnet scope | +| `/` (POST) | POST | Proxies to relayer's `/api/v1/plugins/channels/call` (mainnet path) | +| `/testnet` (POST) | POST | Proxies to relayer's `/testnet/api/v1/plugins/channels/call` | +| `/usage/me` | GET | Queries Cloudflare Analytics Engine for the caller's usage | + +The Worker injects authentication headers upstream: the upstream `Bearer` token becomes the static API key (`RELAYER_STATIC_API_KEY`), while the user's original key becomes `x-consumer-key`. 
This is why the ECS module sets `API_KEY_HEADER=x-consumer-key`; the relayer is told to use that header for per-user fee tracking inside the Channels plugin. + +Rate limits (defaults; tunable via tfvars): + +| Variable | Default | Meaning | +| --- | --- | --- | +| `gen_ip_rate_hour` | 2 | Max `/gen` requests per IP per hour (anti-abuse) | +| `relay_rpm_per_key` | 60 | Max relay POSTs per minute per user key | + +The Worker source (`worker.mjs`) is identical between the public module and the internal OpenZeppelin deployment: same KV-backed auth, same header rewrites, same usage tracking via Cloudflare Analytics Engine. The module defaults for the rate limits (`gen_ip_rate_hour=2`, `relay_rpm_per_key=60`) are reasonable conservative starting points; tune to your traffic profile. + +**Worker Auth-Rewrite: What Actually Happens to Headers:** + +This is the non-obvious part of the Worker. Per-caller API keys go in, a single static API key goes out, and the original caller identity is carried in a different header for downstream fee tracking. + +```mermaid +flowchart TD + ClientReq["Client request
Authorization: Bearer <user-key>
POST / (or /testnet)"] + + subgraph Worker["Cloudflare Worker · worker.mjs"] + direction TB + Hash["1. Hash user key
keyHash = sha256(KEY_SALT : user-key)"] + KV{"2. KV lookup
env.API_KEYS.get('key:' + keyHash)"} + Auth401(["401 Unauthorized"]) + Rewrite["3. Rewrite headers
Authorization: Bearer <RELAYER_STATIC_API_KEY>
x-consumer-key: <user-key>"] + Path["4. Rewrite path
POST / → /api/v1/plugins/channels/call
POST /testnet → /testnet/api/v1/plugins/channels/call"] + Track["5. Track usage
Analytics Engine: indexes=keyHash · blobs=path,host · doubles=1"] + end + + Upstream["Upstream — Relayer
• Authorization: Bearer <static-key>
  → authenticates against API_KEY env
• x-consumer-key: <user-key>
  → plugin reads via API_KEY_HEADER
  → drives FEE_LIMIT per-caller tracking"] + + ClientReq --> Hash + Hash --> KV + KV -->|not found / inactive / wrong scope| Auth401 + KV -->|valid| Rewrite + Rewrite --> Path + Path --> Track + Track --> Upstream +``` + +**Operational consequence:** the upstream relayer never sees user-supplied keys directly. A compromised user key only compromises that user's quota; it cannot escalate to relayer-level admin operations because those require the static key (which only the Worker holds). + +### Step 5.8: Bootstrap the Channel-Account Pool + +The module deploys the *infrastructure* but does not provision the *channel accounts* (Stellar entities). For this you use the `oz-channels` CLI included in the `cli/` directory of this repo. + +Install the CLI: + +```bash +# From the root of this repo +cd cli +bun install +bun run build + +# Link the CLIs globally +cd packages/oz-channels && bun link +cd ../oz-relayer && bun link + +# Verify +oz-channels --help +oz-relayer --help +``` + + +Requires the [Bun](https://bun.sh) runtime (Node.js 22+ compatible). + + +Set up a profile for your environment: + +```bash +oz-channels profile init prod-mainnet +# Prompts for: URL (your channels.your-company.com), API key, plugin ID (channels), +# admin secret, network (mainnet), test account +``` + +Then provision the pool. Start small on testnet to validate, then scale on mainnet: + +```bash +# Preview (no changes) +oz-channels bootstrap --to 200 --dry-run -p prod-mainnet + +# Provision +oz-channels bootstrap --to 200 -p prod-mainnet +``` + +The bootstrap workflow runs three phases: + +1. **Preflight audit** (parallel, configurable concurrency): checks each slot's signer existence, relayer existence, and onchain funding. +2. **Provisioning** (sequential): creates signers and relayers via the relayer's management API; tolerates 409s if records already exist. +3. 
**Funding** (sequential): submits funding transactions through the fund relayer using a competitive fee from Horizon `/fee_stats`; tolerates `op_already_exists`. + +After all three phases complete, the bootstrap merges the new accounts into the Channels plugin's pool via `setChannelAccounts`. + +**Workflow at a Glance:** + +```mermaid +flowchart TD + Start(["oz-channels bootstrap --to N -p <env>"]) + + P1["Phase 1: Preflight Audit (parallel)
For each slot 1..N:
• signer exists? (relayer API)
• relayer exists? (relayer API)
• on-chain funded? (Horizon)

Concurrency: --concurrency (default 10)
Gap detection across slot sequence"] + + P2["Phase 2: Provisioning (sequential)
For each slot missing signer/relayer:
• Create signer (random keypair)
• Create relayer pointing to signer

Idempotent: 409 Conflict tolerated
Delay between ops: --delay-ms (100ms)"] + + P3["Phase 3: Funding (sequential)
For each unfunded slot:
• GET Horizon /fee_stats for live fee
• Submit funding tx via --funding-relayer
  (default: channels-fund)
• --starting-balance XLM per account
  (default: 2)

Idempotent: op_already_exists tolerated"] + + Final["Final: setChannelAccounts
Merge new IDs into plugin's active pool
via management API.

Pool now ready for func+auth submissions."] + + AuditStop(["--audit stops here"]) + DryStop(["--dry-run stops here"]) + + Start --> P1 + P1 --> AuditStop + P1 --> DryStop + P1 --> P2 + P2 --> P3 + P3 --> Final + + Modes["Mode flags
--dry-run · preview only
--audit · report issues only
--allow-gaps · skip gap-detection
--verbose · per-slot output"] +``` + +**Production sizing reference:** the reference deployment runs ~1,000 relayers total. For a fresh deploy that needs to handle the full reference load, bootstrap several hundred channel accounts. For lower-load deployments, start with 50–100 and scale incrementally. + +#### Scaling Beyond ~100 Channels + +When scaling the pool aggressively (e.g. 100 → 1000 channels), `oz-channels bootstrap` will start failing with `TRY_AGAIN_LATER` or `tx_bad_seq` errors from Horizon. This happens because every `createAccount` operation uses the fund relayer (`channels-fund`) as the transaction source, serializing all submissions on a single sequence number. Under high concurrency, Horizon rejects the overlapping submissions. + +Use `scripts/fund-new-channels.ts` instead; it routes the transaction source through an existing funded channel account (e.g. `channel-0001`) while keeping the fund relayer as the operation source (so the treasury still pays). It also batches up to 100 `createAccount` ops per transaction, so a 100→1000 scale-up fits in ~9 submissions. + +```bash +npx tsx scripts/fund-new-channels.ts \ + --env mainnet \ + --api-key \ + --source-relayer channel-0001 \ + --fund-relayer channels-fund \ + --from 101 --to 1000 \ + --starting-balance 2 \ + --report fund-report.json +``` + +The script is idempotent; it preflights every slot via the relayer API and Horizon, skipping any account already funded onchain. Safe to re-run. 
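The "~9 submissions" figure for the 100→1000 scale-up is plain ceiling division, and the same arithmetic tells you how much XLM the fund relayer pays in starting balances. The sketch below is for planning only: `batches_needed` and `xlm_needed` are hypothetical helper names, not part of `scripts/fund-new-channels.ts`, and the defaults assume 100 `createAccount` operations per transaction and a 2 XLM starting balance, as in the command above. Transaction fees are not included.

```bash
#!/usr/bin/env bash
# Hypothetical planning helpers; not shipped with the repo's scripts.

# Number of funding transactions needed for slots FROM..TO (inclusive)
# when each transaction carries at most OPS createAccount operations.
batches_needed() {
  local from="$1" to="$2" ops="${3:-100}"
  local count=$(( to - from + 1 ))
  echo $(( (count + ops - 1) / ops ))   # integer ceiling division
}

# Total XLM paid out as starting balances (fees excluded).
xlm_needed() {
  local from="$1" to="$2" balance="${3:-2}"
  echo $(( (to - from + 1) * balance ))
}

batches_needed 101 1000 100   # prints 9
xlm_needed 101 1000 2         # prints 1800
```

For slots 101 through 1000 this gives 9 transactions and 1,800 XLM in starting balances, so confirm the fund relayer holds at least that much (plus fees) before running the script.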
+ +#### Gap Detection + +Gap-detection guards against accidentally provisioning a sparse pool: + +```bash +# Will error if slots 11–19 don't exist +oz-channels bootstrap --from 20 --to 25 -p prod-mainnet +# Error: Gap detected in slot sequence: 11-19 + +# Override only if intentional +oz-channels bootstrap --from 20 --to 25 --allow-gaps -p prod-mainnet +``` + +### Step 5.9: Verify End-to-End + +Once channels are provisioned and registered, run a smoke test: + +```bash +oz-channels smoke setup -p prod-mainnet # Deploys a smoke contract (testnet only; mainnet uses bundled contract) +oz-channels smoke run -p prod-mainnet # Submits real transactions through the pool +``` + +Tests covered include both `xdr` and `func + auth` modes against the deployed pool. Test failures here indicate misconfiguration between the infra layer (Terraform) and the application layer (relayer config + plugin config). + +--- + +## 6. Configuration Reference + +### Module-Managed Container Environment Variables + +These are set inside the ECS task definition by the Terraform module and should not be overridden unless you have a specific reason: + +| Env var | Set to | Source | +| --- | --- | --- | +| `HOST` | `0.0.0.0` | Module | +| `STELLAR_NETWORK` | `var.stellar_network` (`mainnet` or `testnet`) | Module | +| `FUND_RELAYER_ID` | `var.fund_relayer_id` (default `channels-fund`) | Module | +| `API_KEY_HEADER` | `x-consumer-key` | Module: keyed to Cloudflare Worker rewriting | +| `REPOSITORY_STORAGE_TYPE` | `redis` | Module: required for production | +| `RESET_STORAGE_ON_START` | `false` | Module | +| `METRICS_ENABLED` | `true` | Module | +| `METRICS_PORT` | `8081` | Module | +| `LOG_FORMAT` | `json` | Module | +| `LOG_LEVEL` | `var.log_level` (default `warn`) | Module | +| `REDIS_URL` | `redis://:6379` | Module: derived from ElastiCache output | +| `REDIS_READER_URL` | `redis://:6379` | Module: read/write split for ElastiCache | +| `AWS_REGION` | Module-derived | Module | +| `AWS_ACCOUNT_ID` | 
Module-derived | Module | +| `DISTRIBUTED_MODE` | `var.distributed_mode` (default `true`) | Module | +| `QUEUE_BACKEND` | `sqs` (when distributed) or `memory` | Module | +| `SQS_QUEUE_URL_PREFIX` | `https://sqs..amazonaws.com//-` | Module | + +### Optional Module-Managed Env Vars + +| Env var | Activated by | Purpose | +| --- | --- | --- | +| `ALLOWED_FUND_RELAYER_IDS` | `var.allowed_fund_relayer_ids` non-empty | Per-request fund-relayer override (used by x402 patterns) | + +### Production Reference Values + +For operators targeting OpenZeppelin's reference scale (~2M tx/day, ~1000 relayers, 11–25 Fargate tasks of 8 vCPU / 16 GB), these are the env-var values OpenZeppelin actually runs in production. Use them as a calibration point; do not blindly copy without sizing your downstream dependencies (Redis, RPC, KMS rate limits) to match. + +```hcl +container_environment = [ + # Worker concurrency — the biggest tuning surface + { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" }, + # Non-Stellar workers parked at 1 (Stellar-only deployment) + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY", value = "1" }, + + # API + plugin concurrency caps + { name = "RELAYER_CONCURRENCY_LIMIT", value = "800" }, + { name = "PLUGIN_MAX_CONCURRENCY", value = "8000" }, # master knob — see below + { name = "MAX_CONNECTIONS", value = "4000" }, + + # Timeouts (production uses longer timeouts than defaults) + { name = "REQUEST_TIMEOUT_SECONDS", 
value = "60" }, + { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" }, + { name = "PLUGIN_GLOBAL_TIMEOUT_MS", value = "55000" }, + { name = "PLUGIN_POLLING_TIMEOUT_MS", value = "45000" }, + + # Rate limits + { name = "RATE_LIMIT_REQUESTS_PER_SECOND", value = "400" }, + { name = "RATE_LIMIT_BURST", value = "500" }, + + # Fee tracking — 100 XLM (1e9 stroops) per API key per 24h + { name = "FEE_LIMIT", value = "1000000000" }, + { name = "FEE_RESET_PERIOD_SECONDS", value = "86400" }, + + # Redis connection pools — sized for 11-task deployment + { name = "REDIS_POOL_MAX_SIZE", value = "3000" }, + { name = "REDIS_READER_POOL_MAX_SIZE", value = "3000" }, + + # Aggressive cleanup of completed transactions (6 minutes) + { name = "TRANSACTION_EXPIRATION_HOURS", value = "0.1" }, + + # SQS polling tuning + { name = "SQS_TRANSACTION_REQUEST_WAIT_TIME_SECONDS", value = "2" }, + { name = "SQS_TRANSACTION_SUBMISSION_WAIT_TIME_SECONDS", value = "2" }, + { name = "SQS_TRANSACTION_REQUEST_POLLER_COUNT", value = "3" }, + { name = "SQS_TRANSACTION_SUBMISSION_POLLER_COUNT", value = "3" }, + + # Contract-level pool isolation (sample contract IDs — substitute your own) + { name = "LIMITED_CONTRACTS", value = "C,C" }, + { name = "CONTRACT_CAPACITY_RATIO", value = "0.6" }, + + # Alternative fund relayer for x402 traffic class (if applicable) + { name = "ALLOWED_FUND_RELAYER_IDS", value = "x402-fund-relayer-id" }, + + { name = "NODE_OPTIONS", value = "--no-deprecation" }, +] +``` + +**`PLUGIN_MAX_CONCURRENCY` is the master knob** for the Channels plugin's worker pool. Per the public-image documentation: it auto-derives most other plugin-internal settings; don't override the derived values. + +### User-Overridable Env Vars + +Anything in `var.container_environment` is merged with the managed list; **user-provided values take precedence**. 
Common overrides: + +```hcl +container_environment = [ + # Channels plugin-specific + { name = "FEE_LIMIT", value = "100000000" }, # stroops per API key per period + { name = "FEE_RESET_PERIOD_SECONDS", value = "86400" }, # 24h rolling window + { name = "LOCK_TTL_SECONDS", value = "30" }, # channel-account lock TTL (range 3-30) + { name = "MAX_FEE", value = "1000000" }, # max stroops per tx + { name = "LIMITED_CONTRACTS", value = "CABC...,CDEF..." }, # contracts with restricted pool access + { name = "CONTRACT_CAPACITY_RATIO", value = "0.2" }, # restrict listed contracts to 20% of pool + + # Worker concurrency overrides (defaults are sane; tune under load) + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "150" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "100" }, + + # RPC failover tuning + { name = "PROVIDER_FAILURE_THRESHOLD", value = "3" }, # consecutive fails before pause + { name = "PROVIDER_PAUSE_DURATION_SECS", value = "60" }, # pause window + { name = "PROVIDER_FAILURE_EXPIRATION_SECS", value = "60" }, # how long failures count + { name = "RPC_TIMEOUT_MS", value = "10000" }, +] +``` + + +**What `LIMITED_CONTRACTS` does.** A small number of Soroban contracts often dominate the channel-account pool's submission queue. In the OpenZeppelin reference deployment, two limited contracts together account for ~99%+ of all transactions, and the top single contract takes 73–97% of daily volume depending on its onchain phase (for example, long-running mining/harvest cycles). Left unmanaged, contracts at this concentration would starve every other contract of channel-account capacity. `LIMITED_CONTRACTS` lists the contract IDs to cap, and `CONTRACT_CAPACITY_RATIO` (between 0 and 1) sets the maximum fraction of the pool those listed contracts may collectively occupy at any one moment. 
**OpenZeppelin runs `CONTRACT_CAPACITY_RATIO=0.6`** in production; high-traffic contracts can take up to 60% of the pool, leaving 40% reserved for everyone else (which keeps long-tail traffic responsive even under sustained mining-protocol load). + +**How to populate the list.** Identify offenders empirically: watch per-contract channel-account-checkout metrics over hours or days (the `cloudwatch-exporter` sidecar exposes per-contract counters), and add any contract whose share of in-flight submissions consistently exceeds the rest of the population by an order of magnitude. The contracts in OpenZeppelin's list are publicly identifiable, high-throughput Soroban protocols whose normal operation generates millions of submissions per day; your offenders will be specific to your customer mix. The placeholder values shown above (`CABC...`, `CDEF...`) are illustrative; substitute your own. + + +### Module-Managed Secrets (from SSM Parameter Store) + +The module creates SSM `SecureString` parameters and wires them into the ECS task's `secrets` block. They are referenced by ARN, not by value, so secret rotation can be done in-place via `aws ssm put-parameter` without touching Terraform. + +| Container env var | SSM parameter | Required? | +| --- | --- | --- | +| `API_KEY` | `/relayer-api-key` | Yes | +| `PLUGIN_ADMIN_SECRET` | `/channels-admin-secret` | Yes | +| `WEBHOOK_SIGNING_KEY` | `/webhook-signing-key` | Conditional (set when `var.webhook_signing_key` non-empty) | +| `STORAGE_ENCRYPTION_KEY` | `/storage-encryption-key` | Conditional (set when `var.storage_encryption_key` non-empty) | + +The Terraform `lifecycle { ignore_changes = [value] }` on these parameters means: once created, Terraform will not overwrite the SSM value if you rotate it out-of-band via AWS CLI or Console. This is intentional; it lets you rotate secrets without doing a Terraform apply. 
+ +### Plugin Configuration (Channels Plugin Runtime) + +Beyond the env vars above, the Channels plugin reads configuration baked into the Docker image via `config/config.json`. The `examples/channels-plugin-example` directory in `openzeppelin-relayer` is the reference. Key sections: + +- `relayers[]`: one entry per relayer ID, including the fund relayer (`channels-fund`) with `concurrent_transactions: true`, and one entry per channel account. The bootstrap workflow creates these via the management API, so the JSON file typically contains only the fund relayer; channels are added dynamically. +- `signers[]`: one signer per relayer. For production, every signer should be `aws_kms` (or `google_cloud_kms` if you're on GCP) with an ED25519 key spec. +- `networks[]`: Stellar network definitions including `rpc_urls`. For production, list two independent providers. +- `notifications[]`: webhook endpoints (signed with `WEBHOOK_SIGNING_KEY`). +- `plugins[]`: the Channels plugin registration; ID is `channels` (matches the API path `/api/v1/plugins/channels/call`). + +The published image ships with an **empty `config.json` stub** (empty `relayers[]`, `signers[]`, `notifications[]`). Operators must provide their own at runtime by mounting `/app/config/config.json`. Two patterns: + +- **Mount a file:** simplest; commit your `config.json` to a private artifact store and mount via ECS task volume. +- **Render at startup from secrets:** more secure; use an entrypoint wrapper that fetches signer keys from AWS Secrets Manager or Vault and writes `/app/config/config.json` before the relayer starts. + + +**Minimal `config.json` shape:** four top-level arrays. + +- `signers[]`: start with one entry: your fund-relayer's `aws_kms` signer (key ARN, region, key-spec `ECC_NIST_EDWARDS25519`). +- `relayers[]`: one entry for the fund relayer (`id: "channels-fund"`, points at the signer above, `concurrent_transactions: true`). 
+- `networks[]`: one Stellar network entry with `rpc_urls[]` weights. +- `plugins[]`: one entry registering the Channels plugin (`id: "channels"`, which is what makes the API path `/api/v1/plugins/channels/call` resolve). + +Channel-account signers and relayers are added at runtime by `oz-channels bootstrap` (section 5.8); they don't need to live in `config.json` at image-build time. + + +### Channel-Account Funding Policy + +Per the `oz-channels bootstrap` defaults: 2 XLM per channel at creation. Each channel needs at minimum the Stellar account reserve (1 XLM = 0.5 XLM × 2 entries) to exist onchain; the extra 1 XLM is operational buffer. Tune via `--starting-balance`. + +The fund relayer pays all fee bumps. Its working balance needs to cover sustained traffic × per-bump fee. At ~23 tx/s sustained and 100 stroops base fee, a multi-day buffer is typically tens to hundreds of XLM depending on congestion-driven fee multipliers. + +### Fee-Bump Policy + +Set via the Channels plugin's `MAX_FEE` env var (default `1000000` stroops = 0.1 XLM). This caps the fee any single transaction can spend. Under network congestion, transactions exceeding this cap are rejected by the plugin rather than submitted with a fee too low to confirm. + +For per-fund-relayer overrides (when `ALLOWED_FUND_RELAYER_IDS` is in use), the `relayer-plugin-channels` README documents per-fund-relayer fee overrides including dynamic inclusion fees. + +--- + +## 7. Operational Playbook + +This section describes routine day-2 operations. The `oz-relayer` and `oz-channels` CLIs (in the `cli/` directory of this repo) are the operator-facing interface for most of these. + +### 7.1: Deploys + +The deploy unit is the container image. The Terraform module is "infra-mostly-stable"; you re-apply it only when adding/removing AWS resources or changing module config. + +**Routine deploy** (new container image): + +1. Push the new image to ECR with a versioned tag (for example, `mainnet-1.3.40`). +2. 
Update `container_image` in tfvars to point at the new tag. +3. `terraform apply`: this triggers an ECS service update with `deployment_minimum_healthy_percent = 100` and `deployment_maximum_percent = 200`, so the service stays available throughout (new tasks come up, then old ones go down). + +**ALB health-check semantics during deploy:** the ALB target group uses `path = /api/v1/health` with 5-second deregistration delay, 60-second interval, 30-second timeout. Tasks are added to the target group only after passing health checks; old tasks are drained gracefully. + +**Canary deployments (the pattern OpenZeppelin runs in production):** the `relayer-channels-infra` public module ships a single ECS service. OpenZeppelin's internal production stack runs **two ECS services behind one ALB**: a stable service (`-service-mainnet`) and a canary service (`-service-canary`), both pointing at the same ECS cluster, ElastiCache Redis, and SQS queue prefix. The ALB's HTTPS listener uses `weighted_forward` across two target groups, currently configured `{stable: 100, canary: 0}` with stickiness enabled (600s duration) so a given caller stays on one variant across requests. + +To roll out a new image with canary: + +1. Push the new image to ECR with a versioned tag (for example, `mainnet-1.4.3`). +2. Update the canary service's `container_image` to the new tag and bump its `desired_count` from the "parked" 0 to a small number (for example, 2 tasks) for a ~10% slice. +3. Shift ALB weights from `{stable: 100, canary: 0}` to `{stable: 90, canary: 10}` (or whatever ramp you want). +4. `terraform apply`. +5. Monitor canary-specific metrics for the agreed bake time (the CloudWatch namespace per task is distinct: `RelayerChannelsMainnetCanaryTransactions` vs `RelayerChannelsMainnetTransactions` for stable). +6. 
Promote: shift weights to `{stable: 0, canary: 100}`, then redeploy stable with the canary's image, then shift weights back to `{stable: 100, canary: 0}` and scale canary back to 0. + +The canary service inherits all the same env vars but with concurrency slightly lower (~150 vs 200 across the worker pools) to limit blast radius if the new image misbehaves. + + +Extending the public `relayer-channels-infra` module to add a canary service is a mechanical addition: copy the existing `ecs_service` block, name it `-canary`, and update the ALB listener's `forward` action to `weighted_forward` with two target groups. The public module doesn't include this today; section 10.10 sketches the key resources and pitfalls. + + +### 7.2: Rollbacks + +To roll back to a previous container image: + +1. Update `container_image` tfvars to the older tag (for example, `mainnet-1.3.39`). +2. `terraform apply`. + +The same blue/green-ish ECS deployment semantics apply: rollback proceeds with `min_healthy = 100`, so the service stays available. + +### 7.3: Scaling Fargate + +The module sets autoscaling bounds via: + +| Variable | Production default | +| --- | --- | +| `desired_count` | 2 | +| `autoscaling_min_capacity` | 2 (defaults to `desired_count`) | +| `autoscaling_max_capacity` | 10 | + +The autoscaling policy details are inherited from the upstream `terraform-aws-modules/ecs/aws//modules/service` module (CPU-based by default). To raise the ceiling under sustained load: + +```hcl +desired_count = 4 +autoscaling_min_capacity = 4 +autoscaling_max_capacity = 20 +``` + +`terraform apply` applies the change without interruption. + +### 7.4: Channel-Pool Management + +The pool grows as your traffic grows. 
Adding to the pool is non-destructive and idempotent: + +```bash +# Existing pool is 1..200; add slots 201..400 +oz-channels bootstrap --from 201 --to 400 -p prod-mainnet + +# List current channels in the plugin's pool +oz-channels channels list -p prod-mainnet + +# Manually adjust (with caution; protected profiles prompt for confirmation) +oz-channels channels add channel-0050 -p prod-mainnet +oz-channels channels remove channel-0050 -p prod-mainnet +``` + +`oz-channels channels set` replaces the entire registered list (destructive); use `add`/`remove` for incremental changes. + +### 7.5: Per-API-Key Fee Budget Management + +The Channels plugin tracks per-API-key fee consumption when `FEE_LIMIT` and `FEE_RESET_PERIOD_SECONDS` are set. To inspect or adjust: + +```bash +oz-channels fee usage -p prod-mainnet # see consumption +oz-channels fee limit -p prod-mainnet # see configured limit +oz-channels fee set-limit 5000000000 -p prod-mainnet # custom limit (stroops) +oz-channels fee delete-limit -p prod-mainnet # remove custom limit +``` + +This is the surface you will expose to per-customer billing reconciliation: when a customer claims they hit limits, this CLI gives you the source of truth. + +### 7.6: Monitoring Queues + +The Terraform module creates three CloudWatch alarms per queue: + +| Alarm | Threshold | Period | +| --- | --- | --- | +| `--high-depth` | 10,000 messages (status-check queues) or 5,000 (others) | 2 consecutive 5-min periods | +| `--dlq-messages` | 100 messages in DLQ | 1 × 5-min period | +| `--old-messages` | `visibility_timeout × 3` (oldest message age) | 1 × 5-min period | + +By default, `alarm_actions = []`; you must wire these alarms to an SNS topic or PagerDuty integration as a post-deploy operator step. The alarm names follow the `-` pattern so a single SNS subscription on those alarm names captures all queue health alerts. + +### 7.7: Monitoring Redis + +ElastiCache emits standard CloudWatch metrics. 
Key signals: + +| Metric | Watch for | +| --- | --- | +| `EngineCPUUtilization` | Spikes above 75% sustained | +| `DatabaseMemoryUsagePercentage` | Climb past 70%; capacity headroom for spikes | +| `ReplicationLag` | > 1s sustained (multi-cluster failover scenarios) | +| `CurrConnections` | Near `maxclients` (default 65000) | + +The module emits Redis slow-log and engine-log to `/aws/elasticache/-redis`. Tail with `aws logs tail`. + +### 7.8: Inspecting Transactions and Relayer State + +The `oz-relayer` CLI is the per-transaction inspection surface: + +```bash +# Transaction details +oz-relayer tx show -r channels-fund -p prod-mainnet --json + +# List recent transactions by status +oz-relayer tx list -r channels-fund --status pending -p prod-mainnet + +# Relayer-level state (balance, sequence) +oz-relayer relayer status channels-fund -p prod-mainnet +oz-relayer relayer balance channels-fund -p prod-mainnet + +# Cancel a pending transaction +oz-relayer tx cancel -r channels-fund -p prod-mainnet +``` + +For programmatic access, every command accepts `--json` for stable, parseable output. + +### 7.9: Post-Restart Checklist + +If you ever restart with `RESET_STORAGE_ON_START=true` (which wipes Redis), you need to redo the following (the service will be up but non-functional until these are done): + +1. **Re-create the signer:** call the `/api/v1/signers` endpoint with your KMS key config +2. **Re-create the fund relayer:** via the relayer API using the new signer ID +3. **Re-run the RPC override:** the PATCH to `/api/v1/networks/stellar:mainnet` with your private providers +4. **Re-bootstrap channels:** `oz-channels bootstrap --to -p ` +5. **Fund the fund relayer:** if the onchain account was recreated, send XLM to the new address + +Normal restarts and redeployments (without `RESET_STORAGE_ON_START=true`) preserve everything in Redis; none of the above is needed. 
+ +### 7.10: Optional Lambda Monitors + +Two opt-in Lambda functions are provided by the module: + +**Balance-check Lambda** (`var.enable_balance_check_lambda = true`): +- EventBridge-scheduled (`balance_check_schedule`, default `rate(5 minutes)`) +- Polls the relayer API for fund/channel balances; emits CloudWatch metrics +- Source: `relayer_balance.mjs` in the module + +**ECS restart-on-alarm Lambda** (`var.enable_restart_on_alarm_lambda = true`): +- Subscribes to CloudWatch alarms (you wire which alarms it listens to) +- On alarm `OK → ALARM` transition, forces an ECS service `update-service --force-new-deployment` to rebuild tasks +- Use sparingly; a flapping alarm can cause restart loops. Source: `restart_ecs_on_alarm.mjs` + +--- + +## 8. Debugging Guide + +When a transaction fails, times out, or behaves unexpectedly, locate the failure in the request lifecycle before pulling logs. + +Every transaction follows two paths. The synchronous path covers everything from the client request through auth, fee-budget check, and SQS enqueue, and returns a `tx_id` as the 202 acknowledgment. The async path covers everything after: channel acquisition, transaction build and simulation, channel signing, fund-account fee-bump, RPC submission, and status polling to confirmation. Match the symptom to the path before opening CloudWatch. + +If the request never returned a `tx_id`, the failure is in the synchronous path. Check ECS service events, ALB target health, and the relayer logs for the inbound request. If the request returned a `tx_id` but the transaction never confirmed, the failure is in the async path. Start with `oz-relayer tx show ` to get the transaction's current state, then trace from there. + +Pool exhaustion, sequence drift, and an RPC throttle can all present as "transactions are failing" from the outside; each lives in a different layer and has a different fix. 
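The two-path triage described above can be written down as a small decision helper. This is a sketch under stated assumptions, not part of the `oz-relayer` CLI: the function name `triage_path` and its arguments are hypothetical, and the status string `confirmed` is assumed to match what `oz-relayer tx show` reports in your deployment.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper; the status values are assumptions,
# so adjust them to whatever your `oz-relayer tx show` output uses.
# Usage: triage_path "<tx_id or empty>" "<status or empty>"
triage_path() {
  local tx_id="$1" status="${2:-}"
  if [ -z "$tx_id" ]; then
    # No tx_id came back: the request failed in auth, fee budget, or enqueue.
    echo "synchronous: check ECS service events, ALB target health, relayer logs"
  elif [ "$status" != "confirmed" ]; then
    # A tx_id exists but the transaction has not confirmed.
    echo "async: start with oz-relayer tx show for this tx_id"
  else
    echo "confirmed: no relayer-side failure to trace"
  fi
}

triage_path "" ""
triage_path "f4a33a34-6a0f-491a-8654-821efcea35ec" "submitted"
```

Deciding the path first keeps you from reading async worker logs for a request that never reached the queue.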
+ +**Failure Taxonomy** + +| Symptom | Layer | First action | +| --- | --- | --- | +| No `tx_id` returned | Synchronous: auth, fee budget, or enqueue | Check ECS service events and ALB target health; tail relayer logs for the inbound request | +| `tx_id` returned, never confirmed | Async: channel acquire, build, sign, submit, or poll | `oz-relayer tx show -r channels-fund --json -p ` | +| `POOL_CAPACITY` errors | Channel pool exhausted | §10.1; then bootstrap more channels | +| `INSUFFICIENT_FEE` / stuck at `submitted` | Fee ceiling below network floor | §10.2; raise `MAX_FEE` | +| `TRY_AGAIN_LATER` in logs | Horizon throttle or per-channel saturation | §10.3; check fund account balance and RPC provider health | +| `provider paused` in logs | RPC failover triggered | §8.4; query each RPC provider's health endpoint | +| Sequence errors / `LOCKED_CONFLICT` | Redis sequence counter drift or lock contention | §8.7; inspect the affected channel's Redis key | +| DLQ accumulation | Repeated worker failures | §10.5; inspect DLQ messages for the root error | + +The debugging workflow correlates data across three sources: the relayer API, CloudWatch logs, and the Stellar Horizon API. + +### 8.1: Pick Your Entry Point + +| You have | Start with | +| --- | --- | +| Transaction ID (UUID) | `oz-relayer tx show -r channels-fund --json -p ` | +| Error message | Search logs for the error pattern | +| Time window | Tail logs for that period | +| Stellar tx hash | Query Horizon, then work backwards to the relayer's tx record | +| "What's failing right now" | Run the error-aggregation workflow (Section 8.5) | + +### 8.2: Transaction-ID to Request-ID Correlation + +Relayer logs are JSON with a `span` field containing correlation identifiers: + +```json +{ + "timestamp": "2026-01-28T16:17:45.595556Z", + "level": "DEBUG", + "fields": { "message": "..." 
}, + "target": "openzeppelin_relayer::domain::transaction::stellar::submit", + "span": { + "job_type": "TransactionRequest", + "relayer_id": "channels-fund", + "request_id": "Some(\"req-1769617060067-9\")", + "tx_id": "f4a33a34-6a0f-491a-8654-821efcea35ec" + } +} +``` + +The workflow: + +1. Get the transaction record: `oz-relayer tx show -r channels-fund --json -p `: gives you `created_at` (for log time-window) and the full state machine history. +2. Filter logs by `tx_id`: + + ```bash + aws logs filter-log-events \ + --log-group-name "/aws/ecs/-/task" \ + --start-time $(date -u -v-30M +%s000) \ + --filter-pattern "" + ``` + +3. Extract `request_id` from any matching log span. +4. Filter logs by `request_id` for the complete cross-task flow (a single request can hop between tasks via SQS). + +**Correlation Flow at a Glance:** + +```mermaid +flowchart TD + A1["Entry: have a tx-id
oz-relayer tx show <tx-id>
-r channels-fund --json -p <env>

Yields: status, created_at, sent_at,
confirmed_at, hash, sequence_number, fee"] + + A2["Entry: have an error / time window
aws logs filter-log-events
--filter-pattern '<error>'
--start-time <epoch-ms>

Yields: JSON log entries"] + + Log["CloudWatch log entry (JSON)
timestamp · level · target (Rust module)
span: { job_type, relayer_id, request_id, tx_id }"] + + Req["Filter by request_id
aws logs filter-log-events
--filter-pattern '<request_id>'

→ full cross-task flow, sorted by timestamp"] + + Stage["Stage map · match span.target
::transaction_request_handler — API → SQS pickup
::stellar::prepare — build + simulate
::transaction_counter_redis — sequence mgmt
::stellar::submit — RPC submission
::status_check_handler — confirmation polling
::notification_handler — webhook delivery"] + + Horizon["Cross-check on-chain (Horizon)
GET /transactions/<hash>
GET /accounts/<source-addr> | jq .sequence
GET /fee_stats | jq '{ledger_capacity_usage, fee_charged, max_fee}'"] + + Root["Common root causes
• RPC degradation — 'provider paused' in logs
• Fee uncompetitive — fee_stats.max_fee > MAX_FEE
• Sequence drift — counter < on-chain sequence
• Pool exhaustion — POOL_CAPACITY
• Lock conflict — LOCKED_CONFLICT
• Horizon rate limit — TRY_AGAIN_LATER
• Tx expired — time_bounds.max < now"] + + A1 --> Log + A2 --> Log + Log -->|extract span.request_id| Req + Req --> Stage + Stage --> Horizon + Horizon --> Root +``` + +### 8.3: Key Log Targets + +When grepping logs, these targets point at specific stages: + +| Log target | Stage | +| --- | --- | +| `openzeppelin_relayer::domain::transaction::stellar::prepare` | Transaction preparation (build + simulate) | +| `openzeppelin_relayer::domain::transaction::stellar::submit` | Transaction submission to Soroban RPC | +| `openzeppelin_relayer::repositories::transaction_counter::transaction_counter_redis` | Sequence counter updates | +| `openzeppelin_relayer::jobs::handlers::transaction_request_handler` | Job pickup from SQS | + +### 8.4: Common Log Patterns to Search + +| Pattern | Indicates | +| --- | --- | +| `provider paused` | RPC failover triggered; one or more providers degraded | +| `error`, `failed`, `timeout` | Generic failure terms | +| `sequence`, `counter` | Sequence-number drift or contention | +| `max_fee.*insufficient` | Fee bump below network minimum during congestion | +| `POOL_CAPACITY` | Channel-account pool exhausted | +| `LOCKED_CONFLICT` | Two workers attempted to acquire the same channel | +| `TRY_AGAIN_LATER` | Horizon-side throttling | + +### 8.5: Error-Aggregation Workflow + +When the question is "what's failing right now": + +1. Query the last hour of logs filtering on error severity: + + ```bash + aws logs filter-log-events \ + --log-group-name "/aws/ecs/-/task" \ + --start-time $(date -u -v-1H +%s000) \ + --filter-pattern '{ $.level = "ERROR" }' + ``` + +2. Categorize errors by stage (fee / sequence / timeout / RPC / signer). +3. Count by transaction status: are these tx that never submitted, submitted but didn't confirm, or confirmed-but-the-caller-saw-an-error? +4. Identify temporal clustering; bursts often correlate with provider events or deploys. + +### 8.6: Stuck-Transaction Workflow + +For a transaction that never confirms: + +1. 
Check tx status: `oz-relayer tx show ... --json`. If `submitted` with a hash, the relayer believes it sent. +2. Check onchain state: `curl https://horizon.stellar.org/transactions/`: does Horizon see it? +3. Compare fee competitiveness: + + ```bash + curl -s https://horizon.stellar.org/fee_stats | jq '{ledger_capacity_usage, fee_charged, max_fee}' + ``` + + If your fee bump is below `max_fee.mode`, the transaction may sit in the mempool until expiry. + +4. Look for `TRY_AGAIN_LATER` or `provider paused` near the submission timestamp. +5. Verify the transaction's `time_bounds.max` hasn't passed. + +### 8.7: Redis Sequence-Counter Inspection + +The relayer maintains per-account sequence counters in Redis under the key pattern: + +``` +relayer:transaction_counter:channels-fund: +``` + +To inspect (requires a bastion or AWS Session Manager into a task with Redis access; ElastiCache is VPC-scoped): + +```bash +# From inside a task or bastion with VPC access +redis-cli -h --tls +> GET relayer:transaction_counter:channels-fund:GCABCDEF... +``` + +If the relayer's counter is behind the onchain sequence (for example, after a restart or Redis snapshot restore), the next submission will fail with a sequence-conflict error. The relayer has self-healing for this via the `RelayerHealthCheck` job type, but acute drift may need a manual restart of the affected task. + +### 8.8: When to Escalate to ECS Exec + +The Fargate tasks have `enable_execute_command = true`. You can shell into a running task: + +```bash +aws ecs execute-command \ + --cluster \ + --task \ + --container \ + --interactive \ + --command "/bin/sh" +``` + +Use sparingly; production tasks should not need this for routine operations. Common legitimate uses: capturing a network trace during a suspected RPC issue, inspecting in-process state when logs don't tell a clear story. + +--- + +## 9. Security Model + +### 9.1: Secrets Handling + +All secrets are stored as `SecureString` in AWS SSM Parameter Store. 
The ECS task references them by ARN in the task definition's `secrets` block; ECS injects them into container environment variables at task start. No secret ever appears in: + +- The container image +- Terraform state (only ARNs) +- ECS task definition JSON +- CloudWatch logs (unless your application logs them, which the relayer does not) + +The Terraform `lifecycle { ignore_changes = [value] }` on SSM parameters means that, once provisioned, secrets can be rotated directly via `aws ssm put-parameter --overwrite` without involving Terraform. + +**Rotation procedure for `API_KEY`:** + +```bash +aws ssm put-parameter \ + --name "/-/relayer-api-key" \ + --value "$(openssl rand -hex 32)" \ + --type SecureString \ + --overwrite +``` + +Then force a task restart so the new value is picked up: + +```bash +aws ecs update-service \ + --cluster \ + --service \ + --force-new-deployment +``` + +### 9.2: Network Isolation + +- **ALB ingress:** when Cloudflare is enabled, the ALB security group is restricted to Cloudflare's published IP ranges (the module pulls these from Cloudflare's API), and direct public ingress to the ALB is blocked. When Cloudflare is disabled, populate `alb_allowed_ipv4_cidrs` explicitly: the default empty allow-list permits `0.0.0.0/0`. +- **ALB egress:** restricted to the VPC CIDR (only the ECS service can be reached). +- **ECS task egress:** allowed to `0.0.0.0/0` (the task needs to reach Soroban RPC, Horizon, AWS APIs, and any webhook destinations). +- **ECS task ingress:** restricted to the ALB security group on the container port; the metrics port (8081) is `self` only (sidecar containers only). +- **Redis:** security group allows ingress on port 6379 from the VPC CIDR only. In-transit encryption is enabled (`transit_encryption_enabled = true`, mode `preferred`). +- **SQS:** queues are private by default. The ECS task IAM role has scoped access (`SendMessage`, `ReceiveMessage`, `DeleteMessage`, etc.) on resources matching `-*`. 
+ +### 9.3: IAM Least-Privilege + +The ECS task IAM role is scoped to: + +- `ssm:GetParameter*` on `arn:aws:ssm:::parameter/-/*` +- `sqs:*` on the relayer's queue ARN pattern +- `logs:CreateLogStream`, `logs:PutLogEvents`, `logs:CreateLogGroup` on the relayer's log group +- `cloudwatch:PutMetricData` (unscoped; CloudWatch metrics namespacing happens application-side) +- `aps:RemoteWrite` on the AMP workspace (when Prometheus is enabled) +- `ssmmessages:*` for ECS Exec + +The ECS execution IAM role additionally has: + +- `ssm:GetParameters` on the same SSM prefix (used to inject secrets at task start) + +No `kms:*` permissions are granted to the task role by this module; KMS access for signers is configured at the relayer-config layer per signer (operator provisions a KMS key per fund relayer or per channel-account signer). + +**OpenZeppelin's production pattern for KMS access:** rather than granting `kms:Sign` and `kms:GetPublicKey` permissions through the ECS task IAM role, the production deployment attaches the task role's principal to the **KMS key's resource policy**. This puts authorization on the key (and thus auditable in the key's CloudTrail data plane) rather than on the role. Either pattern works; the resource-policy approach scales better when you have many keys (one per fund relayer or per channel signer at scale). + +```hcl +# Sketch — attach the ECS task role to a KMS key's resource policy +resource "aws_kms_key_policy" "signer_key" { + key_id = aws_kms_key.signer.id + policy = jsonencode({ + Statement = [ + { + Sid = "AllowRelayerTaskRoleSign" + Effect = "Allow" + Principal = { + AWS = [module.relayer_channels.ecs_task_role_arn] + } + Action = ["kms:Sign", "kms:GetPublicKey", "kms:DescribeKey"] + Resource = "*" + } + ] + }) +} +``` + +### 9.4: TLS Posture + +- **ALB:** TLS 1.3 policy (`ELBSecurityPolicy-TLS13-1-2-2021-06`), HTTPS on 443, HTTP redirects to HTTPS with 301. 
+- **Redis:** in-transit encryption enabled, mode `preferred` (clients may connect over TLS or not; operator can tighten to `required` if all clients support TLS). +- **Cloudflare to ALB:** the **edge certificate** (client-facing TLS) is issued automatically by Cloudflare the moment a DNS record for your domain is created in the zone; there is no manual cert-provisioning step and no Terraform plumbing required for it (Universal SSL handles this). For end-to-end TLS between Cloudflare and your ALB, set the **zone SSL mode** to "Full (strict)" so Cloudflare validates the ALB's ACM cert on the back-half. The SSL-mode setting is independent of edge-cert issuance and is configured either via the Cloudflare dashboard or under Terraform control via the Cloudflare provider (`cloudflare_zone_settings_override`). + +### 9.5: Cloudflare-Side Auth Pattern + +The Worker (`worker.mjs`) does two distinct things to inbound requests: + +1. **User-key validation:** caller-provided `Authorization: Bearer ` is hashed (`SHA-256(KEY_SALT:user-key)`) and looked up in KV. Match required, scope (mainnet/testnet) enforced. +2. **Upstream injection:** the request is rewritten before forwarding to the ALB: `Authorization` becomes `Bearer `, and a new `x-consumer-key: ` header is added. + +This means: +- The upstream relayer always sees the static API key (one secret to manage). +- The relayer's Channels plugin uses `x-consumer-key` for per-caller fee tracking (configured via `API_KEY_HEADER=x-consumer-key`). +- A user key being compromised does not compromise the upstream relayer's auth; only that user's quota. + +**Key salt rotation:** `KEY_SALT` is the salt mixed into the hash. Rotating it invalidates all existing user keys (they hash to a different value). Plan key-salt rotations as a forced re-issue event; communicate ahead, run `/gen` again, and accept downtime for callers who don't re-fetch. 
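
To make the salt-rotation consequence concrete, here is a minimal Python sketch of the user-key lookup described in step 1. It assumes the digest is computed over the literal string `KEY_SALT:user-key`, encoded as UTF-8 and stored hex-encoded; the real `worker.mjs` may encode or store it differently. The dictionary standing in for Cloudflare KV and the names `hash_user_key` and `lookup` are hypothetical.

```python
import hashlib
from typing import Optional


def hash_user_key(key_salt: str, user_key: str) -> str:
    """Return SHA-256("<salt>:<user-key>") as hex (encoding is an assumption)."""
    return hashlib.sha256(f"{key_salt}:{user_key}".encode("utf-8")).hexdigest()


def lookup(kv: dict, key_salt: str, user_key: str) -> Optional[dict]:
    """Look up a caller's record by the salted hash of their key.

    `kv` is a plain dict standing in for the Worker's KV namespace.
    """
    return kv.get(hash_user_key(key_salt, user_key))


# A key issued while the salt was "old-salt" is stored under its salted hash.
kv_store = {hash_user_key("old-salt", "user-key-123"): {"scope": "mainnet"}}

# Same salt: the hash matches and the record is found.
found = lookup(kv_store, "old-salt", "user-key-123")

# Rotated salt: the same user key hashes to a different value, so the lookup misses.
missing = lookup(kv_store, "new-salt", "user-key-123")
```

Because only hashes are stored, there is no way to re-derive the old entries under a new salt, which is why a salt rotation has to be handled as a forced re-issue.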
+ +### 9.6: KMS for Stellar Signers + +Channel-account and fund-relayer signers should use AWS KMS for production. Per signer: + +- **KMS key spec:** `ECC_NIST_EDWARDS25519`: the AWS KMS asymmetric-sign Ed25519 curve. This is what Stellar requires. Supported signing algorithms on this KeySpec: `ED25519_SHA_512` and `ED25519_PH_SHA_512` (pre-hashed variant). For comparison, EVM signers use `ECC_SECG_P256K1` (secp256k1). +- **IAM grants on the KMS key:** `kms:Sign` and `kms:GetPublicKey` to the ECS task role's principal. +- **CloudTrail Data Access logging** should be enabled on the key for compliance; every signature is then auditable with caller IAM principal, timestamp, request ID, and outcome. + +For rotation, follow the side-by-side procedure: provision a new KMS key, register a new signer/relayer ID, fund the new onchain account, drain the old account, then retire it. On Stellar the onchain account address is derived from the signer's public key, so rotation always means a new account. + +### 9.7: What Is NOT in This Module + +- KMS keys for signers (operator provisions per signer) +- VPC, subnets, NAT gateway (operator provides; the module attaches to your existing VPC) +- Bastion / Session Manager access to Redis (operator's choice of access pattern) +- WAF rules on the ALB (operator's choice; module provides ingress IP restriction only) +- Multi-region replication of the deployment (single-region only) + +--- + +## 10. Key Gotchas + +### 10.1: Channel-Account Exhaustion (`POOL_CAPACITY`) + +**Symptom:** API responses contain `POOL_CAPACITY` errors; logs show requests waiting on channel acquisition. + +**Root cause:** the channel pool is too small for the current number of concurrent in-flight transactions. + +**Sizing formula:** + +``` +min_pool = ceil(target_TPS × avg_settlement_seconds × safety_factor) +``` + +For Stellar, with ~5s settlement, use `safety_factor = 1.5–2.0`. At 23 TPS sustained, `ceil(23 × 5 × 1.5) = ceil(172.5) = 173` channels minimum. 
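
The sizing formula above can be sketched in a few lines of Python. This is an illustration only: the function name `min_pool_size` and its input checks are assumptions made for this example, not part of the `oz-channels` CLI or the Terraform module.

```python
import math


def min_pool_size(target_tps: float,
                  avg_settlement_seconds: float,
                  safety_factor: float = 1.5) -> int:
    """Minimum channel-account pool size.

    Implements: ceil(target_TPS * avg_settlement_seconds * safety_factor).
    """
    if target_tps <= 0 or avg_settlement_seconds <= 0:
        raise ValueError("target_tps and avg_settlement_seconds must be positive")
    if safety_factor < 1:
        raise ValueError("safety_factor below 1 removes all headroom")
    return math.ceil(target_tps * avg_settlement_seconds * safety_factor)


# Worked example from the text: 23 TPS, ~5s settlement.
low_end = min_pool_size(23, 5, 1.5)   # lower end of the 1.5-2.0 range
high_end = min_pool_size(23, 5, 2.0)  # upper end of the 1.5-2.0 range
```

With the figures used above, the 1.5 safety factor gives 173 channels and the 2.0 safety factor gives 230, so the recommended range for 23 TPS is roughly 173 to 230 channel accounts.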
+ +**Recovery:** `oz-channels bootstrap --from --to ` adds capacity incrementally. + +**Prevention:** alert on pool utilization above 80%, not just on `POOL_CAPACITY` errors. + +### 10.2: Fee-Bump Tuning Under Congestion + +**Symptom:** transactions submitted successfully but never confirm; Horizon `fee_stats` shows `max_fee.mode` above your configured `MAX_FEE`. + +**Root cause:** the Stellar fee market shifted above your static fee ceiling, so submissions are stuck at `INSUFFICIENT_FEE` (or sit unconfirmed until they expire). + + +**Channels Fee Policy: Read This Before Tuning.** + +The Channels plugin uses **static fee values for both limited and non-limited contracts** (the single `MAX_FEE` setting applies to everything). On `INSUFFICIENT_FEE`, the plugin **does not dynamically bump the fee**; it simply resubmits the transaction at the same fee until it confirms or expires. + +This is a deliberate policy. Because channels absorbs the inclusion fee on behalf of callers, automatic fee-bumping on `INSUFFICIENT_FEE` would cause the service's own in-flight transactions to compete against each other on price, dragging the effective fee floor up for the whole pool and turning every congestion spike into a self-inflicted fee-escalation spiral. Static fee + resubmit keeps the price floor an operator-controlled knob rather than a market-driven one, particularly important for a service that is free at the API boundary. + + +**Recovery:** raise `MAX_FEE` in the container env vars and re-deploy. Range to consider: `1,000,000` (0.1 XLM, default) up to `10,000,000` (1 XLM) for sustained congestion. The change applies uniformly across limited and non-limited contracts; there is no per-class fee override. + +**Prevention:** + +- Alert on transaction confirmation lag exceeding your SLA; sustained lag is the leading indicator that the market has overtaken `MAX_FEE`. +- Periodically diff Horizon `/fee_stats.max_fee.mode` against your configured `MAX_FEE`. 
If the market mode sits above your ceiling for more than a few minutes, in-flight submissions are likely stuck. +- Treat `MAX_FEE` as a control-plane setting, not a per-transaction parameter. Re-evaluate it during congestion events and after any sustained mainnet-wide fee shifts. + +### 10.3: `TRY_AGAIN_LATER` from RPC (Provider Throttle and Per-Channel Saturation) + +**Symptom:** `TRY_AGAIN_LATER` errors from RPC, or `provider paused` log entries. + +**Root cause:** two distinct origins, only one of which is a provider-side rate limit. + +1. **Provider throttling:** your RPC provider's per-key rate limit is being hit. The Terraform module does not configure RPC URLs (that lives in the relayer config inside the Docker image), so the recovery lever is at the relayer-config layer, not the infra layer. +2. **Per-channel saturation:** Stellar RPC also returns `TRY_AGAIN_LATER` when a single channel account (one relayer) has more in-flight transactions than the network or provider will accept on its behalf. This has been observed both during channel-account creation/scaling and during steady-state high-throughput broadcasting. + + +**Plugin auto-mitigation.** On `TRY_AGAIN_LATER`, the Channels plugin pulls a different *idle* channel from the pool (stack-based selection) and resubmits the transaction on that one. In steady state this manifests as a brief retry, not a user-visible failure; but if the entire pool is saturated and no idle channel is available, the error surfaces to the caller. + + +**Recovery:** + +- *Provider throttling:* add a secondary provider via `custom_rpc_urls` per-relayer or `rpc_urls` per-network. Use weights to load-balance. +- *Pool saturation:* bootstrap more channel accounts via `oz-channels bootstrap --to `. The plugin spreads load across the wider pool automatically. + +**Prevention:** + +- Always run at least two independent RPC providers for mainnet. Negotiate rate limits at peak load against your projected throughput. 
+- Size the channel-account pool with headroom; if peak in-flight transaction count routinely exceeds ~70% of pool size, you are a burst away from saturation and the per-channel path will start surfacing to callers. + +### 10.4: Redis Sizing Under Burst + +**Symptom:** `EngineCPUUtilization` or `DatabaseMemoryUsagePercentage` near 100%; queue depths climbing. + +**Root cause:** the `cache.r7g.large` default may be undersized for sustained 23 TPS with the full ~1000-relayer pool. + +**Recovery:** scale up: `redis_node_type = "cache.r7g.xlarge"` or larger; `terraform apply`. ElastiCache supports online resize but expect a brief failover during the operation if `redis_num_cache_clusters > 1`. + +**Prevention:** alert when memory usage exceeds 70%; baseline CPU during expected peak before sizing. + +### 10.5: SQS DLQ Accumulation + +**Symptom:** the `--dlq-messages` CloudWatch alarm fires. + +**Root cause:** a class of messages is being received more than `max_receive_count` times (6 for most queues; 2 for `transaction-submission`; 1000 for status-check variants). The 2-receive limit on `transaction-submission` is intentional; submission failures should not retry many times. + +**Recovery:** +- Inspect DLQ messages: `aws sqs receive-message --queue-url --max-number-of-messages 10`. +- The most common DLQ pattern is `transaction-submission` failures from a sustained RPC outage. Investigate root cause; if it's transient, you can re-drive from DLQ back to the main queue: + ```bash + aws sqs start-message-move-task --source-arn --destination-arn + ``` +- For persistent failures, delete the messages and root-cause the underlying issue. + +**Prevention:** wire the DLQ alarm to a high-priority pager; DLQ growth is rarely transient. + +### 10.6: Cloudflare KV Rate Limits + +**Symptom:** users reporting `/gen` failures or `Forbidden Access` errors at low rates. + +**Root cause:** the Worker enforces `gen_ip_rate_hour` (default 2) on the `/gen` endpoint. 
A user behind a shared NAT may share an IP with other API-key generators. + +**Recovery:** raise `gen_ip_rate_hour` cautiously. Higher values reduce the friction for legitimate users but reduce anti-abuse effectiveness. + +**Prevention:** if you front the service with multiple IP egress points (for example, a corporate VPN), set IP-aware rate limits in Cloudflare's WAF rather than in `worker.mjs`. + +### 10.7: `API_KEY_HEADER` Mismatch + +**Symptom:** users report `Unauthorized` from the relayer even though Cloudflare accepts their key. + +**Root cause:** if you have overridden `API_KEY_HEADER` to something other than `x-consumer-key`, the Cloudflare Worker (which still sends `x-consumer-key`) and the relayer's plugin (looking for whatever `API_KEY_HEADER` says) are mismatched. + +**Recovery:** either leave `API_KEY_HEADER` at the module default (`x-consumer-key`) or update the Worker code in lockstep. + +### 10.8: Bootstrap Gap Detection + +**Symptom:** `oz-channels bootstrap --from 20 --to 25` errors with "Gap detected in slot sequence: 11-19". + +**Root cause:** the bootstrap guards against accidentally provisioning a sparse pool, which can complicate ops. + +**Recovery:** if the gap is intentional, pass `--allow-gaps`. Otherwise fill the gap first. + +### 10.9: Cloudflare Worker Deployment Is 100% Strategy + +**Symptom:** none, but be aware. + +**Root cause:** the Terraform module deploys the Worker via `cloudflare_workers_deployment` with `strategy = "percentage"` and `versions = [{ percentage = 100, version_id = ... }]`. There is no canary at the Cloudflare-Worker level. + +**Recovery:** if you want gradual Worker rollout, deploy two `cloudflare_worker_version` resources and adjust the `percentage` field over time. Doing this from Terraform alone is awkward; typically operators do this via Wrangler CLI for Workers and treat the Terraform-managed version as the production reference. 
+ +### 10.10: Canary Deployment at the ECS Layer + +**The public `relayer-channels-infra` Terraform module is a single ECS service with autoscaling.** OpenZeppelin's internal production stack adds a second ECS service for canary traffic; that pattern is documented in section 7.1 of this guide. Recap: + +- Two ECS services in one cluster: `-service-mainnet` (stable) and `-service-canary`. Both share Redis, SQS, secrets, and IAM. +- One ALB with HTTPS listener using `weighted_forward` across two target groups. Default split `{stable: 100, canary: 0}`; canary parked at `desired_count = 0`. +- Stickiness enabled (600s) so a caller stays on one variant per session. +- Promote/rollback by adjusting target-group weights and image tags. + +**Building this on top of the public module:** copy the existing `ecs_service` resource and parameterize the variant name (for example, add a `variant` input that defaults to `mainnet` and accepts `canary`); declare a second target group bound to the canary service; and change the ALB listener's `forward` action to `weighted_forward` across the two target groups, with stickiness enabled. Park the canary at `desired_count = 0` initially; raise it when promoting. Until the public module ships canary natively, this is operator-side composition. + +**Pitfalls to plan around:** + +- **Shared Redis means state mixes.** A bad canary that corrupts state in Redis affects all callers. Keep canary images conservative; only promote *fully validated* image candidates to canary, not first-pass builds. +- **Stickiness blinds you to small-percentage canary issues.** If your canary takes 5% of traffic but 95% of users get stuck to the stable variant, you only learn from 5% of caller behavior. Decide canary bake-times accordingly; at 10% canary, plan a 6-hour minimum bake; at 25%, a 2-hour minimum. 
+- **Auto-restart Lambda and canary is dangerous.** If you wire the optional ECS-restart-on-alarm Lambda (section 7.10) to canary alarms, a flapping canary will force restarts that mask the underlying issue. Either disable the Lambda for canary alarms or set very conservative alarm thresholds. + +--- + +## 11. Appendix + +### Reference Repositories + +| Repo | Purpose | +| --- | --- | +| `OpenZeppelin/relayer-channels-infra` | Terraform modules: the deployment unit | +| `OpenZeppelin/openzeppelin-relayer`, `examples/channels-plugin-example` | Source of the Docker image that runs in Fargate | +| `OpenZeppelin/relayer-plugin-channels` | The Channels plugin runtime (TypeScript) | + +### Reference: SQS Queue Tuning Summary + +| Queue | Visibility timeout | Max receives | Purpose | +| --- | --- | --- | --- | +| `transaction-request` | 300s | 6 | Initial transaction requests | +| `transaction-submission` | 120s | 2 | Submission to Stellar RPC | +| `status-check` | 300s | 1000 | Generic status polling | +| `status-check-evm` | 300s | 1000 | EVM status polling (unused for Stellar-only deployments) | +| `status-check-stellar` | 300s | 1000 | Stellar status polling | +| `notification` | 180s | 6 | Webhook delivery | +| `token-swap-request` | 300s | 6 | Token swap processing (Solana-specific; unused for Stellar) | +| `relayer-health-check` | 300s | 6 | Periodic relayer-state probes | + +The `max_receive_count = 1000` on status-check queues reflects that status polling is expected to retry many times before a transaction confirms; the `max_receive_count = 2` on submission queues reflects that submission failures should not retry indefinitely. These values are baked into `sqs.tf` and not exposed as Terraform variables; change them by modifying the module. 
+ +### Reference: Module Outputs + +| Output | Description | +| --- | --- | +| `ecs_cluster_name` | Name of the ECS cluster | +| `ecs_cluster_arn` | ARN of the ECS cluster | +| `ecs_service_name` | Name of the ECS service | +| `ecr_repository_name` | Name of the ECR repository (if created by the module) | +| `ecr_repository_url` | URL of the ECR repository | +| `alb_dns_name` | DNS name of the Application Load Balancer | +| `domain_name` | The configured domain name for the service | +| `acm_certificate_arn` | ARN of the ACM certificate in use | +| `redis_primary_endpoint` | Primary endpoint for Redis writes | +| `redis_reader_endpoint` | Reader endpoint for Redis reads | +| `sqs_queue_urls` | Map of queue name to URL for all 8 SQS queues | +| `prometheus_workspace_id` | ID of the Amazon Managed Prometheus workspace (if enabled) | +| `prometheus_endpoint` | Remote-write endpoint for the Prometheus workspace | +| `ssm_parameter_prefix` | SSM Parameter Store path prefix used by the module | +| `cloudflare_worker_name` | Name of the Cloudflare Worker (if Cloudflare is enabled) | + +### Reference: Environment-Based Defaults + +Variables whose defaults differ by the `environment` input variable value: + +| Variable | `prod` | All other environments | +| --- | --- | --- | +| `desired_count` | 2 | 1 | +| `autoscaling_max_capacity` | 10 | 4 | +| `redis_node_type` | `cache.r7g.large` | `cache.t4g.medium` | +| `redis_num_cache_clusters` | 2 | 1 | +| `alb_deletion_protection` | `true` | `false` | +| `log_retention_days` | 30 | 7 | +| `task_log_retention_days` | 365 | 7 | +| Resource name suffix | _(none)_ | `-` | + +### Reference: Conditional Resource Creation + +Resources provisioned only when specific inputs are set: + +| Condition | Resources created or skipped | +| --- | --- | +| `container_image = ""` | Creates an ECR repository; otherwise uses the supplied image URI | +| `acm_certificate_arn = ""` | Requests a new ACM certificate via DNS validation; otherwise uses the 
provided ARN | +| `enable_cloudflare = true` | Deploys a Cloudflare Worker and wires the ALB behind Cloudflare | +| `enable_cloudflare = false` | ALB is exposed directly without Cloudflare proxy | +| `enable_balance_check_lambda = true` | Deploys the fund-relayer balance check Lambda and its CloudWatch alarm | +| `enable_restart_on_alarm_lambda = true` | Deploys the ECS restart Lambda triggered by CloudWatch alarms | +| `enable_cloudwatch_exporter = true` | Deploys the CloudWatch metrics exporter sidecar | +| `enable_prometheus = true` | Creates an Amazon Managed Prometheus workspace and configures the exporter | +| `webhook_signing_key != ""` | Stores the key in SSM and enables webhook HMAC signature verification | +| `storage_encryption_key != ""` | Stores the key in SSM and enables Redis data encryption at rest | +| `alb_access_logs_bucket != ""` | Enables ALB access logging to the specified S3 bucket | + +--- + +## Support and Feedback + +For questions on this guide, deployment issues, or improvements to the reference repositories, contact OpenZeppelin engineering through your established channel. Public-facing community channels: + +- OpenZeppelin Forum +- Issues on the relevant reference repository (Terraform module: `relayer-channels-infra`; plugin: `relayer-plugin-channels`; relayer core: `openzeppelin-relayer`) diff --git a/content/relayer/guides/stellar-relayer-gcp-operator-guide.mdx b/content/relayer/guides/stellar-relayer-gcp-operator-guide.mdx new file mode 100644 index 00000000..12d9b30b --- /dev/null +++ b/content/relayer/guides/stellar-relayer-gcp-operator-guide.mdx @@ -0,0 +1,1004 @@ +--- +title: 'Hosted Stellar Relayer on GCP: Operator Deployment Guide' +--- + +This guide covers deploying and operating the Stellar Relayer Channels service on GCP. 
The infrastructure runs on Cloud Run backed by Memorystore Redis, Pub/Sub for distributed job processing, and Cloud KMS for transaction signing, with optional Cloudflare Workers for API-key management and per-user rate limiting. + +Work through the deployment steps in order; each step produces configuration or keys that later steps depend on. For the AWS deployment, see the [AWS Operator Deployment Guide](/relayer/guides/stellar-relayer-aws-operator-guide). + +--- + +## 1. Architecture + +The service connects several GCP-managed components into a single transaction processing pipeline. Understanding this layout helps with capacity planning and narrows the search space when diagnosing failures. Most operational issues map to one specific layer. + +### 1.1. Cloud Architecture + +```mermaid +flowchart TD + Callers([Public callers]) + + subgraph Edge["Edge (Cloudflare, optional)"] + Worker["Cloudflare Worker
• /gen + /testnet/gen — issues API keys
• KV-backed auth, hashes with KEY_SALT
• per-IP / per-key rate limits
• rewrites Bearer→static, sets x-consumer-key
• usage tracking via Analytics Engine"] + end + + subgraph GCPEdge["GCP Edge"] + LB["External HTTPS Load Balancer
Google-managed SSL cert · HTTPS-only
HTTP→HTTPS redirect · Global static IP"] + end + + subgraph Compute["Compute"] + CloudRun["Cloud Run Service
relayer container · autoscaling 2..N instances
health: /api/v1/health · VPC connector for Redis"] + end + + subgraph State["Data plane"] + Redis[("Memorystore Redis
STANDARD_HA failover")] + PubSub[("Pub/Sub — 8 topics + subs")] + Secrets[("Secret Manager
4 secrets")] + end + + subgraph Signing["Signing"] + KMS["Cloud KMS
ED25519 keyring"] + end + + Stellar([Stellar RPC
Soroban + Horizon]) + GAR[(Artifact Registry
remote repo → ECR Public)] + + Callers --> Worker + Worker -->|"Bearer = static-key
x-consumer-key = user-key"| LB + LB --> CloudRun + CloudRun --> Redis + CloudRun --> PubSub + CloudRun --> Secrets + CloudRun --> KMS + CloudRun --> Stellar + GAR -.->|image pull| CloudRun +``` + +| Component | GCP Service | Purpose | +| --- | --- | --- | +| Edge gateway | Cloudflare Worker + KV (optional) | API-key issuance, rate limiting, usage tracking | +| Load balancer | External HTTPS LB + Google-managed cert | TLS termination, health-checked routing | +| Compute | Cloud Run v2 | Runs the relayer container with autoscaling | +| State | Memorystore Redis 7.2 | Transaction records, sequence counters, distributed locks | +| Queue | 8 Pub/Sub topics + subscriptions | Distributed transaction processing | +| Secrets | Secret Manager | API keys, admin secrets, encryption keys | +| Signing | Cloud KMS (EC_SIGN_ED25519) | Transaction signing for fund + channel accounts | +| Image registry | Artifact Registry (remote repo) | Proxies ECR Public image for Cloud Run | +| Networking | VPC + VPC Connector + Private Service Access | Private connectivity to Memorystore | + +### 1.2. App Architecture (Channels Plugin Runtime) + +```mermaid +flowchart TD + Client([API Client]) + + subgraph Relayer["Relayer API (openzeppelin-relayer)"] + Auth["Bearer auth (API_KEY from Secret Manager)
+ rate-limit middleware
+ route to plugin"] + end + + subgraph Plugin["Channels Plugin Runtime"] + Pipeline["Submission pipeline
1. Validation: auth entries, payload, scheme
2. ChannelPool: acquire a channel relayer
3. Build + Simulate: assemble Soroban tx
4. Sign + FeeBump: channel signs, fund FeeBumps
5. Submit + Wait: POST to RPC, poll status"] + Mgmt["Management API
setChannelAccounts / listChannelAccounts
setFeeLimit / getFeeUsage / getFeeLimit"] + end + + Redis[("Memorystore
state + deferred jobs")] + PubSub[("Pub/Sub
jobs")] + Accts[("Fund acct
+ channel accts
(Cloud KMS-backed)")] + Stellar([Stellar RPC]) + + Client -->|"POST /api/v1/plugins/channels/call
body: { params: { xdr } } OR { params: { func, auth } }"| Auth + Auth --> Pipeline + Auth --> Mgmt + Pipeline <--> Redis + Pipeline <--> PubSub + Mgmt <--> Redis + Pipeline -->|sign| Accts + Accts -->|signed envelope| Stellar + Pipeline -->|submit + poll| Stellar +``` + +### 1.3. Transaction Lifecycle + +```mermaid +sequenceDiagram + autonumber + actor Caller + participant CF as CF Worker + participant LB as HTTPS LB + participant API as Relayer API + participant Plugin as Channels Plugin + participant Redis as Memorystore + participant PS as Pub/Sub + participant KMS as Cloud KMS + participant RPC as Soroban RPC + + Caller->>CF: POST / · Bearer user-key + CF->>CF: hash + KV lookup + scope check + CF->>LB: rewrite Bearer→static-key, set x-consumer-key + LB->>API: TLS terminate · forward + API->>Plugin: route /plugins/channels/call + Plugin->>Redis: check fee budget + Plugin->>Redis: persist tx record + Plugin->>PS: publish transaction-request + Plugin-->>Caller: 202 Accepted + tx_id + + rect rgba(200, 220, 255, 0.4) + Note over Plugin,RPC: Async worker pickup + Plugin->>Redis: acquire channel account + Plugin->>RPC: build + simulate tx + RPC-->>Plugin: assembled envelope + Plugin->>KMS: sign w/ channel signer + KMS-->>Plugin: signature + Plugin->>KMS: fee-bump w/ fund signer + KMS-->>Plugin: fee-bumped envelope + Plugin->>RPC: submit signed envelope + Plugin->>PS: publish status-check-stellar + + loop until confirmed or expired + Plugin->>RPC: GET tx by hash + RPC-->>Plugin: pending / confirmed + end + + Plugin->>Redis: update tx record → confirmed + end +``` + +### 1.4. How Pub/Sub Queues Work + +Eight topics with pull subscriptions handle the transaction pipeline. Pub/Sub has no native delayed delivery, so deferred jobs (retries with backoff) sit in Redis sorted sets until due, then get published to the topic. The topic only ever carries ready-to-process jobs; no dead-letter topics needed. 
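
The deferral pattern described above can be sketched as follows. This is a simplified in-memory Python model, not the relayer's implementation: a heap ordered by due time stands in for the Redis sorted set, the `publish` callback stands in for a Pub/Sub publish call, and the exponential backoff parameters are assumed values.

```python
import heapq
from typing import Any, Callable, List, Tuple


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with a ceiling (base and cap are illustrative)."""
    return min(cap, base * (2 ** attempt))


class DeferredJobs:
    """Holds jobs until their due time, then hands them to a publisher."""

    def __init__(self) -> None:
        # Entries are (due_at, insertion_order, job); the counter breaks ties
        # so that job payloads themselves are never compared.
        self._heap: List[Tuple[float, int, Any]] = []
        self._counter = 0

    def defer(self, job: Any, due_at: float) -> None:
        """Park a job until `due_at` (a timestamp in seconds)."""
        heapq.heappush(self._heap, (due_at, self._counter, job))
        self._counter += 1

    def sweep_due(self, now: float, publish: Callable[[Any], None]) -> int:
        """Publish every job whose due time has passed; return how many."""
        published = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, job = heapq.heappop(self._heap)
            publish(job)
            published += 1
        return published


# Usage: two retries are parked; only the one that is due gets published.
sent: List[Any] = []
deferred = DeferredJobs()
deferred.defer({"tx": "a", "attempt": 1}, due_at=10.0)
deferred.defer({"tx": "b", "attempt": 2}, due_at=20.0)
first_sweep = deferred.sweep_due(now=15.0, publish=sent.append)
```

The point of the design is that the topic never needs to know about delays: a job that is not yet due simply is not published, so subscribers only ever pull work they can process immediately.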
+ +```mermaid +flowchart TD + subgraph Producers["Producers"] + APIReq[API request] + WorkerCb[Worker callback] + DueSweep[Redis due-sweep] + end + + subgraph Topics["8 Pub/Sub topics + subscriptions"] + Q1["transaction-request"] + Q2["transaction-submission"] + Q3["status-check"] + Q4["status-check-evm"] + Q5["status-check-stellar"] + Q6["notification"] + Q7["token-swap-request"] + Q8["relayer-health-check"] + end + + Workers["Cloud Run instances
One worker pool per queue type"] + DeferredQ[("Redis sorted sets
Deferred jobs with backoff")] + + Producers --> Topics + Topics -->|pull + ack| Workers + Workers -. retry with backoff .-> DeferredQ + DeferredQ -. publish when due .-> Topics +``` + +### 1.5. How Channels Works on Stellar + +Every Stellar transaction has a source account with a monotonically increasing sequence number. Only one transaction per source account can be in-flight at a time; this is the constraint that caps parallel throughput on Stellar. + +The Channels service works around it with a pool of dedicated source accounts (channel accounts). Each in-flight transaction acquires one channel account from the pool, uses its sequence number, and releases it after confirmation. A separate fund account holds the XLM balance. When submitting, the service wraps the channel-signed envelope in a fee-bump transaction, a Stellar primitive that lets a second account pay the network fee. Both accounts are backed by Cloud KMS ED25519 keys. + +The pool size you provision in [section 4.10](#410-bootstrap-channels) is your throughput ceiling. See [section 12.1](#121-channel-pool-exhaustion) for the sizing formula. + +### 1.6. Resource Sizing + +Module defaults work for getting started. Operators are advised to bump them as traffic grows. + +| Resource | Module default (prod) | Current GCP deployment | +| --- | --- | --- | +| CPU | 1 vCPU | 4 vCPU | +| Memory | 2 Gi | 8 Gi | +| Min instances | 2 | 3 | +| Max instances | 10 | 20 | +| Redis tier | STANDARD_HA | STANDARD_HA | +| Redis memory | 5 GB | 5 GB | + +The module auto-adjusts sizing by environment (`prod` vs everything else): + +| Setting | prod | other | +|---------|------|-------| +| Min instances | 2 | 1 | +| Max instances | 10 | 4 | +| CPU always allocated | yes | no | +| Redis tier | STANDARD_HA | BASIC | +| Redis memory | 5 GB | 1 GB | +| LB deletion protection | on | off | +| Log retention | 30 days | 7 days | + +--- + +## 2. Prerequisites + +Gather everything in this section before running `terraform apply`. 
Missing any item will block either the initial deployment or the post-deploy bootstrap steps. + +### 2.1. Accounts and Access + +- **GCP project** with billing enabled and permission to create Cloud Run services, Memorystore instances, Pub/Sub topics and subscriptions, Secret Manager secrets, Cloud KMS keyrings and keys, Compute Engine load balancers, VPC connectors, Artifact Registry repositories, and IAM role bindings. +- **Service account** for Terraform with these roles: `editor`, `resourcemanager.projectIamAdmin`, `compute.networkAdmin`, `cloudkms.admin`, `pubsub.admin`, `secretmanager.admin`, `run.admin`, `artifactregistry.admin` +- **Domain** with DNS access (Route53, Cloud DNS, or other) +- (Optional) **Cloudflare account** for the `/gen` API-key flow + +### 2.2. Tooling + +| Tool | Version | Why | +| --- | --- | --- | +| Terraform | 1.5.0 or later | Module language constraints | +| Google provider | 5.0 or later, below 7.0 | Pinned in `versions.tf` | +| Cloudflare provider | ~> 5.0 | Required even when `enable_cloudflare = false` | +| gcloud CLI | recent stable | Auth, Artifact Registry, debugging | +| Node.js 18+ and pnpm 10+ | recent stable | Only if you modify the Channels plugin | + +### 2.3. Stellar-Side Prerequisites + +- **Soroban RPC access:** at least two independent private providers from different operators recommended for mainnet. The public image ships with the default public RPC; you override it after deployment (see [section 4.7](#47-dns-and-ssl)). +- **XLM** to fund the relayer's Stellar account and bootstrap channel accounts. Budget at least 250 XLM for 200 channel accounts plus the fund account. + +### 2.4. Repos You'll Reference + +| Repo | What it is | +| --- | --- | +| `OpenZeppelin/relayer-channels-infra` | This repo: Terraform modules + operator CLIs | +| `OpenZeppelin/openzeppelin-relayer` | The relayer application | +| `OpenZeppelin/relayer-plugin-channels` | Channels plugin (TypeScript) | + +--- + +## 3. 
Environments + +Run stg and prod as separate Terraform workspaces with isolated state: + +| Env | Network | Working directory | Pub/Sub prefix | VPC connector CIDR | +| --- | --- | --- | --- | --- | +| `stg` | testnet | `examples/gcp/` | `relayer-testnet-stg-` | `10.8.0.0/28` | +| `prod` | mainnet | `examples/gcp-prod/` | `relayer-mainnet-prod-` | `10.9.0.0/28` | + +Use different CIDRs if both environments share a VPC. Resource names auto-suffix with `-` except for `prod`. + +--- + +## 4. Deployment + +Work through the steps below in order on a fresh deployment. Each step produces output or configuration that later steps depend on. + +### 4.1. Authenticate + +```bash +export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json" +``` + +If your org blocks `gcloud auth application-default login`, create a service account key in IAM & Admin > Service Accounts > Keys. + +### 4.2. Get the Module + +Reference it directly from GitHub: + +```hcl +module "relayer_channels" { + source = "git::https://github.com/OpenZeppelin/relayer-channels-infra.git//modules/gcp?ref=main" + # ... +} +``` + +Or clone and use the examples: + +```bash +git clone https://github.com/OpenZeppelin/relayer-channels-infra.git +cd relayer-channels-infra/examples/gcp # stg +cd relayer-channels-infra/examples/gcp-prod # prod +``` + +### 4.3. Configure the Terraform Backend + +In `versions.tf`, configure remote state: + +```hcl +terraform { + backend "gcs" { + bucket = "your-org-terraform-state" + prefix = "relayer-channels/prod.tfstate" + } +} +``` + +### 4.4. 
Create Your Tfvars + +```bash +cp terraform.tfvars.example terraform.tfvars +``` + +Minimum config: + +```hcl +project_id = "my-gcp-project" +region = "us-east1" +environment = "prod" +network = "default" +subnetwork = "default" +domain_name = "channels.your-company.com" +stellar_network = "mainnet" +queue_backend = "pubsub" +container_image = "us-east1-docker.pkg.dev/my-project/ecr-public/w5h5k2p1/openzeppelin-relayer-channels:mainnet-latest" + +# Secrets: never commit these. Leave them out of terraform.tfvars and set them via TF_VAR_*; +# a value assigned in terraform.tfvars (even "") takes precedence over the environment variable. +# relayer_api_key -> TF_VAR_relayer_api_key +# channels_admin_secret -> TF_VAR_channels_admin_secret +# storage_encryption_key -> TF_VAR_storage_encryption_key +``` + +Generate secrets: + +```bash +export TF_VAR_relayer_api_key="$(uuidgen | tr '[:upper:]' '[:lower:]')" +export TF_VAR_channels_admin_secret="$(openssl rand -base64 32)" +export TF_VAR_webhook_signing_key="$(openssl rand -hex 32)" +export TF_VAR_storage_encryption_key="$(openssl rand -base64 32)" # must be base64, not hex +``` + +### 4.5. Set Up Artifact Registry + +Cloud Run can't pull from ECR Public directly. Set up a remote repo to proxy it: + +1. GCP Console > **Artifact Registry** > **Create Repository** +2. Format: **Docker**, Mode: **Remote**, Source: **Custom**, URL: `https://public.ecr.aws` +3. Name it `ecr-public` and pick your region + +Then reference it in `container_image` in your tfvars (as shown in [section 4.4](#44-create-your-tfvars)). + +Tag scheme: `mainnet-` (pinned, use in prod), `mainnet-latest` (moves), `testnet-`, `testnet-latest`. + + +The public image ships with `mainnet.sorobanrpc.com` as the default RPC. Override it with private providers after deployment (see [section 4.8](#48-override-rpc-endpoints)). + + +### 4.6. Deploy + +```bash +terraform init +terraform plan -out plan.tfplan +terraform apply plan.tfplan +``` + +The apply takes 10–15 minutes. Memorystore creation is the slowest part. 
+ +Key outputs: + +| Output | Used for | +| --- | --- | +| `load_balancer_ip` | DNS record creation | +| `cloud_run_service_name` | Service management | +| `kms_signing_key_id` | Signer creation | +| `artifact_registry_url` | Image pull path | + +### 4.7. DNS and SSL + +The Google-managed cert needs DNS pointing at the LB IP before it provisions. + +**Without Cloudflare:** +1. Create an A record: `channels.your-company.com` → `` +2. Wait 15–60 min for cert to go ACTIVE + +**With Cloudflare:** +1. Create Cloudflare A record → LB IP (proxy OFF, grey cloud) +2. Create Route53 A record → LB IP +3. Wait for cert to go ACTIVE +4. Change Route53 to CNAME → `channels.your-company.com.cdn.cloudflare.net` +5. Turn Cloudflare proxy ON (orange cloud) + + +If the cert stays `FAILED_NOT_VISIBLE` for 30+ min, bump the cert name suffix in `load-balancer.tf` (e.g. `-cert-v2` → `-cert-v3`) and re-apply. `create_before_destroy` swaps it without downtime. + + +### 4.8. Override RPC Endpoints + +The public image uses the free public Soroban RPC, which rate-limits under load. After the service is healthy, override it with your private providers. This is a **one-time call** (the config persists in Redis). + +```bash +curl -s \ + -H "Authorization: Bearer " \ + -H "Content-Type: application/json" \ + -X PATCH https://channels.your-company.com/api/v1/networks/stellar:mainnet \ + -d '{ + "rpc_urls": [ + { "url": "https://your-primary-rpc.com/key", "weight": 100 }, + { "url": "https://your-secondary-rpc.com/key", "weight": 100 } + ] + }' +``` + +Verify: + +```bash +curl -s -H "Authorization: Bearer " \ + "https://channels.your-company.com/api/v1/networks?per_page=200" \ + | jq '.data[] | select(.id=="stellar:mainnet") | .rpc_urls' +``` + +Use at least two independent providers from different operators. The relayer load-balances by weight and rotates on failure. + + +You only need to re-run this after a `RESET_STORAGE_ON_START=true` restart, which wipes Redis. Normal restarts preserve it. 
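Before sending the PATCH, it can help to confirm that the provider list really spans more than one host. The following is a minimal sketch in plain Bash and is not part of the relayer or its CLIs: `rpc_host` and `check_rpc_urls` are hypothetical helper names introduced here for illustration. A distinct hostname is only a rough proxy for "different operators", since two hostnames can still belong to the same provider.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for the rpc_urls list (illustrative only).

# Print the lowercase host portion of a URL: scheme, path, and port removed.
rpc_host() {
  local rest="${1#*://}"    # drop the scheme, e.g. "https://"
  rest="${rest%%/*}"        # drop the path (which often carries the API key)
  printf '%s\n' "${rest%%:*}" | tr '[:upper:]' '[:lower:]'   # drop an optional :port
}

# Succeed only if the arguments contain at least two distinct hosts.
check_rpc_urls() {
  local url count
  count=$(for url in "$@"; do rpc_host "$url"; done | sort -u | wc -l)
  [ "$((count))" -ge 2 ]
}

# Example with the placeholder URLs from the PATCH request in this section.
if check_rpc_urls "https://your-primary-rpc.com/key" "https://your-secondary-rpc.com/key"; then
  echo "ok: at least two distinct RPC hosts"
else
  echo "error: need at least two distinct RPC hosts" >&2
fi
```

Only hostnames are compared, so two keys on the same provider count as a single host and the check fails. It does not contact the providers or verify that they are healthy.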
+ + +### 4.9. Create the Signer + +```bash +ENV=mainnet API_KEY="$TF_VAR_relayer_api_key" \ +GCP_SA_KEY_FILE="$HOME/path/to/sa-key.json" \ +./scripts/gcp-kms-signer.sh +``` + +Then create the fund relayer via the relayer API: + +```bash +curl -s -X POST https://channels.your-company.com/api/v1/relayers \ + -H "Authorization: Bearer $TF_VAR_relayer_api_key" \ + -H "Content-Type: application/json" \ + -d '{ + "id": "channels-fund", + "name": "channels-fund", + "network": "mainnet", + "signer_id": "", + "network_type": "stellar", + "paused": false, + "policies": { "min_balance": 0, "fee_payment_strategy": "relayer" } + }' +``` + +### 4.10. Bootstrap Channels + + +Size the pool before bootstrapping. Formula: `min_pool = ceil(target_TPS × avg_settlement_seconds × 1.5)`. Stellar settlement averages 5–7 seconds. At 23 TPS sustained with 5-second settlement, that gives ceil(23 × 5 × 1.5) = 173 channels minimum; at 7 seconds it rises to 242. Use `--dry-run` to preview before committing. + + +Install the CLI from `cli/` in this repo: + +```bash +cd cli && bun install && bun run build +cd packages/oz-channels && bun link +cd ../oz-relayer && bun link +``` + +Set up a profile and bootstrap: + +```bash +oz-channels profile init prod-mainnet + +oz-channels bootstrap --to 200 --dry-run -p prod-mainnet # preview +oz-channels bootstrap --to 200 -p prod-mainnet # provision +``` + +#### 4.10.1. Scaling Beyond ~100 Channels + +When scaling the pool aggressively (e.g. 100 → 1000 channels), `oz-channels bootstrap` will start failing with `TRY_AGAIN_LATER` or `tx_bad_seq` errors from Horizon. This happens because every `createAccount` operation uses the fund relayer (`channels-fund`) as the transaction source, serializing all submissions on a single sequence number. Under high concurrency, Horizon rejects the overlapping submissions. + +Use `scripts/fund-new-channels.ts` instead. It routes the transaction source through an existing funded channel account (e.g. `channel-0001`) while keeping the fund relayer as the operation source (so the treasury still pays). 
It also batches up to 100 `createAccount` ops per transaction, so a 100→1000 scale-up (900 new accounts) fits in 9 submissions. + +```bash +npx tsx scripts/fund-new-channels.ts \ + --env mainnet \ + --api-key \ + --source-relayer channel-0001 \ + --fund-relayer channels-fund \ + --from 101 --to 1000 \ + --starting-balance 2 \ + --report fund-report.json +``` + +The script is idempotent: it preflights every slot via the relayer API and Horizon and skips any account already funded onchain, so it is safe to re-run. + +### 4.11. Verify + +```bash +curl -sS https://channels.your-company.com/api/v1/health +oz-channels smoke run -p prod-mainnet +``` + +A healthy service returns `{"status":"ok"}`. The smoke test submits a test transaction end-to-end and polls for confirmation; success prints a confirmed transaction ID. If it times out, check channel pool size and fund account balance before debugging further. + +--- + +## 5. Configuration Reference + +Most environment variables are managed by the Terraform module and should not be overridden without a specific reason. The tables below document what the module sets automatically and which values operators should tune for production scale. + +### 5.1. Module-Managed Container Environment Variables + +The Terraform module sets these. Do not override them unless you have a specific reason. 
+ +| Env var | Set to | Source | +| --- | --- | --- | +| `HOST` | `0.0.0.0` | Module | +| `STELLAR_NETWORK` | `var.stellar_network` | Module | +| `FUND_RELAYER_ID` | `var.fund_relayer_id` | Module | +| `API_KEY_HEADER` | `x-consumer-key` | Module | +| `REPOSITORY_STORAGE_TYPE` | `redis` | Module | +| `RESET_STORAGE_ON_START` | `false` | Module | +| `METRICS_ENABLED` | `true` | Module | +| `METRICS_PORT` | `8081` | Module | +| `LOG_FORMAT` | `json` | Module | +| `LOG_LEVEL` | `var.log_level` | Module | +| `REDIS_URL` | `redis://:` | Module | +| `REDIS_READER_URL` | `redis://:` | Module | +| `GCP_PROJECT_ID` | `var.project_id` | Module | +| `GCP_REGION` | `var.region` | Module | +| `DISTRIBUTED_MODE` | `var.distributed_mode` | Module | +| `QUEUE_BACKEND` | `var.queue_backend` | Module | +| `PUBSUB_TOPIC_PREFIX` | `relayer-{network}-{environment}` | Module | +| `PUBSUB_PROJECT_ID` | `var.project_id` | Module | + +### 5.2. Module-Managed Secrets + +| Container env var | Secret Manager ID | Required? | Notes | +| --- | --- | --- | --- | +| `API_KEY` | `{app_name}-relayer-api-key` | Yes | Authenticates all API requests | +| `PLUGIN_ADMIN_SECRET` | `{app_name}-channels-admin-secret` | Yes | Required for channel management | +| `WEBHOOK_SIGNING_KEY` | `{app_name}-webhook-signing-key` | Optional | Only set if using webhook notifications | +| `STORAGE_ENCRYPTION_KEY` | `{app_name}-storage-encryption-key` | Optional | Encrypts data at rest in Redis. Strongly recommended for prod. Must be base64-encoded 32 bytes. | + +Rotation procedure: + +```bash +echo -n "new-value" | gcloud secrets versions add \ + relayer-channels-relayer-api-key --data-file=- \ + --project=your-project + +gcloud run services update relayer-channels-service \ + --region=us-east1 --project=your-project \ + --update-labels="redeploy=$(date +%s)" +``` + +### 5.3. 
Production Reference Values + +If you are targeting OpenZeppelin's reference scale (~2M+ tx/day), these are the env vars to tune: + +```hcl +container_environment = [ + { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY", value = "200" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY", value = "1" }, + { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY", value = "1" }, + { name = "RELAYER_CONCURRENCY_LIMIT", value = "800" }, + { name = "PLUGIN_MAX_CONCURRENCY", value = "8000" }, + { name = "MAX_CONNECTIONS", value = "4000" }, + { name = "REQUEST_TIMEOUT_SECONDS", value = "60" }, + { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" }, + { name = "PLUGIN_GLOBAL_TIMEOUT_MS", value = "55000" }, + { name = "PLUGIN_POLLING_TIMEOUT_MS", value = "45000" }, + { name = "RATE_LIMIT_REQUESTS_PER_SECOND", value = "400" }, + { name = "REDIS_POOL_MAX_SIZE", value = "3000" }, + { name = "REDIS_READER_POOL_MAX_SIZE", value = "3000" }, + { name = "TRANSACTION_EXPIRATION_HOURS", value = "0.1" }, + { name = "LIMITED_CONTRACTS", value = "C,C" }, + { name = "CONTRACT_CAPACITY_RATIO", value = "0.6" }, +] +``` + +--- + +## 6. Cloudflare (Optional) + +When enabled, a Cloudflare Worker handles API-key issuance (`/gen`), per-key rate limiting, and proxies requests to the LB with static-key injection. 
+ +```hcl +enable_cloudflare = true +cloudflare_api_token = "your-token" +cloudflare_zone_id = "your-zone-id" +cloudflare_account_id = "your-account-id" +relayer_static_api_key = "same-as-your-relayer_api_key" +key_salt = "" +cf_analytics_api_token = "your-token" +``` + +`relayer_static_api_key` should match your `relayer_api_key`; the Worker swaps every user's Bearer token for this key upstream. `key_salt` is used to hash user keys before storing in KV. + +### 6.1. Without Cloudflare + +The `/gen` endpoint is not available; there's no self-service API-key generation. Callers authenticate directly with the `relayer_api_key` you configured. If you need per-user keys or rate limiting without Cloudflare, build that into your own API gateway layer in front of the load balancer. + +--- + +## 7. Operations + +Routine operations follow the same `terraform apply` workflow as the initial deployment. Stellar-specific operations (managing the channel pool, inspecting transactions) use the CLIs in `cli/`. + +### 7.1. Deploys + +To deploy a new version, update `container_image` in your tfvars and run `terraform apply`. Cloud Run creates a new revision and shifts traffic over automatically with no downtime. + +### 7.2. Rollbacks + +To roll back, set `container_image` to the previous version tag in your tfvars and run `terraform apply`. + +### 7.3. Scaling + +```hcl +cpu = "4" +memory = "8Gi" +min_instance_count = 3 +max_instance_count = 20 +``` + +Run `terraform apply` to pick up the new limits. Cloud Run handles the transition without downtime. + +### 7.4. Channel Pool + +```bash +oz-channels bootstrap --from 201 --to 400 -p prod-mainnet # grow the pool +oz-channels channels list -p prod-mainnet +oz-channels channels add channel-0050 -p prod-mainnet +oz-channels channels remove channel-0050 -p prod-mainnet +``` + +### 7.5. 
Transactions + +```bash +oz-relayer tx show -r channels-fund -p prod-mainnet --json +oz-relayer tx list -r channels-fund --status pending -p prod-mainnet +oz-relayer relayer balance channels-fund -p prod-mainnet +``` + +--- + +## 8. Observability + +The service emits structured JSON logs to Cloud Logging, Cloud Run request metrics, and Pub/Sub queue metrics. Set up the log-based metrics and alerting policies below before putting the service under production load. + +### 8.1. Logs + +Cloud Run streams structured JSON logs to Cloud Logging. + +```bash +# Errors in the last hour +gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' \ + --project=your-project --limit=20 --freshness=1h --format='value(textPayload)' + +# Filter by tx ID +gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:""' \ + --project=your-project --limit=20 --freshness=1h + +# Live tail +gcloud logging tail 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service"' \ + --project=your-project +``` + +### 8.2. Cloud Run Metrics + +Console > Cloud Run > Service > Metrics: + +| Metric | Signal | +| --- | --- | +| `container/cpu/utilization` | >80% sustained → scale up | +| `container/memory/utilization` | >70% → risk of OOM | +| `request_count` by status | 5xx spikes | +| `request_latencies` | p95/p99 degradation | +| `container/instance_count` | autoscaling behavior | + +### 8.3. Pub/Sub Metrics + +Console > Pub/Sub > Subscription > Metrics: + +| Metric | Signal | +| --- | --- | +| `num_undelivered_messages` | growing backlog → falling behind | +| `oldest_unacked_message_age` | >60s → workers stuck | +| `pull_message_operation_count` | confirms workers are active | + +### 8.4. 
Memorystore Metrics + +Console > Memorystore > Instance > Monitoring: + +| Metric | Signal | +| --- | --- | +| CPU utilization | >75% sustained | +| Memory usage ratio | >70% | +| Connected clients | near limit | + +### 8.5. Log-Based Metrics + +Create in Cloud Logging > Log-based Metrics > Create Metric: + +| Metric name | Filter | Purpose | +| --- | --- | --- | +| `relayer/errors` | `severity>=ERROR` | Total error rate | +| `relayer/pool_capacity` | `textPayload:"POOL_CAPACITY"` | Pool exhaustion events | +| `relayer/provider_paused` | `textPayload:"provider paused"` | RPC failover events | + +### 8.6. Alerting + +Key alert policies to set up in Cloud Monitoring > Alerting: + +| Alert | Condition | Severity | +| --- | --- | --- | +| High error rate | >50 errors in 5 min | Critical | +| Cloud Run high CPU | >80% for 10 min | Warning | +| Cloud Run high memory | >70% for 10 min | Warning | +| Pub/Sub backlog | >5000 messages for 10 min | Warning | +| Pub/Sub old messages | >300s for 5 min | Critical | +| Pool exhaustion | `POOL_CAPACITY` log > 0 in 5 min | Critical | + +### 8.7. Prometheus + +The relayer exposes metrics at `:8081/debug/metrics/scrape`. Scrape with Google Cloud Managed Prometheus or your own Prometheus instance. + +### 8.8. Stellar-Side Monitoring + +GCP metrics reflect service health. These check the Stellar network side; monitor both. + +**Fund account balance:** + +```bash +oz-relayer relayer balance channels-fund -p prod-mainnet +``` + +Alert when balance drops below 50 XLM. A depleted fund account fails all fee-bumps silently. + +**Ledger close time:** Stellar closes a ledger roughly every 5 seconds normally. Sustained close times above 10 seconds indicate network stress and inflate settlement latency beyond your pool sizing assumptions. 
+ +```bash +curl -sS "https://horizon.stellar.org/ledgers?order=desc&limit=5" | jq '._embedded.records[] | {sequence, closed_at}' +``` + +**`TRY_AGAIN_LATER` in logs:** Horizon is rejecting transactions due to fee competition. Raise `MAX_FEE` (see [section 12.7](#127-fee-bump-tuning-under-congestion)). If it appears alongside `provider paused`, check RPC provider health first. + +**RPC provider health:** + +```bash +curl -sS -X POST \ + -H 'Content-Type: application/json' \ + -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq . +``` + +--- + +## 9. Debugging + +Almost every failure belongs to a specific layer. Identify the layer first, then pull the logs for that component. + +A request that never returns a `tx_id` failed in the synchronous path (edge, LB, auth, fee budget, enqueue). A request that returned a `tx_id` but never confirmed failed in the async path (channel acquisition, build/simulate, sign, fee-bump, submit, status poll). Match the symptom to the layer, then pull the logs for that component. + +Pool exhaustion, sequence drift, and an RPC throttle all look like "transactions are failing" from the outside; each lives in a different layer and has a different fix. + +| You have | Do this | +| --- | --- | +| Transaction ID | `oz-relayer tx show -r channels-fund --json -p ` | +| Error message | Search Cloud Logging: `textPayload:""` | +| "What's broken right now" | `gcloud logging read ... AND severity>=ERROR` | +| Stellar tx hash | Check Horizon, then find the relayer tx record | + +Common log patterns: + +| Pattern | Means | +| --- | --- | +| `provider paused` | RPC failover kicked in | +| `POOL_CAPACITY` | Channel pool exhausted; bootstrap more | +| `LOCKED_CONFLICT` | Two workers grabbed the same channel | +| `TRY_AGAIN_LATER` | Horizon throttling | + +### 9.1. Redis Inspection + +Connect from a VM in the same VPC: + +```bash +redis-cli -h -p 6379 +KEYS *tx:* +GET "oz-relayer:relayer:channels-fund:tx:" +``` + +--- + +## 10. 
Security + +This section documents the security posture of the deployed infrastructure. Review it before go-live and consult it when rotating credentials or adjusting network ingress rules. + +### 10.1. Secrets + +All secrets are stored in Secret Manager, passed as env vars to Cloud Run. See Known Issues for the plan to switch to `secret_key_ref` references. + +### 10.2. Network Isolation + +- **Cloud Run ingress:** `INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER` in prod; `INGRESS_TRAFFIC_ALL` for testing. +- **Cloud Run egress:** VPC Connector with `PRIVATE_RANGES_ONLY`. Private traffic goes through the VPC (to Memorystore); public traffic (Stellar RPC, KMS API) goes direct. +- **Memorystore:** Private Service Access only, no public IP. +- **Pub/Sub:** IAM-scoped per topic/subscription. + +### 10.3. IAM + +The Cloud Run SA (`{app_name}-run`) gets: + +| Role | Scope | +| --- | --- | +| `secretmanager.secretAccessor` | per-secret | +| `monitoring.metricWriter` | project | +| `logging.logWriter` | project | +| `monitoring.viewer` | project | +| `cloudkms.signerVerifier` | per-key | +| `cloudkms.publicKeyViewer` | per-key | +| `pubsub.publisher` | per-topic | +| `pubsub.subscriber` | per-subscription | +| `artifactregistry.reader` | per-repo | + +### 10.4. TLS + +- **Load balancer:** Google-managed SSL cert, HTTPS on 443, HTTP redirects to HTTPS. +- **Memorystore:** transit encryption is disabled (see Known Issues). Private Service Access provides network-level isolation. +- **Cloudflare to LB:** set the Cloudflare zone SSL mode to "Full" for end-to-end TLS. + +### 10.5. Cloud KMS + +`EC_SIGN_ED25519`, SOFTWARE protection. Rotation: provision a new key, register a new signer and relayer, fund the new onchain account, drain the old one, retire it. + +--- + +## 11. Post-Restart Checklist + +If you ever restart with `RESET_STORAGE_ON_START=true` (which wipes Redis), you need to redo the following (the service will be up but non-functional until these are done): + +1. 
**Re-create the signer:** `./scripts/gcp-kms-signer.sh` ([section 4.9](#49-create-the-signer)) +2. **Re-create the fund relayer:** via the relayer API using the new signer ID +3. **Re-run the RPC override:** the PATCH to `/api/v1/networks/stellar:mainnet` ([section 4.8](#48-override-rpc-endpoints)) +4. **Re-bootstrap channels:** `oz-channels bootstrap --to -p ` ([section 4.10](#410-bootstrap-channels)) +5. **Fund the fund relayer:** if the onchain account was recreated, send XLM to the new address + +Normal restarts and redeployments (without `RESET_STORAGE_ON_START=true`) preserve everything in Redis; none of the above is needed. + +--- + +## 12. Gotchas + +Common deployment and operational pitfalls, with fixes. Check here first when something does not behave as expected. + +### 12.1. Channel Pool Exhaustion + +`min_pool = ceil(TPS × settlement_seconds × 1.5)`. At 23 TPS with 5s settlement: 173 channels minimum. Fix: `oz-channels bootstrap --from --to `. + +### 12.2. SSL Cert Provisioning + +Google needs DNS pointing at the LB IP before it issues the cert. With Cloudflare, turn proxy off first, wait for ACTIVE, then proxy back on. If the cert stays `FAILED_NOT_VISIBLE` for 30+ min, bump the cert name suffix in `load-balancer.tf` and re-apply (`create_before_destroy` swaps it without downtime). + +### 12.3. VPC Connector CIDR Overlap + +Each environment in the same VPC needs a different `/28` CIDR range (e.g. `10.8.0.0/28` for stg, `10.9.0.0/28` for prod). + +### 12.4. Private Service Access Shared Connection + +A VPC can hold only one Private Service Access connection to `servicenetworking.googleapis.com`. If stg creates it first, prod's apply will fail unless `update_on_creation_fail = true` is set on the connection resource. The module handles this. + +### 12.5. Pub/Sub Topic Prefix + +`PUBSUB_TOPIC_PREFIX` must match what the image expects. Double-dash errors (`relayer-mainnet-prod--`) mean the prefix has a trailing dash the image doesn't expect. 
Adjust via `container_environment` if needed. + +### 12.6. Encryption Key Format + +`storage_encryption_key` must be base64-encoded 32 bytes (`openssl rand -base64 32`). Hex keys fail silently with "Invalid key length: expected 32 bytes, got 0". + +### 12.7. Fee-Bump Tuning Under Congestion + +`MAX_FEE` defaults to 1M stroops (0.1 XLM). Raise it to 10M stroops (1 XLM) during network congestion. The plugin uses static fees with no automatic bumping on `INSUFFICIENT_FEE`. + +--- + +## 13. Variables + +Full variable reference for the Terraform module. Required variables must be set in your tfvars file or, for sensitive values, via `TF_VAR_*` environment variables; optional variables have defaults that the module adjusts automatically based on the environment value. + +### 13.1. Required + +| Name | Type | Description | +|------|------|-------------| +| `project_id` | `string` | GCP project ID | +| `region` | `string` | GCP region | +| `environment` | `string` | `prod`, `stg`, etc. (1–16 chars) | +| `network` | `string` | VPC network name or self_link | +| `subnetwork` | `string` | Subnet name or self_link | +| `domain_name` | `string` | FQDN for the service | +| `container_image` | `string` | Container image URI | +| `relayer_api_key` | `string` | Relayer API key (sensitive) | +| `channels_admin_secret` | `string` | Admin secret (sensitive) | + +### 13.2. Optional: Core + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `app_name` | `string` | `"relayer-channels"` | Resource name prefix | +| `name_suffix_environment` | `bool` | `true` | Append `-{env}` to names (auto-off for prod) | +| `labels` | `map(string)` | `{}` | Labels for all resources | + +### 13.3. 
Optional: Networking + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `connector_machine_type` | `string` | `"e2-micro"` | VPC connector machine type | +| `connector_min_instances` | `number` | `2` | Min connector instances | +| `connector_max_instances` | `number` | `3` | Max connector instances | +| `connector_ip_cidr_range` | `string` | `"10.8.0.0/28"` | CIDR for the VPC connector (/28, must not overlap) | + +### 13.4. Optional: Container / Cloud Run + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `container_port` | `number` | `8080` | Listen port | +| `cpu` | `string` | `"1"` | CPU (`"1"`, `"2"`, `"4"`) | +| `memory` | `string` | `"2Gi"` | Memory | +| `min_instance_count` | `number` | `null` | Auto: 2 prod, 1 other | +| `max_instance_count` | `number` | `null` | Auto: 10 prod, 4 other | +| `cpu_always_allocated` | `bool` | `null` | Auto: true prod | +| `health_check_path` | `string` | `"/api/v1/health"` | Probe path | +| `container_environment` | `list(object)` | `[]` | Additional env vars (user overrides win) | + +### 13.5. Optional: Application + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `stellar_network` | `string` | `"testnet"` | `mainnet` or `testnet` | +| `fund_relayer_id` | `string` | `"channels-fund"` | Fund relayer ID | +| `distributed_mode` | `bool` | `true` | Enable distributed queue processing | +| `queue_backend` | `string` | `"pubsub"` | `pubsub` (recommended) or `redis` | +| `log_level` | `string` | `"warn"` | App log level | + +### 13.6. Optional: Secrets + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `webhook_signing_key` | `string` | `""` | Only set if using webhooks | +| `storage_encryption_key` | `string` | `""` | Base64-encoded 32 bytes. Recommended for prod. | + +### 13.7. 
Optional: Redis + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `redis_tier` | `string` | `null` | `BASIC` or `STANDARD_HA` (auto per env) | +| `redis_memory_size_gb` | `number` | `null` | Auto: 5 prod, 1 other | +| `redis_version` | `string` | `"REDIS_7_2"` | Redis version | + +### 13.8. Optional: Cloudflare + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `enable_cloudflare` | `bool` | `false` | Enable Workers gateway | +| `cloudflare_zone_id` | `string` | `""` | Required when Cloudflare is enabled | +| `cloudflare_account_id` | `string` | `""` | Required when Cloudflare is enabled | +| `relayer_static_api_key` | `string` | `""` | Static key injected by the Worker (sensitive) | +| `key_salt` | `string` | `""` | Salt for hashing user keys in KV (sensitive) | +| `gen_ip_rate_hour` | `number` | `2` | Max `/gen` per IP per hour | +| `relay_rpm_per_key` | `number` | `60` | Max relay RPM per key | + +### 13.9. Optional: Load Balancer + +| Name | Type | Default | Description | +|------|------|---------|-------------| +| `lb_deletion_protection` | `bool` | `null` | Auto: true prod | +| `lb_log_sample_rate` | `number` | `0` | Request log sampling (0 disables) | + +See `variables.tf` for the full list including Cloud Functions and additional networking options. + +--- + +## 14. Outputs + +The module exposes these outputs for use in downstream Terraform modules or post-deployment scripts. 
+ +| Name | Description | +|------|-------------| +| `cloud_run_service_name` / `cloud_run_service_uri` | Service name and URL | +| `load_balancer_ip` | Static IP for DNS | +| `redis_host` / `redis_port` / `redis_read_endpoint` | Memorystore connection | +| `pubsub_topics` / `pubsub_subscriptions` | Queue resource names | +| `kms_key_ring_name` / `kms_signing_key_name` / `kms_signing_key_id` | Cloud KMS key info | +| `artifact_registry_repository` / `artifact_registry_url` | Artifact Registry info | +| `secret_ids` | Secret Manager IDs | +| `cloudflare_worker_name` | Worker name (null if disabled) | + +--- + +## 15. Known Issues + +**Redis TLS disabled:** the relayer binary doesn't support TLS for Redis connections. Memorystore is only reachable via Private Service Access (VPC peering), so traffic stays within Google's network. + +**Secrets as plain env vars:** secrets are passed as Cloud Run env vars rather than Secret Manager `secret_key_ref` references. This is a workaround for a deployment issue. Plan to switch to proper secret references. 
diff --git a/src/navigation/stellar.json b/src/navigation/stellar.json index 8b64502c..0508bee4 100644 --- a/src/navigation/stellar.json +++ b/src/navigation/stellar.json @@ -520,12 +520,12 @@ { "type": "page", "name": "AWS Operator Deployment Guide", - "url": "/relayer/1.5.x/guides/stellar-relayer-aws-operator-guide" + "url": "/relayer/guides/stellar-relayer-aws-operator-guide" }, { "type": "page", "name": "GCP Operator Deployment Guide", - "url": "/relayer/1.5.x/guides/stellar-relayer-gcp-operator-guide" + "url": "/relayer/guides/stellar-relayer-gcp-operator-guide" } ] }, From 969eedc9fe442e7ac1ce402bc1ba762b9af16c96 Mon Sep 17 00:00:00 2001 From: stevep0z <255929980+stevep0z@users.noreply.github.com> Date: Fri, 26 Jun 2026 20:07:42 -0500 Subject: [PATCH 4/4] docs: replace Telegram contact links with openzeppelin.com/contact --- content/monitor/1.0.x/contribution.mdx | 4 ++-- content/monitor/1.0.x/index.mdx | 2 +- content/monitor/1.1.x/contribution.mdx | 4 ++-- content/monitor/1.1.x/index.mdx | 2 +- content/monitor/1.2.x/contribution.mdx | 4 ++-- content/monitor/1.2.x/index.mdx | 2 +- content/monitor/1.3.x/contribution.mdx | 4 ++-- content/monitor/1.3.x/index.mdx | 2 +- content/monitor/contribution.mdx | 4 ++-- content/monitor/index.mdx | 2 +- content/relayer/1.0.x/index.mdx | 2 +- content/relayer/1.0.x/solana.mdx | 2 +- content/relayer/1.1.x/evm.mdx | 2 +- content/relayer/1.1.x/index.mdx | 2 +- content/relayer/1.1.x/solana.mdx | 2 +- content/relayer/1.1.x/stellar.mdx | 2 +- content/relayer/1.2.x/evm.mdx | 2 +- content/relayer/1.2.x/index.mdx | 2 +- content/relayer/1.2.x/solana.mdx | 2 +- content/relayer/1.2.x/stellar.mdx | 2 +- content/relayer/1.3.x/evm.mdx | 2 +- content/relayer/1.3.x/index.mdx | 2 +- content/relayer/1.3.x/solana.mdx | 2 +- content/relayer/1.3.x/stellar.mdx | 2 +- content/relayer/1.4.x/evm.mdx | 2 +- content/relayer/1.4.x/index.mdx | 2 +- content/relayer/1.4.x/solana.mdx | 2 +- content/relayer/1.4.x/stellar.mdx | 2 +- 
content/relayer/1.4.x/zama-fhevm.mdx | 2 +- content/relayer/1.5.x/evm.mdx | 2 +- content/relayer/1.5.x/index.mdx | 2 +- content/relayer/1.5.x/solana.mdx | 2 +- content/relayer/1.5.x/stellar.mdx | 2 +- content/relayer/1.5.x/zama-fhevm.mdx | 2 +- content/relayer/evm.mdx | 2 +- content/relayer/index.mdx | 2 +- content/relayer/solana.mdx | 2 +- content/relayer/stellar.mdx | 2 +- content/relayer/zama-fhevm.mdx | 2 +- 39 files changed, 44 insertions(+), 44 deletions(-) diff --git a/content/monitor/1.0.x/contribution.mdx b/content/monitor/1.0.x/contribution.mdx index 5cbd528b..6305dd9a 100644 --- a/content/monitor/1.0.x/contribution.mdx +++ b/content/monitor/1.0.x/contribution.mdx @@ -267,7 +267,7 @@ Reviewers should focus on: If your PR isn’t getting attention: -* Contact the team on [Telegram](https://t.me/openzeppelin_tg/4) +* Open a discussion on [GitHub Discussions](https://github.com/OpenZeppelin/openzeppelin-monitor/discussions) * Ensure your PR has appropriate labels * Keep PRs focused and reasonably sized @@ -293,7 +293,7 @@ Contributors must follow the [Code of Conduct](https://github.com/OpenZeppelin/o * ***GitHub Discussions***: For questions and community interaction * ***Issues***: For bug reports and feature requests -* ***Telegram***: [Join our community chat](https://t.me/openzeppelin_tg/4) +* ***Contact***: [Get in touch via our website](https://www.openzeppelin.com/contact) * ***Good First Issues***: [Find beginner-friendly issues](https://github.com/openzeppelin/openzeppelin-monitor/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue) ### Additional Resources diff --git a/content/monitor/1.0.x/index.mdx b/content/monitor/1.0.x/index.mdx index 690766ad..21ccb81c 100644 --- a/content/monitor/1.0.x/index.mdx +++ b/content/monitor/1.0.x/index.mdx @@ -1427,7 +1427,7 @@ The monitor implements a comprehensive error handling system with rich context a ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/4). 
+For support or inquiries, [contact us](https://www.openzeppelin.com/contact). Have feature requests or want to contribute? Join our community on [GitHub](https://github.com/OpenZeppelin/openzeppelin-monitor/) diff --git a/content/monitor/1.1.x/contribution.mdx b/content/monitor/1.1.x/contribution.mdx index 429f5a9d..471ec9bb 100644 --- a/content/monitor/1.1.x/contribution.mdx +++ b/content/monitor/1.1.x/contribution.mdx @@ -299,7 +299,7 @@ Reviewers should focus on: If your PR isn’t getting attention: -* Contact the team on [Telegram](https://t.me/openzeppelin_tg/4) +* Open a discussion on [GitHub Discussions](https://github.com/OpenZeppelin/openzeppelin-monitor/discussions) * Ensure your PR has appropriate labels * Keep PRs focused and reasonably sized @@ -325,7 +325,7 @@ Contributors must follow the [Code of Conduct](https://github.com/OpenZeppelin/o * ***GitHub Discussions***: For questions and community interaction * ***Issues***: For bug reports and feature requests -* ***Telegram***: [Join our community chat](https://t.me/openzeppelin_tg/4) +* ***Contact***: [Get in touch via our website](https://www.openzeppelin.com/contact) * ***Good First Issues***: [Find beginner-friendly issues](https://github.com/openzeppelin/openzeppelin-monitor/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue) ### Additional Resources diff --git a/content/monitor/1.1.x/index.mdx b/content/monitor/1.1.x/index.mdx index 3709460d..8d29347c 100644 --- a/content/monitor/1.1.x/index.mdx +++ b/content/monitor/1.1.x/index.mdx @@ -1522,7 +1522,7 @@ The monitor implements a comprehensive error handling system with rich context a ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/4). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). Have feature requests or want to contribute? 
Join our community on [GitHub](https://github.com/OpenZeppelin/openzeppelin-monitor/) diff --git a/content/monitor/1.2.x/contribution.mdx b/content/monitor/1.2.x/contribution.mdx index abaa3936..f3592d2c 100644 --- a/content/monitor/1.2.x/contribution.mdx +++ b/content/monitor/1.2.x/contribution.mdx @@ -310,7 +310,7 @@ Reviewers should focus on: If your PR isn’t getting attention: -* Contact the team on [Telegram](https://t.me/openzeppelin_tg/4) +* Open a discussion on [GitHub Discussions](https://github.com/OpenZeppelin/openzeppelin-monitor/discussions) * Ensure your PR has appropriate labels * Keep PRs focused and reasonably sized @@ -336,7 +336,7 @@ Contributors must follow the [Code of Conduct](https://github.com/OpenZeppelin/o * ***GitHub Discussions***: For questions and community interaction * ***Issues***: For bug reports and feature requests -* ***Telegram***: [Join our community chat](https://t.me/openzeppelin_tg/4) +* ***Contact***: [Get in touch via our website](https://www.openzeppelin.com/contact) * ***Good First Issues***: [Find beginner-friendly issues](https://github.com/openzeppelin/openzeppelin-monitor/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue) ### Additional Resources diff --git a/content/monitor/1.2.x/index.mdx b/content/monitor/1.2.x/index.mdx index 7bab4a1b..370e486e 100644 --- a/content/monitor/1.2.x/index.mdx +++ b/content/monitor/1.2.x/index.mdx @@ -1771,7 +1771,7 @@ The monitor implements a comprehensive error handling system with rich context a ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/4). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). Have feature requests or want to contribute? 
Join our community on [GitHub](https://github.com/OpenZeppelin/openzeppelin-monitor/) diff --git a/content/monitor/1.3.x/contribution.mdx b/content/monitor/1.3.x/contribution.mdx index abaa3936..f3592d2c 100644 --- a/content/monitor/1.3.x/contribution.mdx +++ b/content/monitor/1.3.x/contribution.mdx @@ -310,7 +310,7 @@ Reviewers should focus on: If your PR isn’t getting attention: -* Contact the team on [Telegram](https://t.me/openzeppelin_tg/4) +* Open a discussion on [GitHub Discussions](https://github.com/OpenZeppelin/openzeppelin-monitor/discussions) * Ensure your PR has appropriate labels * Keep PRs focused and reasonably sized @@ -336,7 +336,7 @@ Contributors must follow the [Code of Conduct](https://github.com/OpenZeppelin/o * ***GitHub Discussions***: For questions and community interaction * ***Issues***: For bug reports and feature requests -* ***Telegram***: [Join our community chat](https://t.me/openzeppelin_tg/4) +* ***Contact***: [Get in touch via our website](https://www.openzeppelin.com/contact) * ***Good First Issues***: [Find beginner-friendly issues](https://github.com/openzeppelin/openzeppelin-monitor/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue) ### Additional Resources diff --git a/content/monitor/1.3.x/index.mdx b/content/monitor/1.3.x/index.mdx index 7bab4a1b..370e486e 100644 --- a/content/monitor/1.3.x/index.mdx +++ b/content/monitor/1.3.x/index.mdx @@ -1771,7 +1771,7 @@ The monitor implements a comprehensive error handling system with rich context a ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/4). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). Have feature requests or want to contribute? 
Join our community on [GitHub](https://github.com/OpenZeppelin/openzeppelin-monitor/) diff --git a/content/monitor/contribution.mdx b/content/monitor/contribution.mdx index abaa3936..f3592d2c 100644 --- a/content/monitor/contribution.mdx +++ b/content/monitor/contribution.mdx @@ -310,7 +310,7 @@ Reviewers should focus on: If your PR isn’t getting attention: -* Contact the team on [Telegram](https://t.me/openzeppelin_tg/4) +* Open a discussion on [GitHub Discussions](https://github.com/OpenZeppelin/openzeppelin-monitor/discussions) * Ensure your PR has appropriate labels * Keep PRs focused and reasonably sized @@ -336,7 +336,7 @@ Contributors must follow the [Code of Conduct](https://github.com/OpenZeppelin/o * ***GitHub Discussions***: For questions and community interaction * ***Issues***: For bug reports and feature requests -* ***Telegram***: [Join our community chat](https://t.me/openzeppelin_tg/4) +* ***Contact***: [Get in touch via our website](https://www.openzeppelin.com/contact) * ***Good First Issues***: [Find beginner-friendly issues](https://github.com/openzeppelin/openzeppelin-monitor/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue) ### Additional Resources diff --git a/content/monitor/index.mdx b/content/monitor/index.mdx index 7bab4a1b..370e486e 100644 --- a/content/monitor/index.mdx +++ b/content/monitor/index.mdx @@ -1771,7 +1771,7 @@ The monitor implements a comprehensive error handling system with rich context a ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/4). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). Have feature requests or want to contribute? 
Join our community on [GitHub](https://github.com/OpenZeppelin/openzeppelin-monitor/) diff --git a/content/relayer/1.0.x/index.mdx b/content/relayer/1.0.x/index.mdx index d3cb442a..43510199 100644 --- a/content/relayer/1.0.x/index.mdx +++ b/content/relayer/1.0.x/index.mdx @@ -327,7 +327,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. diff --git a/content/relayer/1.0.x/solana.mdx b/content/relayer/1.0.x/solana.mdx index dc84d013..cbc05c41 100644 --- a/content/relayer/1.0.x/solana.mdx +++ b/content/relayer/1.0.x/solana.mdx @@ -207,7 +207,7 @@ See [API Reference](/relayer/1.0.x/api_reference) and [SDK examples, window=_bla ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). 
## License diff --git a/content/relayer/1.1.x/evm.mdx b/content/relayer/1.1.x/evm.mdx index 36b326a7..fa221ea8 100644 --- a/content/relayer/1.1.x/evm.mdx +++ b/content/relayer/1.1.x/evm.mdx @@ -297,7 +297,7 @@ Enable metrics and monitor: For help with EVM integration: -* Join our [Telegram](https://t.me/openzeppelin_tg/2) community +* [Contact us](https://www.openzeppelin.com/contact) for general support * Open an issue on our [GitHub repository](https://github.com/OpenZeppelin/openzeppelin-relayer) * Check our [comprehensive documentation](https://docs.openzeppelin.com/relayer) diff --git a/content/relayer/1.1.x/index.mdx b/content/relayer/1.1.x/index.mdx index 50ebabe4..5839a455 100644 --- a/content/relayer/1.1.x/index.mdx +++ b/content/relayer/1.1.x/index.mdx @@ -326,7 +326,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. diff --git a/content/relayer/1.1.x/solana.mdx b/content/relayer/1.1.x/solana.mdx index aa73fc21..33c097bb 100644 --- a/content/relayer/1.1.x/solana.mdx +++ b/content/relayer/1.1.x/solana.mdx @@ -213,7 +213,7 @@ See [API Reference](/relayer/1.1.x/api) and [SDK examples](https://github.com/Op ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). 
## License diff --git a/content/relayer/1.1.x/stellar.mdx b/content/relayer/1.1.x/stellar.mdx index 1deab27e..c0ef7f2f 100644 --- a/content/relayer/1.1.x/stellar.mdx +++ b/content/relayer/1.1.x/stellar.mdx @@ -324,7 +324,7 @@ Soroban operations support different authorization modes: ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/1.2.x/evm.mdx b/content/relayer/1.2.x/evm.mdx index 03331441..7fef34ff 100644 --- a/content/relayer/1.2.x/evm.mdx +++ b/content/relayer/1.2.x/evm.mdx @@ -302,7 +302,7 @@ Enable metrics and monitor: For help with EVM integration: -* Join our [Telegram](https://t.me/openzeppelin_tg/2) community +* [Contact us](https://www.openzeppelin.com/contact) for general support * Open an issue on our [GitHub repository](https://github.com/OpenZeppelin/openzeppelin-relayer) * Check our [comprehensive documentation](https://docs.openzeppelin.com/relayer) diff --git a/content/relayer/1.2.x/index.mdx b/content/relayer/1.2.x/index.mdx index 83474db9..69859152 100644 --- a/content/relayer/1.2.x/index.mdx +++ b/content/relayer/1.2.x/index.mdx @@ -340,7 +340,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. 
diff --git a/content/relayer/1.2.x/solana.mdx b/content/relayer/1.2.x/solana.mdx index 249a6b68..bd317139 100644 --- a/content/relayer/1.2.x/solana.mdx +++ b/content/relayer/1.2.x/solana.mdx @@ -311,7 +311,7 @@ For complete REST API examples with both options, see the [SDK Solana examples]( ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/1.2.x/stellar.mdx b/content/relayer/1.2.x/stellar.mdx index 6eb1a101..a351fd8b 100644 --- a/content/relayer/1.2.x/stellar.mdx +++ b/content/relayer/1.2.x/stellar.mdx @@ -413,7 +413,7 @@ For complete examples: ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). 
## License diff --git a/content/relayer/1.3.x/evm.mdx b/content/relayer/1.3.x/evm.mdx index babe6ff1..09938f24 100644 --- a/content/relayer/1.3.x/evm.mdx +++ b/content/relayer/1.3.x/evm.mdx @@ -302,7 +302,7 @@ Enable metrics and monitor: For help with EVM integration: -* Join our [Telegram](https://t.me/openzeppelin_tg/2) community +* [Contact us](https://www.openzeppelin.com/contact) for general support * Open an issue on our [GitHub repository](https://github.com/OpenZeppelin/openzeppelin-relayer) * Check our [comprehensive documentation](https://docs.openzeppelin.com/relayer) diff --git a/content/relayer/1.3.x/index.mdx b/content/relayer/1.3.x/index.mdx index 48faeda5..4a596e40 100644 --- a/content/relayer/1.3.x/index.mdx +++ b/content/relayer/1.3.x/index.mdx @@ -341,7 +341,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. diff --git a/content/relayer/1.3.x/solana.mdx b/content/relayer/1.3.x/solana.mdx index f5fda8bd..f44a93e3 100644 --- a/content/relayer/1.3.x/solana.mdx +++ b/content/relayer/1.3.x/solana.mdx @@ -342,7 +342,7 @@ For complete REST API examples with both options, see the [SDK Solana examples]( ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). 
## License diff --git a/content/relayer/1.3.x/stellar.mdx b/content/relayer/1.3.x/stellar.mdx index cb87d9e7..710852d5 100644 --- a/content/relayer/1.3.x/stellar.mdx +++ b/content/relayer/1.3.x/stellar.mdx @@ -555,7 +555,7 @@ For complete examples: ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/1.4.x/evm.mdx b/content/relayer/1.4.x/evm.mdx index df49db77..f6416fec 100644 --- a/content/relayer/1.4.x/evm.mdx +++ b/content/relayer/1.4.x/evm.mdx @@ -313,7 +313,7 @@ Enable metrics and monitor: For help with EVM integration: -- Join our [Telegram](https://t.me/openzeppelin_tg/2) community +- [Contact us](https://www.openzeppelin.com/contact) for general support - Open an issue on our [GitHub repository](https://github.com/OpenZeppelin/openzeppelin-relayer) - Check our [comprehensive documentation](https://docs.openzeppelin.com/relayer) diff --git a/content/relayer/1.4.x/index.mdx b/content/relayer/1.4.x/index.mdx index 499f3448..f2dc09e5 100644 --- a/content/relayer/1.4.x/index.mdx +++ b/content/relayer/1.4.x/index.mdx @@ -341,7 +341,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. 
diff --git a/content/relayer/1.4.x/solana.mdx b/content/relayer/1.4.x/solana.mdx index e9f657cc..d3293e0e 100644 --- a/content/relayer/1.4.x/solana.mdx +++ b/content/relayer/1.4.x/solana.mdx @@ -342,7 +342,7 @@ For complete REST API examples with both options, see the [SDK Solana examples]( ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/1.4.x/stellar.mdx b/content/relayer/1.4.x/stellar.mdx index 10293d3d..f0631237 100644 --- a/content/relayer/1.4.x/stellar.mdx +++ b/content/relayer/1.4.x/stellar.mdx @@ -555,7 +555,7 @@ For complete examples: ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/1.4.x/zama-fhevm.mdx b/content/relayer/1.4.x/zama-fhevm.mdx index f68fbbbf..ee512f5a 100644 --- a/content/relayer/1.4.x/zama-fhevm.mdx +++ b/content/relayer/1.4.x/zama-fhevm.mdx @@ -152,4 +152,4 @@ The example contract is deployed from [Zama's fhevm-hardhat-template](https://gi ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). 
diff --git a/content/relayer/1.5.x/evm.mdx b/content/relayer/1.5.x/evm.mdx index df49db77..f6416fec 100644 --- a/content/relayer/1.5.x/evm.mdx +++ b/content/relayer/1.5.x/evm.mdx @@ -313,7 +313,7 @@ Enable metrics and monitor: For help with EVM integration: -- Join our [Telegram](https://t.me/openzeppelin_tg/2) community +- [Contact us](https://www.openzeppelin.com/contact) for general support - Open an issue on our [GitHub repository](https://github.com/OpenZeppelin/openzeppelin-relayer) - Check our [comprehensive documentation](https://docs.openzeppelin.com/relayer) diff --git a/content/relayer/1.5.x/index.mdx b/content/relayer/1.5.x/index.mdx index 499f3448..f2dc09e5 100644 --- a/content/relayer/1.5.x/index.mdx +++ b/content/relayer/1.5.x/index.mdx @@ -341,7 +341,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). +For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. diff --git a/content/relayer/1.5.x/solana.mdx b/content/relayer/1.5.x/solana.mdx index e9f657cc..d3293e0e 100644 --- a/content/relayer/1.5.x/solana.mdx +++ b/content/relayer/1.5.x/solana.mdx @@ -342,7 +342,7 @@ For complete REST API examples with both options, see the [SDK Solana examples]( ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). 
## License diff --git a/content/relayer/1.5.x/stellar.mdx b/content/relayer/1.5.x/stellar.mdx index 10293d3d..f0631237 100644 --- a/content/relayer/1.5.x/stellar.mdx +++ b/content/relayer/1.5.x/stellar.mdx @@ -555,7 +555,7 @@ For complete examples: ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/1.5.x/zama-fhevm.mdx b/content/relayer/1.5.x/zama-fhevm.mdx index f68fbbbf..ee512f5a 100644 --- a/content/relayer/1.5.x/zama-fhevm.mdx +++ b/content/relayer/1.5.x/zama-fhevm.mdx @@ -152,4 +152,4 @@ The example contract is deployed from [Zama's fhevm-hardhat-template](https://gi ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). diff --git a/content/relayer/evm.mdx b/content/relayer/evm.mdx index df49db77..f6416fec 100644 --- a/content/relayer/evm.mdx +++ b/content/relayer/evm.mdx @@ -313,7 +313,7 @@ Enable metrics and monitor: For help with EVM integration: -- Join our [Telegram](https://t.me/openzeppelin_tg/2) community +- [Contact us](https://www.openzeppelin.com/contact) for general support - Open an issue on our [GitHub repository](https://github.com/OpenZeppelin/openzeppelin-relayer) - Check our [comprehensive documentation](https://docs.openzeppelin.com/relayer) diff --git a/content/relayer/index.mdx b/content/relayer/index.mdx index 499f3448..f2dc09e5 100644 --- a/content/relayer/index.mdx +++ b/content/relayer/index.mdx @@ -341,7 +341,7 @@ The OpenZeppelin Relayer is designed to function as a backend service and is not ## Support -For support or inquiries, contact us on [Telegram](https://t.me/openzeppelin_tg/2). 
+For support or inquiries, [contact us](https://www.openzeppelin.com/contact). ## License This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. diff --git a/content/relayer/solana.mdx b/content/relayer/solana.mdx index e9f657cc..d3293e0e 100644 --- a/content/relayer/solana.mdx +++ b/content/relayer/solana.mdx @@ -342,7 +342,7 @@ For complete REST API examples with both options, see the [SDK Solana examples]( ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/stellar.mdx b/content/relayer/stellar.mdx index 10293d3d..f0631237 100644 --- a/content/relayer/stellar.mdx +++ b/content/relayer/stellar.mdx @@ -555,7 +555,7 @@ For complete examples: ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact). ## License diff --git a/content/relayer/zama-fhevm.mdx b/content/relayer/zama-fhevm.mdx index f68fbbbf..ee512f5a 100644 --- a/content/relayer/zama-fhevm.mdx +++ b/content/relayer/zama-fhevm.mdx @@ -152,4 +152,4 @@ The example contract is deployed from [Zama's fhevm-hardhat-template](https://gi ## Support -For help, join our [Telegram](https://t.me/openzeppelin_tg/2) or open an issue on GitHub. +For help, open an issue on [GitHub](https://github.com/OpenZeppelin/openzeppelin-relayer/issues) or [contact us](https://www.openzeppelin.com/contact).