
Dual-region operational procedure

info

This procedure has been updated in the Camunda 8.6 release. The procedure used in Camunda 8.5 has been deprecated, and compatibility will be removed in the 8.7 release.

Introduction

This operational blueprint is a step-by-step guide to restoring operations after a total region failure. It explains how to temporarily restore functionality in the surviving region, and how to ultimately perform a full recovery to restore the dual-region setup. The procedure builds on top of the dual-region AWS setup guide, but is generally applicable to any dual-region setup.

Before proceeding with the operational procedure, thoroughly review and understand the contents of the dual-region concept page. This page outlines various limitations and requirements pertinent to the procedure, which are crucial for successful execution.

Disclaimer

danger

Running a dual-region setup requires users to detect regional failures themselves and to implement the necessary operational procedure for failover and failback, matching their environment. An example blueprint procedure is described below.

Prerequisites

  • A dual-region Camunda 8 setup installed in two different regions, preferably derived from our AWS dual-region guide.
  • zbctl to interact with the Zeebe cluster.

Terminology

  • Surviving region
    • A surviving region refers to a region within a dual-region setup that remains operational and unaffected by a failure or disaster that affects other regions.
  • Lost region
    • A lost region refers to a region within a dual-region setup that becomes unavailable or unusable due to a failure or disaster.
  • Recreated region
    • A recreated region refers to a region within a dual-region setup that was previously lost but has been restored or recreated to resume its operational state.
    • We assume this region contains no Camunda 8 deployments or related persistent volumes. Ensure this is the case before executing the failover procedure.

Procedure

We handle the loss of both active and passive regions using the same procedure. For clarity, this section focuses on the scenario where the passive region is lost while the active region remains operational.

Key Steps to Handle Passive Region Loss

  1. Traffic Rerouting: Reroute traffic to the surviving active region using DNS. (Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.)
  2. Temporary Loss Scenario: If the region loss is temporary (for example, due to network issues), Zeebe can survive the outage but will stop processing due to the loss of quorum. While processing is halted, the persistent disks may fill up, which can ultimately lead to data loss.
  3. Procedure Phases
    • Failover Phase: Temporarily restores Camunda 8 functionality by removing the lost brokers and handling the export to the unreachable Elasticsearch instance.
    • Failback Phase: Fully restores the failed region to its original functionality. This phase requires the region to be ready for the redeployment of Camunda 8.
danger

For the failback procedure, your recreated region cannot contain any active Camunda 8 deployments or leftover persistent volumes related to Camunda 8 or its Elasticsearch instance. You must start from a clean slate and not bring old data from the lost region, as states may have diverged.
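As a sketch of such a check (using the CLUSTER_RECREATED and CAMUNDA_NAMESPACE_RECREATED variables defined under the environment prerequisites below), you could list leftover Camunda 8 workloads and persistent volume claims in the recreated region; the output should be empty before you begin the failback:

# Sketch: the recreated region should contain no Camunda 8 workloads or persistent volume claims
kubectl --context $CLUSTER_RECREATED get deployments,statefulsets,pvc -n $CAMUNDA_NAMESPACE_RECREATED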

The following procedures build on top of the work done in the AWS setup guide about deploying Camunda 8 to two Kubernetes clusters in different regions. We assume you have your own copy of the c8-multi-region repository and have previously completed the changes in camunda-values.yml to adjust it to your setup.

Ensure you have followed deploy Camunda 8 to the clusters to have Camunda 8 installed and configured for a dual-region setup.

Environment prerequisites

Ensure you have followed environment prerequisites to have the general environment variables set up already.
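If you want to double-check, a quick way to verify that those variables are present in your shell (names taken from the environment prerequisites; adjust if yours differ) is:

# Quick sanity check: all of these should print non-empty values
echo "Clusters:   $CLUSTER_0 / $CLUSTER_1"
echo "Namespaces: $CAMUNDA_NAMESPACE_0 / $CAMUNDA_NAMESPACE_1"
echo "Helm:       $HELM_RELEASE_NAME ($HELM_CHART_VERSION)"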

To keep the instructions concise, we do not spell out both possible scenarios (losing either region 0 or region 1) at every step. Instead, the commands are generalized, and a one-time setup of environment variables tells them which region survived and which region is to be recreated.

Depending on which region you lost, export the corresponding environment variables to your terminal for a smoother procedure execution. The example below, matching the example outputs used throughout this procedure, assumes region 1 was lost and region 0 survives:

export CLUSTER_SURVIVING=$CLUSTER_0
export CLUSTER_RECREATED=$CLUSTER_1
export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_0
export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_1
export REGION_SURVIVING=region0
export REGION_RECREATED=region1
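
If instead region 0 was lost and region 1 survives, the assignment is mirrored:

export CLUSTER_SURVIVING=$CLUSTER_1
export CLUSTER_RECREATED=$CLUSTER_0
export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_1
export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_0
export REGION_SURVIVING=region1
export REGION_RECREATED=region0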

Failover

Remove lost brokers from Zeebe cluster in the surviving region

Current state

You have confirmed that a region is fully lost and want to start the temporary recovery.

One of the regions is lost, meaning for Zeebe:

  • No data has been lost, thanks to Zeebe data replication.
  • The cluster is unable to process new requests due to the loss of quorum.
  • Export of new data to Elasticsearch in the lost region has stopped.
  • Export of new data to Elasticsearch in the surviving region has stopped.

Desired state

The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with explicit intervention.

How to get there

You will port-forward the Zeebe Gateway in the surviving region to the local host to interact with the Gateway.

The following alternatives to port-forwarding are possible:

  • If the Zeebe Gateway is exposed externally, you can skip port-forwarding and use its URL directly.
  • You can exec into an existing pod (such as Elasticsearch) and run curl from there.
  • You can temporarily run an Ubuntu pod in the cluster and run curl from there.

In this example, we use port-forwarding to localhost, but any of the alternatives above works as well.
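
As an illustration of the last alternative, a throwaway pod can issue the same requests from inside the cluster. This is a sketch only; the service name is derived from the default naming of the Camunda Helm chart and may differ in your setup:

# Sketch: query the Zeebe Gateway management API from a temporary pod instead of port-forwarding
kubectl --context $CLUSTER_SURVIVING -n $CAMUNDA_NAMESPACE_SURVIVING run curl-tmp \
  --image=curlimages/curl --rm -it --restart=Never --command -- \
  curl -s "http://$HELM_RELEASE_NAME-zeebe-gateway:9600/actuator/cluster"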

  1. Use the zbctl client to retrieve the list of remaining brokers:
kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING
zbctl status --insecure --address localhost:26500
Example output
Cluster size: 8
Partitions count: 8
Replication factor: 4
Gateway version: 8.6.0
Brokers:
  Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 1 : Leader, Healthy
    Partition 6 : Follower, Healthy
    Partition 7 : Follower, Healthy
    Partition 8 : Follower, Healthy
  Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 1 : Follower, Healthy
    Partition 2 : Follower, Healthy
    Partition 3 : Follower, Healthy
    Partition 8 : Leader, Healthy
  Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 2 : Follower, Healthy
    Partition 3 : Leader, Healthy
    Partition 4 : Follower, Healthy
    Partition 5 : Follower, Healthy
  Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 4 : Follower, Healthy
    Partition 5 : Follower, Healthy
    Partition 6 : Follower, Healthy
    Partition 7 : Leader, Healthy
  2. Port-forward the service of the Zeebe Gateway to access the management REST API:
kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
  3. Based on the Cluster Scaling APIs, send a request to the Zeebe Gateway to redistribute the load to the remaining brokers, thereby removing the lost brokers. In this example, region 1 was lost and with it the odd-numbered brokers, so the load is redistributed to the remaining even-numbered brokers:
curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' -H 'Content-Type: application/json' -d '["0", "2", "4", "6"]'
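
If region 0 had been lost instead, the even-numbered brokers would be gone, and the equivalent request would list the remaining odd-numbered broker IDs:

curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' -H 'Content-Type: application/json' -d '["1", "3", "5", "7"]'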

Verification

Port-forwarding the Zeebe Gateway via kubectl and printing the topology should reveal that the cluster size has decreased to 4, partitions have been redistributed over the remaining brokers, and new leaders have been elected.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING
zbctl status --insecure --address localhost:26500
Example output
Cluster size: 4
Partitions count: 8
Replication factor: 2
Gateway version: 8.6.0
Brokers:
  Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 1 : Leader, Healthy
    Partition 6 : Leader, Healthy
    Partition 7 : Follower, Healthy
    Partition 8 : Follower, Healthy
  Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 1 : Follower, Healthy
    Partition 2 : Leader, Healthy
    Partition 3 : Follower, Healthy
    Partition 8 : Leader, Healthy
  Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 2 : Follower, Healthy
    Partition 3 : Leader, Healthy
    Partition 4 : Follower, Healthy
    Partition 5 : Follower, Healthy
  Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 4 : Leader, Healthy
    Partition 5 : Leader, Healthy
    Partition 6 : Follower, Healthy
    Partition 7 : Leader, Healthy

You can also use the Zeebe Gateway's REST API to confirm that the scaling operation has completed. For better readability of the output, it is recommended to use jq.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange
Example output
{
  "id": 2,
  "status": "COMPLETED",
  "startedAt": "2024-08-23T11:33:08.355681311Z",
  "completedAt": "2024-08-23T11:33:09.170531963Z"
}
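
If you prefer not to re-run the command manually, a small loop (a sketch, assuming the port-forward from the previous step is still active) can poll the same endpoint until the last change reports COMPLETED:

# Sketch: poll the Zeebe Gateway management API until the last topology change is COMPLETED
while [ "$(curl -s 'http://localhost:9600/actuator/cluster' | jq -r '.lastChange.status')" != "COMPLETED" ]; do
  echo "Topology change still in progress, retrying in 5 seconds..."
  sleep 5
done
echo "Topology change completed."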

Failback

Deploy Camunda 8 in the newly created region

Current state

  • Camunda 8: A standalone region with a fully functional Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch.
  • Operate and Tasklist: Operational in the standalone region.

Desired state

  • Camunda 8: Dual-region functionality is restored by deploying Camunda 8 (Zeebe and Elasticsearch) to the newly restored region.
  • Operate and Tasklist: Remain disabled to avoid interference during the database backup and restore process.

How to get there

From your initial dual-region deployment, your base Helm values file camunda-values.yml in aws/dual-region/kubernetes should still be present.

In particular, the values ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL and ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL should point to their respective regions. The placeholder in ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS should contain the Zeebe endpoints of both regions, as generated by the aws/dual-region/scripts/generate_zeebe_helm_values.sh script.
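
A quick way to sanity-check those values before installing (run from the aws/dual-region/kubernetes directory; the exact YAML layout depends on your values file) is to grep for them:

# Sketch: confirm the exporter URLs and initial contact points in the base Helm values file
grep -E -A 1 'ELASTICSEARCHREGION[01]_ARGS_URL|INITIALCONTACTPOINTS' camunda-values.yml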

In addition, the following Helm command disables Operate and Tasklist, which will only be enabled again at the end of the full region restore. They must remain disabled in the newly created region because of their Elasticsearch importers.

From the aws/dual-region/kubernetes directory, execute:

helm install $HELM_RELEASE_NAME camunda/camunda-platform \
--version $HELM_CHART_VERSION \
--kube-context $CLUSTER_RECREATED \
--namespace $CAMUNDA_NAMESPACE_RECREATED \
-f camunda-values.yml \
-f $REGION_RECREATED/camunda-values.yml \
--set operate.enabled=false \
--set tasklist.enabled=false
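
Once the install command returns, you can optionally confirm that the release exists in the recreated region before moving on to the verification below:

# Optional check: the release should be listed as deployed in the recreated region
helm list --kube-context $CLUSTER_RECREATED -n $CAMUNDA_NAMESPACE_RECREATED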

Verification

The following command will show the deployed pods of the newly created region.

Depending on your chosen clusterSize, half of the total number of Zeebe brokers should be spawned in the newly created region.

For example, with clusterSize: 8, you will find four Zeebe brokers in the newly created region.

danger

It is expected that the Zeebe broker pods don't become ready as they're not yet part of a Zeebe cluster, therefore not considered healthy by the Kubernetes readiness probe.

kubectl --context $CLUSTER_RECREATED get pods -n $CAMUNDA_NAMESPACE_RECREATED
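
To count just the broker pods, you could use a label selector (a sketch; the label value assumes the defaults applied by the Camunda Helm chart and may need adjusting for your setup):

# Sketch: count Zeebe broker pods in the recreated region; expect clusterSize / 2 (for example, 4 for clusterSize: 8)
kubectl --context $CLUSTER_RECREATED get pods -n $CAMUNDA_NAMESPACE_RECREATED \
  -l app.kubernetes.io/component=zeebe-broker --no-headers | wc -l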

Port-forwarding the Zeebe Gateway via kubectl and printing the topology should reveal that the new Zeebe brokers are recognized but are not yet full members of the Zeebe cluster.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING
zbctl status --insecure --address localhost:26500
Example Output
Cluster size: 4
Partitions count: 8
Replication factor: 2
Gateway version: 8.6.0
Brokers:
  Broker 0 - camunda-zeebe-0.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 1 : Leader, Healthy
    Partition 6 : Follower, Healthy
    Partition 7 : Follower, Healthy
    Partition 8 : Leader, Healthy
  Broker 1 - camunda-zeebe-0.camunda-zeebe.camunda-paris.svc:26501
    Version: 8.6.0
  Broker 2 - camunda-zeebe-1.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 1 : Follower, Healthy
    Partition 2 : Leader, Healthy
    Partition 3 : Leader, Healthy
    Partition 8 : Follower, Healthy
  Broker 3 - camunda-zeebe-1.camunda-zeebe.camunda-paris.svc:26501
    Version: 8.6.0
  Broker 4 - camunda-zeebe-2.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 2 : Follower, Healthy
    Partition 3 : Follower, Healthy
    Partition 4 : Leader, Healthy
    Partition 5 : Leader, Healthy
  Broker 5 - camunda-zeebe-2.camunda-zeebe.camunda-paris.svc:26501
    Version: 8.6.0
  Broker 6 - camunda-zeebe-3.camunda-zeebe.camunda-london.svc:26501
    Version: 8.6.0
    Partition 4 : Follower, Healthy
    Partition 5 : Follower, Healthy
    Partition 6 : Leader, Healthy
    Partition 7 : Leader, Healthy
  Broker 7 - camunda-zeebe-3.camunda-zeebe.camunda-paris.svc:26501
    Version: 8.6.0