Simulate Regional Outage

Note

This feature is not available for any of the following deployments:

  • M0 Free clusters

  • Flex clusters

To learn more, see Limits.

You can use the Atlas UI or API to simulate an outage on your Atlas multi-region cluster and observe how your application handles the loss of one or more regions. You can also run multiple simulations. When you run multiple simulations, we recommend that you wait five minutes between simulations.

To start an outage simulation, you must have Organization Owner or Project Owner access to the project.

When you submit a request to test an outage using the Atlas UI or API, Atlas simulates an outage event. During a simulated outage, Atlas:

  • Removes network connectivity to nodes in the selected regions.

  • Does not trigger a monitoring alert for Replica set has no primary.

  • Automatically ends the simulation after a configurable expiration period (1, 3, or 7 days).

If your application takes more than 15 minutes to notice connection loss to some nodes, we recommend that you reduce your TCP retransmission timeout values. To learn more, see modify tcp_retries2 value.
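The 15-minute figure lines up with the Linux default of tcp_retries2 = 15. The following is a minimal sketch, not an Atlas tool, that estimates the hypothetical timeout described in the Linux kernel documentation: a connection backs off exponentially from a minimum retransmission timeout, each wait is capped at a maximum, and the connection is dropped at the (N+1)th timeout. The constants (200 ms minimum, 120 s maximum) are assumed typical Linux values, and the function name is hypothetical.

```python
# Rough estimate of how long Linux keeps retransmitting on a dead TCP
# connection before giving up, as a function of net.ipv4.tcp_retries2.
RTO_MIN_SECONDS = 0.2    # assumed Linux TCP_RTO_MIN (200 ms)
RTO_MAX_SECONDS = 120.0  # assumed Linux TCP_RTO_MAX


def hypothetical_tcp_timeout(tcp_retries2: int) -> float:
    """Sum the first tcp_retries2 + 1 backoff intervals, each capped at RTO_MAX."""
    total = 0.0
    rto = RTO_MIN_SECONDS
    for _ in range(tcp_retries2 + 1):
        total += min(rto, RTO_MAX_SECONDS)
        rto *= 2  # exponential backoff
    return total


if __name__ == "__main__":
    for retries in (15, 8, 5):
        seconds = hypothetical_tcp_timeout(retries)
        print(f"tcp_retries2={retries}: about {seconds:.1f} s ({seconds / 60:.1f} min)")
```

With the default value of 15, the estimate is about 924.6 seconds, just over 15 minutes. Lower values shorten the time your application waits before it notices that a node is unreachable. Treat the result as a lower bound, because the real timeout depends on the measured round-trip time of each connection.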

To simulate a regional outage in the Atlas UI:

1
  1. If it's not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.

  2. If it's not already displayed, select your desired project from the Projects menu in the navigation bar.

  3. In the sidebar, click Clusters under the Database heading.

The Clusters page displays.

2
  1. For the cluster on which you want to perform outage testing, click the ... button.

  2. Click Test Resilience.

  3. Select Regional Outage. Atlas displays a Test Resilience modal with the steps Atlas takes to simulate an outage event. To learn more, see Simulate Regional Outage Process.

3
  1. Click Select Regions.

  2. Select the tab corresponding to the type of outage you want to simulate:

    Minority outage: Select fewer than half of your electable nodes.

    Majority outage: Select a majority (more than half) of your electable nodes, and keep at least one electable node remaining.

    After you select a majority of your electable nodes, your replica set won't have a primary node. This means that your replica set can't perform write operations, and it can serve only read operations that are configured with a suitable readPreference.

  3. (Optional) From the Simulation Duration dropdown, select the duration for the simulation to run before it automatically expires. This value defaults to 3 days.

  4. Select Simulate Regional Outage to begin the test.

    Atlas notifies you when the outage occurs.

4

The simulation ends automatically once the selected Simulation Duration elapses. You can also end the simulation manually before then.

Note

Atlas checks for expired simulations every 24 hours, so a simulation can take up to one additional day after its expiration date to fully resolve.
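The choice between the two outage types in step 3 comes down to whether the nodes that stay online still form a majority of your electable nodes. The following is a minimal sketch of that arithmetic, not part of any Atlas tooling. The function name and the region names are hypothetical. The sketch treats a selection that removes exactly half of the electable nodes as a majority outage, because a replica set needs a strict majority of voting members to elect a primary.

```python
def classify_outage(electable_by_region: dict, selected_regions: list) -> str:
    """Classify a region selection as a 'minority' or 'majority' outage.

    electable_by_region maps region name -> number of electable nodes.
    selected_regions lists the regions that the simulation takes offline.
    """
    total = sum(electable_by_region.values())
    lost = sum(electable_by_region[region] for region in selected_regions)
    remaining = total - lost

    if lost == 0:
        raise ValueError("select at least one region that has electable nodes")
    if remaining == 0:
        raise ValueError("keep at least one electable node remaining")

    majority_needed = total // 2 + 1  # strict majority of electable nodes
    if remaining >= majority_needed:
        return "minority"  # a primary can still be elected
    return "majority"      # no primary until electable nodes are regained


if __name__ == "__main__":
    # Hypothetical 2 + 2 + 1 topology.
    nodes = {"US_EAST_1": 2, "US_WEST_2": 2, "EU_WEST_1": 1}
    print(classify_outage(nodes, ["EU_WEST_1"]))               # minority
    print(classify_outage(nodes, ["US_EAST_1", "US_WEST_2"]))  # majority
```

Running a check like this before you start a simulation makes it clear in advance whether your application should expect writes to keep working during the test.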

Select a tab corresponding to the type of outage you are performing:

When you finish testing the outage, click End Simulation.

When you finish testing the regional outage, you can perform one of the following:

You can use the Test Outage API endpoint to simulate an outage event. To learn more about the outage process, see Simulate Regional Outage Process.
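The following is a minimal sketch of what such a request might look like. It only builds the request and does not send it. The endpoint path, the outageFilters body fields, and the versioned Accept header are assumptions based on the Atlas Administration API reference, so verify them against the current reference before use. The project ID, cluster name, and region are placeholders, and the function name is hypothetical. Authentication is omitted: a real call must be authenticated with credentials for a user who has the access described above.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Atlas Administration API (v2).
ATLAS_BASE_URL = "https://cloud.mongodb.com/api/atlas/v2"


def build_outage_request(group_id: str, cluster_name: str, regions: list) -> urllib.request.Request:
    """Build, but do not send, a request that starts an outage simulation.

    regions is a list of (cloud_provider, region_name) tuples.
    """
    url = (
        f"{ATLAS_BASE_URL}/groups/{urllib.parse.quote(group_id, safe='')}"
        f"/clusters/{urllib.parse.quote(cluster_name, safe='')}/outageSimulation"
    )
    # Assumed body shape: one filter per region to take offline.
    payload = {
        "outageFilters": [
            {"cloudProvider": provider, "regionName": region, "type": "REGION"}
            for provider, region in regions
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.atlas.2023-01-01+json",  # assumed version header
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder project ID, cluster name, and region.
    request = build_outage_request("<PROJECT-ID>", "<CLUSTER-NAME>", [("AWS", "US_EAST_1")])
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Keeping request construction separate from sending lets you review the exact regions in the payload before you trigger an outage on a real cluster.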

To verify that the outage is successful, monitor your application and ensure your read and write operations are working as expected.
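One way to monitor this is to run a small read probe and write probe against the cluster while the simulation is active. The following is a minimal, driver-agnostic sketch. The probe function is hypothetical, and the operations you pass to it are whatever read and write calls your application already makes.

```python
import time


def probe(operation, attempts: int = 3, delay_seconds: float = 0.0):
    """Call operation() up to `attempts` times and count the outcomes.

    Returns a (successes, failures) tuple. Any exception counts as a failure.
    """
    successes = 0
    failures = 0
    for _ in range(attempts):
        try:
            operation()
            successes += 1
        except Exception:
            failures += 1
        time.sleep(delay_seconds)
    return successes, failures


if __name__ == "__main__":
    # Stand-in operations. Replace them with your application's real
    # read and write calls against the cluster under test.
    def fake_read():
        return {"ok": 1}

    def fake_write():
        raise RuntimeError("no primary available")

    print("reads :", probe(fake_read))
    print("writes:", probe(fake_write))
```

During a minority outage you would expect both probes to keep succeeding after any failover completes. During a majority outage you would expect writes to fail and reads to succeed only if they use a read preference that allows secondary nodes.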

A regional outage or regional outage simulation that affects the highest priority regions in a sharded cluster could cause the cluster to become inoperable for read operations. To restore the config servers, do the following:

  • Configure a read preference that allows reads from secondary nodes.

  • Reconfigure the cluster to regain electable nodes.
