Multi-Node Testing Distributed Akka.NET Applications

One of the most powerful testing features of Akka.NET is its ability to create and simulate real-world network conditions such as latency, network partitions, process crashes, and more. Given that any of these can happen in a production environment it's important to be able to write tests which validate your application's ability to correctly recover.

This is precisely what the Multi-Node TestKit and TestRunner (MNTR) does in Akka.NET.

MNTR Components

The Akka.NET Multi-Node TestKit consists of the following publicly available NuGet packages:

Akka.MultiNodeTestRunner - the test runner that can launch a multi-node testing environment;
Akka.Remote.TestKit - the base package used to create multi-node tests; and
Akka.Cluster.TestKit - a set of test helper methods for Akka.Cluster applications built on top of the Akka.Remote.TestKit.

How the MNTR Works

The MultiNodeTestRunner works via the following process:

Consumes a .DLL that has Akka.Remote.TestKit or Akka.Cluster.TestKit classes contained inside it;
For each detected multi-node test class, read that tests' configuration and build a corresponding network;
Run the test, including assertions, process barriers, and logging;
Provide a PASS/FAIL signal for each node participating in the test;
If any of the nodes failed, mark the entire test as failed; and
Write all of the output for each test and for each individual node in that test into its own output folder for review.

Akka.NET MultiNodeTestRunner Execution

Given this architecture, let's wade into how to actually write and run a test using the MultiNodeTestRunner.

How to Write a Multi-Node Test

The first step in writing a multi-node test is to install either the Akka.Remote.TestKit or the Akka.Cluster.TestKit NuGet package into your application. If you're working with Akka.Cluster, always just use the Akka.Cluster.TestKit package.

Next, you need to plan out your test scenario for your application. Here's a simple one from the Akka.NET codebase itself we can use as an example:

Form a two-node cluster, "Seed1" and "Seed2" as the role names, using both nodes as seed nodes;
Restart the first seed node ("Seed1") process; and
Verify that the restarted "Seed1" node is able to rejoin the same cluster as "Seed2."

So this sounds like a rather complicated procedure, but in actuality the MNTR makes it easy for us to test scenarios just like these.

Step 1 - Create a Test Configuration

The first step in creating an effective multi-node test is to define the configuration class for this test - this is going to tell the MNTR how many nodes there will need to be, how each node should be configured, and what features should be enabled for this unit test.

public class RestartNode2SpecConfig : MultiNodeConfig
{
    public RoleName Seed1 { get; }

    public RoleName Seed2 { get; }

    public RestartNode2SpecConfig()
    {
        Seed1 = Role("seed1");
        Seed2 = Role("seed2");

        CommonConfig = DebugConfig(true)
            .WithFallback(ConfigurationFactory.ParseString(@"
              akka.cluster.auto-down-unreachable-after = 2s
              akka.cluster.retry-unsuccessful-join-after = 3s
              akka.remote.retry-gate-closed-for = 45s
              akka.remote.log-remote-lifecycle-events = INFO
            "))
            .WithFallback(MultiNodeClusterSpec.ClusterConfig());
    }
}

The declaration of the RoleName properties is what the MNTR uses to determine how many nodes will be participating in this test. In this example, the test will create exactly two test processes.

The CommonConfig element of the MultiNodeConfig implementation class is the common config that will be used throughout all of the nodes inside the multi-node test. So, for instance, if you want all of the nodes in your test to run Akka.Cluster.Sharding you'd want to include those configuration elements inside the CommonConfig property.

Configuring Individual Nodes Differently

In addition to passing a CommonConfig object throughout all nodes in your multi-node test, you can also provide configurations for individual nodes during each test.

For example: if you're taking advantage of the akka.cluster.roles property to have some nodes execute different workloads than others, this might be something you'd want to specify for nodes individually.

The NodeConfig method allows you to do just that:

NodeConfig(
    new List<RoleName> { First }, 
    new List<Config> { ConfigurationFactory.ParseString("akka.cluster.roles =[""a"", ""c""]") });
NodeConfig(
    new List<RoleName> { Second, Third }, 
    new List<Config> { ConfigurationFactory.ParseString("akka.cluster.roles =[""b"", ""c""]") });

Right after setting CommonConfig inside the constructor of your MultiNodeConfig class you can call NodeConfig for the specified RoleNames and each of them will have their Configs added to their ActorSystem configurations at startup.

Note

NodeConfig takes precedent over CommonConfig

Enabling TestTransport to Simulate Network Errors

One final but important thing you might want to during the design of a multi-node test is to enable the TestTransport, which exposes a capability inside your tests that allows for you to create network partitions, disconnects, and latency on the fly.

public class SurviveNetworkInstabilitySpecConfig : MultiNodeConfig
{
    public RoleName First { get; }
    public RoleName Second { get; }
    public RoleName Third { get; }
    public RoleName Fourth { get; }
    public RoleName Fifth { get; }
    public RoleName Sixth { get; }
    public RoleName Seventh { get; }
    public RoleName Eighth { get; }

    public SurviveNetworkInstabilitySpecConfig()
    {
        First = Role("first");
        Second = Role("second");
        Third = Role("third");
        Fourth = Role("fourth");
        Fifth = Role("fifth");
        Sixth = Role("sixth");
        Seventh = Role("seventh");
        Eighth = Role("eighth");

        CommonConfig = DebugConfig(false)
            .WithFallback(ConfigurationFactory.ParseString(@"
                akka.remote.system-message-buffer-size = 100
                akka.remote.dot-netty.tcp.connection-timeout = 10s
            "))
            .WithFallback(MultiNodeClusterSpec.ClusterConfig());

        TestTransport = true;
    }

To enable the TestTransport, all you have to do is set TestTransport = true inside the MultiNodeConfig constructor.

Once that's done, you'll be able to use the TestConductor inside your multi-node tests to enable all kinds of simulated network partitions.

Step 2 - Create Your MultiNodeClusterSpec or MultiNodeSpec

Once you've created your MultiNodeConfig, you'll want to create a MultiNodeClusterSpec if you're using Akka.Cluster or a MultiNodeSpec if you just want to use Akka.Remote.

We're going to show you a full code sample first and walk through how it works in detail below.

public class RestartNode2Spec : MultiNodeClusterSpec
{
    private class Watcher : ReceiveActor
    {
        public Watcher()
        {
            Receive<Address>(a =>
            {
                seedNode1Address = a;
                Sender.Tell("ok");
            });
        }
    }

    readonly RestartNode2SpecConfig _config;

    private Lazy<ActorSystem> seed1System;
    private Lazy<ActorSystem> restartedSeed1System;
    private static Address seedNode1Address;

    private ImmutableList<Address> SeedNodes => 
        ImmutableList.Create(seedNode1Address, GetAddress(_config.Seed2));

    public RestartNode2Spec() : this(new RestartNode2SpecConfig()) { }

    protected RestartNode2Spec(RestartNode2SpecConfig config) : base(config, typeof(RestartNode2Spec))
    {
        _config = config;
        seed1System = new Lazy<ActorSystem>(() => ActorSystem.Create(Sys.Name, Sys.Settings.Config));
        restartedSeed1System = new Lazy<ActorSystem>(
            () => ActorSystem.Create(Sys.Name, ConfigurationFactory
                .ParseString("akka.remote.netty.tcp.port = " + SeedNodes.First().Port)
                .WithFallback(Sys.Settings.Config)));
    }

    protected override void AfterTermination()
    {
        RunOn(() =>
        {
            Shutdown(seed1System.Value.WhenTerminated.IsCompleted ? restartedSeed1System.Value : seed1System.Value);
        }, _config.Seed1);
    }

    [MultiNodeFact]
    public void RestartNode2Specs()
    {
        Cluster_seed_nodes_must_be_able_to_restart_first_seed_node_and_join_other_seed_nodes();
    }

    public void Cluster_seed_nodes_must_be_able_to_restart_first_seed_node_and_join_other_seed_nodes()
    {
        Within(TimeSpan.FromSeconds(60), () =>
        {
            RunOn(() =>
            {
                // seed1System is a separate ActorSystem, to be able to simulate restart
                // we must transfer its address to seed2
                Sys.ActorOf(Props.Create<Watcher>().WithDeploy(Deploy.Local), "address-receiver");
                EnterBarrier("seed1-address-receiver-ready");
            }, _config.Seed2);


            RunOn(() =>
            {
                EnterBarrier("seed1-address-receiver-ready");
                seedNode1Address = Cluster.Get(seed1System.Value).SelfAddress;
                foreach (var r in ImmutableList.Create(_config.Seed2))
                {
                    Sys.ActorSelection(new RootActorPath(GetAddress(r)) / "user" / "address-receiver").Tell(seedNode1Address);
                    ExpectMsg("ok", TimeSpan.FromSeconds(5));
                }
            }, _config.Seed1);
            EnterBarrier("seed1-address-transfered");

            // now we can join seed1System, seed2 together
            RunOn(() =>
            {
                Cluster.Get(seed1System.Value).JoinSeedNodes(SeedNodes);
                AwaitAssert(() => Cluster.Get(seed1System.Value).State.Members.Count.Should().Be(2));
                AwaitAssert(() => Cluster.Get(seed1System.Value).State.Members.All(x => x.Status == MemberStatus.Up).Should().BeTrue());
            }, _config.Seed1);

            RunOn(() =>
            {
                Cluster.JoinSeedNodes(SeedNodes);
                AwaitMembersUp(2);
            }, _config.Seed2);
            EnterBarrier("started");

            // shutdown seed1System
            RunOn(() =>
            {
                Shutdown(seed1System.Value, RemainingOrDefault);
            }, _config.Seed1);
            EnterBarrier("seed1-shutdown");

            RunOn(() =>
            {
                Cluster.Get(restartedSeed1System.Value).JoinSeedNodes(SeedNodes);
                Within(TimeSpan.FromSeconds(30), () =>
                {
                    AwaitAssert(() => Cluster.Get(restartedSeed1System.Value).State.Members.Count.Should().Be(2));
                    AwaitAssert(() => Cluster.Get(restartedSeed1System.Value).State.Members.All(x => x.Status == MemberStatus.Up).Should().BeTrue());
                });
            }, _config.Seed1);

            RunOn(() =>
            {
                AwaitMembersUp(2);
            }, _config.Seed2);
            EnterBarrier("seed1-restarted");
        });
    }
}

First, take note of the default public constructor:

public RestartNode2Spec() : this(new RestartNode2SpecConfig()) { }

This is an XUnit restriction - there can only be one public constructor per class, and you need to pass in your MultiNodeConfig to the base class constructor, which is exactly what we do in the protected constructor.

protected RestartNode2Spec(RestartNode2SpecConfig config) : base(config, typeof(RestartNode2Spec))
{
    _config = config;
    seed1System = new Lazy<ActorSystem>(() => ActorSystem.Create(Sys.Name, Sys.Settings.Config));
    restartedSeed1System = new Lazy<ActorSystem>(
        () => ActorSystem.Create(Sys.Name, ConfigurationFactory
            .ParseString("akka.remote.netty.tcp.port = " + SeedNodes.First().Port)
            .WithFallback(Sys.Settings.Config)));
}

We're going to hang onto a copy of our RestartNode2SpecConfig class in a field called _config, which will be helpful when we need to look up RoleNames later.

Finally, we need to create our test method and decorate it with the MultiNodeFact attribute:

[MultiNodeFact]
public void RestartNode2Specs()
{
    Cluster_seed_nodes_must_be_able_to_restart_first_seed_node_and_join_other_seed_nodes();
}

This method is what will be executed by the multi-node test runner.

Addressing Nodes

All nodes in the multi-node test runner are going to be given randomized addresses and ports - thus we can never predict those addresses at the time we design our tests. Therefore, the way we always refer to nodes is by their RoleNames.

If we want to resolve the Akka.NET Address of a specific node, we can do this via the GetAddress method:

private ImmutableList<Address> SeedNodes => 
    ImmutableList.Create(seedNode1Address, GetAddress(_config.Seed2));

The GetAddress method accepts a RoleName and returns the Address that was assigned to the node by the multi-node test runner.

Running Code on Specific Nodes

The most important tool in the Akka.Remote.TestKit, the base library where all multi-node testing tools are defined, is the RunOn method:

RunOn(() =>
{
    // seed1System is a separate ActorSystem, to be able to simulate restart
    // we must transfer its address to seed2
    Sys.ActorOf(Props.Create<Watcher>().WithDeploy(Deploy.Local), "address-receiver");
    EnterBarrier("seed1-address-receiver-ready");
}, _config.Seed2);


RunOn(() =>
{
    EnterBarrier("seed1-address-receiver-ready");
    seedNode1Address = Cluster.Get(seed1System.Value).SelfAddress;
    foreach (var r in ImmutableList.Create(_config.Seed2))
    {
        Sys.ActorSelection(new RootActorPath(GetAddress(r)) / "user" / "address-receiver").Tell(seedNode1Address);
        ExpectMsg("ok", TimeSpan.FromSeconds(5));
    }
}, _config.Seed1);

Notice that the first RunOn call takes an argument of _config.Seed2, whereas the second RunOn call takes an argument of _config.Seed1. The code in the first RunOn block will only execute on the node with RoleName "Seed2" and the code in the second block will only run on RoleName "Seed1."

Given that each node runs inside its own process, it's likely that both of these blocks of code will be executing simultaneously.

Synchronizing Test Progression on Different Nodes

In order to make multi-node tests effective, we must have some means of synchronizing all of the nodes in each test - such that they all reach the same assertions at the same time. This is precisely what the EnterBarrier method helps us do, as you can see in the code sample above.

EnterBarrier creates a synchronization barrier between processes - no processes can advance past it until all processes have reached it. If one process fails to reach the barrier within 30 seconds (this is configurable via the Akka.Remote.TestKit reference configuration), the test will throw an assertion error and fail.

Terminating, Aborting, and Disconnecting Nodes

One of the most useful features of the multi-node testkit is its ability to simulate real-world networking issues, and this can be accomplished using some of the APIs found in the Akka.Remote.TestKit.

Creating Network Partitions In order to create a network partition between two or more nodes, the TestTransport must be enabled inside the MultiNodeConfig class constructor. This allows access to the TestConductor, which can be used to render two or more nodes unreachable:

RunOn(() =>
{
    TestConductor.Blackhole(_config.First, _config.Second, 
        ThrottleTransportAdapter.Direction.Both).Wait();
}, _config.First);
EnterBarrier("blackhole-2");

In this example, the TestConductor.Blackhole method is used to create 100% packet loss between the RoleName "First" and "Second". Those two nodes will still be running as part of the test, but they won't be able to communicate with each other over Akka.Remote.

The Task returned by TestConductor.Blackhole will complete once the Akka.Remote transport has enabled "blackhole" mode for that connection, which usually doesn't take longer than a few milliseconds.

To stop black-holing these nodes, we'd need to call the TestConductor.PassThrough method on these same two RoleName instances:

RunOn(() =>
{
    TestConductor.PassThrough(_config.First, _config.Second, 
        ThrottleTransportAdapter.Direction.Both).Wait();
}, _config.First);
EnterBarrier("repair-2");

This will allow Akka.Remote to resume normal execution over the network.

Killing Nodes There are two ways to kill a node in a running multi-node test.

The first is to call the Shutdown method on the ActorSystem of the node you wish to have exit the test. This will cause the ActorSystem to terminate gracefully - this simulates the planned shutdown of a node.

// shutdown seed1System
RunOn(() =>
{
    Shutdown(seed1System.Value, RemainingOrDefault);
}, _config.Seed1);
EnterBarrier("seed1-shutdown");

The other way to shutdown a node is to use the TestConductor.Exit command - this is intended to simulate the unplanned shutdown of a node, i.e. a process crash.

RunOn(() => {
    TestConductor.Exit(_config.Third, 0).Wait();
}, _config.First);

Once a node has exited the test, it will no longer be able to wait on EnterBarrier calls and the multi-node test runner will not try to collect any data from that node from that point onward.

Running Multi-Node Tests

Once you've coded your multi-node tests and compiled them, it's now time to run them. Akka.NET ships a custom XUnit2 runner that it uses to create the simulated networks and clusters and you will need to install that via NuGet in order to run your tests:

PS> nuget.exe Install-Package Akka.MultiNodeTestRunner -NoVersion

This will install the Akka.MultiNodeTestRunner NuGet package with the following directory and file structure:

root/akka.multinodetestrunner
root/akka.multinodetestrunner/lib/net452/Akka.MultiNodeTestRunner.exe
root/akka.multinodetestrunner/lib/netcoreapp1.1/Akka.MultiNodeTestRunner.dll

Depending on what framework you're building your application against, you'll want to pick the appropriate tool (.NET Framework or .NET Core.)

Next, we have to pass in our command-line arguments to the MNTR:

Akka.MultiNodeTestRunner.exe [path to assembly] [-Dmultinode.enable-filesink=on] [-Dmultinode.output-directory={dir path}] [-Dmultinode.spec={spec name}]

We strongly recommend setting the -Dmultinode.output-directory={dir path} directory to some local folder you can access, as the multi-node test runner will emit:

An output file for the entire test run of the DLL and
For each individual spec, a subfolder that contains logs pertaining to the original node.

Hint: Each test run will append new log entries to output files. If this is not desired, you can pass -Dmultinode.clear-output=1 option to delete output folder before MNTR will run tests.

If you're lost and need more examples, please explore the Akka.NET source code and take a look at some of the MNTR output produced by our CI system on any open pull request.

Debugging Failed Tests

As already mentioned, after each spec is finished, test runner wil emit log files for it to output directory subfolder with the full name of the spec. In this folder you will find individual logs for each node (named according to the roles they were assigned), and aggregated.txt file, which contains all nodes logs aggregated into single timeline.

Also, FAILED_SPECS_LOGS subdirectory will be generated. If any of your specs failed, this folder will contain aggregated logs for each spec - basically, the same aggregated.txt files but with their spec's names. This is a good place to get a full picture of what has failed and why.

Also, -Dmultinode.failed-specs-directory={failed spec dir} option could be used to override FAILED_SPECS_LOGS name.