The first step consists of comparing the gossip lists and replacing each heartbeat value if the received value is lower. Node 1 therefore updates its heartbeat values for nodes 2, 3 and 4. Note that, for these nodes, the received suspect matrix contains more recent suspect lists than Node 1 currently has in its own suspect matrix. This insight is used when updating the suspect matrix in the third step. The first step is illustrated in the following figure:
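This merge step can be sketched in Python as follows. The function and variable names are illustrative, not taken from the thesis implementation; a gossip list is modelled as a mapping from node identifier to heartbeat value, where lower means more recent:

```python
def merge_gossip_lists(local, received):
    """Step 1 sketch: keep the lower (more recent) heartbeat per node.

    Returns the merged gossip list and the set of node ids whose
    heartbeat was updated from the received list.
    """
    merged = dict(local)
    updated = set()
    for node, heartbeat in received.items():
        if node not in merged or heartbeat < merged[node]:
            merged[node] = heartbeat
            updated.add(node)
    return merged, updated
```

The set of updated node ids is kept because, as noted above, it tells the third step which rows of the received suspect matrix are more recent.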
Second, Node 1 updates its suspect list based on the new gossip list. Each heartbeat value is compared to the cleanup time, which is the threshold value for nodes to be suspected. The result is that Node 3 is no longer suspected, as the new heartbeat value (5) is lower than the cleanup time:
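The second step amounts to a threshold check over the merged gossip list. A minimal sketch, assuming suspicion triggers at heartbeat values greater than or equal to the cleanup time (whether the boundary itself counts as suspected is an implementation detail):

```python
def update_suspect_list(gossip_list, cleanup_time):
    """Step 2 sketch: suspect every node whose heartbeat value has
    reached the cleanup-time threshold."""
    return {node for node, heartbeat in gossip_list.items()
            if heartbeat >= cleanup_time}
```

With a cleanup time of 20, a node with heartbeat value 5 (like Node 3 in the example) drops out of the suspect list, while a node whose heartbeat has grown past the threshold stays suspected.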
Following that, Node 1 updates its suspect matrix. This step uses the insight that a lower heartbeat value indicates more recent communication. Each row in the suspect matrix corresponds to the suspect list of the respective node; the first row in the suspect matrix is therefore equal to the suspect list of Node 1. A row in the received suspect matrix is more up to date exactly when the received heartbeat value for that node was lower, i.e. when the heartbeat was updated in the first step.
Because the heartbeat values of nodes 2, 3 and 4 were updated, the second, third and fourth rows are updated from the received suspect matrix. This is illustrated by the following figure:
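The row-wise update can be sketched as follows, modelling the matrix as a mapping from node id to that node's suspect list (a set). Names are again illustrative; the own row is overwritten with the own suspect list, since it is kept equal by definition:

```python
def merge_suspect_matrix(local_matrix, received_matrix, updated_nodes,
                         own_id, own_suspect_list):
    """Step 3 sketch: take rows from the received matrix for every node
    whose heartbeat was updated in step 1; keep the rest locally."""
    merged = {node: set(row) for node, row in local_matrix.items()}
    for node in updated_nodes:
        merged[node] = set(received_matrix.get(node, set()))
    # The own row always mirrors the own suspect list.
    merged[own_id] = set(own_suspect_list)
    return merged
```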
After these steps, Node 1 will check for consensus about all suspected nodes. In this case, Node 1 checks the column of Node 4. Because all live nodes suspect Node 4, consensus is reached and the failure is detected. The following figure shows the final state of Node 1:
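The consensus check walks the column of the suspected node: every live node (every node not itself suspected) must suspect it. A sketch under the same illustrative data model:

```python
def consensus_reached(suspect_matrix, suspect_list, target):
    """Consensus sketch: True when all live nodes suspect `target`.

    A node is live when it is not on the local suspect list; its row
    in the matrix holds the nodes it suspects."""
    live = [node for node in suspect_matrix if node not in suspect_list]
    return bool(live) and all(target in suspect_matrix[node]
                              for node in live)
```

In the example above, nodes 1, 2 and 3 are live and all of their rows contain Node 4, so consensus on the failure of Node 4 is reached.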
The paper discussed earlier used the system in a high-performance computing environment. As this environment differs from our application environment, in which we use servers communicating over the Internet, some additions and changes were made to improve the results.
When a node detects that consensus is reached about a failure, it broadcasts this across the network. However, this can happen multiple times for a single failure, as the consensus can be detected by multiple nodes simultaneously. To prevent multiple alerts when this happens, only the live node with the lowest identifier will send the consensus broadcast and alert. When this node fails, the next live node will take this role. The other nodes will know an alert has been sent when they receive the consensus broadcast.
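The leader rule for sending the alert is a simple lowest-identifier election over the live nodes. A hypothetical helper illustrating the rule (the actual implementation may look different):

```python
def is_alert_sender(own_id, all_nodes, failed_nodes):
    """Only the live node with the lowest identifier sends the
    consensus broadcast and alert; if it fails, the next lowest
    live node takes over automatically."""
    live = [node for node in all_nodes if node not in failed_nodes]
    return bool(live) and own_id == min(live)
```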
Layering of nodes allows for improved scalability by grouping nodes together. Nodes within one group communicate using the gossip protocol as described. In addition to that, the groups of nodes form higher-level layers, which gossip less frequently with each other. Each round, one node from each group gossips to a node in another group. This way, the groups get liveness information about each other and failures of whole groups are detected.
We decided that layering is beyond the scope of this project. Implementing support for layering will increase the complexity of the software, while the advantage of scalability is not needed. The network size we are interested in can work within one group, making layering unnecessary.
We made a proof-of-concept implementation and ran benchmarks with it. The results were compared with the current monitoring system. We found the failure detection time can be as low as 30 seconds, whereas our current monitor has a minimum detection time of 1 minute. However, the improved detection time came at the cost of increased resource usage on the servers running the monitor.
Another disadvantage is the development effort required to implement and maintain the distributed monitoring software. Compared to an external, centralised monitoring solution, a distributed monitor takes more effort to implement, deploy and maintain because of the higher complexity of the system.
Finally, because the monitoring software is running on the server instead of externally, the resource usage of the server can be monitored as well. This can provide more information in case of a failure, and from monitoring the system health, possible failure causes can be detected early, helping to prevent downtime.
Using distributed monitoring as an alternative to centralised monitoring can be beneficial. We discussed an algorithm that can be used for distributed monitoring and showed that the lower failure detection time comes at the cost of increased resource usage and system complexity.
These differences, as well as the infrastructure that will be monitored, determine which solution is best. From April to July, I have been researching this subject as part of my Master's degree in Software Engineering at the University of Amsterdam. My thesis should help in making an informed decision when choosing a monitoring solution. It can be found in the UvA digital library and contains more in-depth information on this subject and the benchmark measurements that led to these conclusions.