Simple Testing Can Prevent Most Critical Failures

Updated on 2021-08-20

This paper studies 198 randomly sampled, user-reported failures of five data-intensive distributed systems that were designed to tolerate component failures: Cassandra (NoSQL distributed database, peer-to-peer system), HBase (NoSQL distributed database, master-slave design), Hadoop Distributed File System (master-slave design), Hadoop MapReduce (distributed data-analytic framework, master-slave design), and Redis (distributed key-value store supporting master/slave replication).

Here are the major findings.

  • The error manifestation sequences tend to be relatively complex, requiring an unusual sequence of multiple events with specific input parameters from a large space to lead the system to a failure.

  • Almost all (92%) of the catastrophic system failures (i.e., failures that affect a majority of users) are the result of incorrect handling of non-fatal errors explicitly signaled in software. Indeed, it is well known that error handling code is often buggy.

  • In 35% of the catastrophic failures, the faults in the error handler code fall into three trivial patterns: 1) the error handler is simply empty or only contains a log printing statement, 2) the error handler aborts the cluster on an overly-general exception, and 3) the error handler contains expressions like “FIXME” or “TODO”.

  • 74% of the failures are deterministic in that they are guaranteed to manifest with an appropriate input sequence.

  • Almost all failures are guaranteed to manifest on no more than three nodes, and most require no more than three input events.

  • 77% of the failures can be reproduced by a unit test.

  • The types of input events that led to failures include starting up a service, a node becoming unreachable, a configuration change, and adding a node.
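
The three trivial handler patterns listed above are simple enough to check for mechanically. The following is a minimal sketch of such a check, written in Python against Python source purely for illustration; the systems in the study are not written in Python, and this is not the paper's own tooling. The names `check_handlers`, `ABORT_CALLS`, `LOG_NAMES`, and the sample functions are all hypothetical, and the heuristics (for example, matching a call by its trailing name) are assumptions that a real checker would need to refine.

```python
import ast
import io
import tokenize

# Hypothetical heuristics: trailing call names treated as "abort" or "logging".
ABORT_CALLS = {"exit", "_exit", "abort"}
LOG_NAMES = {"print", "log", "debug", "info", "warning", "error", "exception"}
BROAD = {"Exception", "BaseException"}


def _call_name(node):
    """Return the trailing name of a call, e.g. sys.exit(1) -> 'exit'."""
    if isinstance(node, ast.Call):
        func = node.func
        if isinstance(func, ast.Attribute):
            return func.attr
        if isinstance(func, ast.Name):
            return func.id
    return None


def _is_noop_or_log(stmt):
    """True for `pass`, a bare constant, or a single logging/print call."""
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Expr):
        if isinstance(stmt.value, ast.Constant):
            return True
        return _call_name(stmt.value) in LOG_NAMES
    return False


def check_handlers(source):
    """Return sorted (line, pattern) pairs for suspicious except blocks."""
    todo_lines = {
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
        and ("TODO" in tok.string or "FIXME" in tok.string)
    }
    findings = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Pattern 1: handler is empty or only logs.
        if all(_is_noop_or_log(s) for s in node.body):
            findings.add((node.lineno, "empty-or-log-only"))
        # Pattern 2: handler aborts on an overly general exception.
        broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in BROAD
        )
        aborts = any(
            _call_name(n) in ABORT_CALLS for s in node.body for n in ast.walk(s)
        )
        if broad and aborts:
            findings.add((node.lineno, "abort-on-broad-exception"))
        # Pattern 3: a TODO/FIXME comment sits inside the handler's line span.
        if any(node.lineno <= ln <= node.end_lineno for ln in todo_lines):
            findings.add((node.lineno, "todo-in-handler"))
    return sorted(findings)


SAMPLE = '''
import sys, logging

def replicate(node):
    try:
        node.send()
    except IOError:
        logging.warning("send failed")

def flush(store):
    try:
        store.flush()
    except Exception:
        store.close()
        sys.exit(1)

def compact(table):
    try:
        table.compact()
    except KeyError:
        # TODO: retry the compaction
        table.mark_dirty()
'''

for line, pattern in check_handlers(SAMPLE):
    print(f"line {line}: {pattern}")
```

Each of the three sample functions trips exactly one pattern. A handler that does real recovery work and re-raises would not be reported. One known gap in this sketch: a TODO comment placed after the last statement of a handler falls outside the handler's line span and is missed.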

Existing testing techniques for error handling logic primarily use a “top-down” approach, in which the system is started using testing inputs or model checking, and errors are actively injected at different stages. The paper’s findings suggest that a “bottom-up” approach could be effective at finding the remaining bugs: one starts from existing error handling logic and tries to reverse-engineer test cases that trigger it. For instance, symbolic execution techniques could be extended to purposefully reconstruct an execution path that reaches the error handling code block.
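
A hand-written version of the bottom-up idea is to pick one error handler, identify the call that can signal the error, and write a unit test that forces that error so the handler actually runs. The sketch below assumes a hypothetical `Store` component and a hypothetical `ReplicaUnavailable` error; neither comes from the paper or from any of the studied systems. It uses `unittest.mock` from the Python standard library to inject the error at the call site.

```python
from unittest import mock


class ReplicaUnavailable(Exception):
    """Hypothetical non-fatal error signaled by a storage client."""


class Store:
    """Hypothetical component whose error handler we want to exercise."""

    def __init__(self, client):
        self.client = client
        self.pending = []

    def write(self, key, value):
        try:
            self.client.put(key, value)
            return True
        except ReplicaUnavailable:
            # Handler under test: the write must be queued for retry,
            # not silently dropped.
            self.pending.append((key, value))
            return False


def test_write_is_queued_when_replica_unreachable():
    # Start from the handler: force the one call that can raise the error.
    client = mock.Mock()
    client.put.side_effect = ReplicaUnavailable("node-2 unreachable")
    store = Store(client)

    assert store.write("k", "v") is False
    assert store.pending == [("k", "v")]
    client.put.assert_called_once_with("k", "v")


test_write_is_queued_when_replica_unreachable()
```

The test needs no cluster and no timing-dependent setup, which is consistent with the finding that most of the studied failures are deterministic and reproducible in a unit test. The automated approach the paper envisions would derive such tests from the handler itself rather than relying on a developer to write them.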

References

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems (OSDI 2014)