Understanding the behavior of a distributed system requires us to trace actions across nodes. For example, imagine web search is slow at Google and an engineer wants to figure out why. The engineer will have a hard time pinpointing the cause of the slowness because a single search request fans out to many sub-services built and maintained by different teams, and the slowdown could originate in any one of them or in their interactions.
Dapper is Google's distributed tracing system, which aims to be ubiquitous (i.e. all processes at Google use Dapper) and always on (i.e. Dapper is continuously tracing). Dapper aims to have low overhead, achieve application-level transparency (i.e. application developers shouldn't have to write tracing code themselves), be scalable, and enable engineers to explore recent tracing data.
A distributed tracing system needs to track all the work done on behalf of some initiator. For example, imagine a web server which issues an RPC to some cache. The cache then issues an RPC to a database and also to a lock service like Chubby. The database issues an RPC to a distributed file system. A distributed tracing service should track all of these interactions which were all initiated by the web server.
There are two main ways to perform distributed tracing: black-box schemes, which infer causal relationships by statistically correlating message timings, and annotation-based schemes, which explicitly tag every request with a global identifier that links its records together. Dapper takes the annotation-based approach.
Dapper represents a trace as a tree. Each node in the tree is a span. Each span corresponds to the execution of a single RPC on a single machine and includes information about the starting and stopping time of the RPC, user annotations, other RPC timings, and so on. Each span stores a unique span id, the span id of its parent, and a trace id, all of which are probabilistically unique 64-bit integers.
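To make this structure concrete, here is a minimal sketch of what a span record could look like. The Go types and field names are hypothetical, but the fields mirror what the paper describes: trace id, span id, parent span id, timings, and annotations.

```go
package dapper // hypothetical package name for these sketches

import "time"

// Annotation is a timestamped note attached to a span, either a plain
// string or a key-value pair (Key is empty for plain string annotations).
type Annotation struct {
	Timestamp time.Time
	Key       string
	Value     string
}

// Span records the execution of a single RPC on a single machine.
// All ids are probabilistically unique 64-bit integers; every span in
// a trace shares the same TraceID.
type Span struct {
	TraceID      uint64
	SpanID       uint64
	ParentSpanID uint64 // zero for the root span of the trace
	Name         string // human-readable name of the RPC
	Start, End   time.Time
	Annotations  []Annotation
}
```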
In order to achieve ubiquity and application-level transparency, Dapper instruments a few core libraries used throughout Google. Most importantly, it adds ~1,500 lines of C++ instrumentation to the RPC library. It also instruments some threading and control flow libraries.
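The sketch below gives a rough picture of what such library-level instrumentation has to do; the header names and helpers are hypothetical (Dapper does this inside Google's internal RPC, threading, and control-flow libraries, so applications see none of it). The client side attaches the trace and span ids to each outgoing request, and the server side picks them up to create a child span.

```go
package dapper

import (
	"math/rand"
	"strconv"
)

// TraceContext is the per-request state that the instrumented RPC and
// threading libraries carry across process and thread boundaries.
type TraceContext struct {
	TraceID      uint64
	SpanID       uint64
	ParentSpanID uint64
}

// inject is called by the client side of the RPC library just before a
// request is sent; it copies the caller's ids into the request headers.
func inject(ctx TraceContext, headers map[string]string) {
	headers["x-trace-id"] = strconv.FormatUint(ctx.TraceID, 16)
	headers["x-parent-span-id"] = strconv.FormatUint(ctx.SpanID, 16)
}

// extract is called by the server side when a request arrives; it starts
// a child span of the caller's span, or a brand-new trace if the request
// carries no tracing headers.
func extract(headers map[string]string) TraceContext {
	traceID, err := strconv.ParseUint(headers["x-trace-id"], 16, 64)
	if err != nil {
		traceID = rand.Uint64() // no incoming context: start a new trace
	}
	parentID, _ := strconv.ParseUint(headers["x-parent-span-id"], 16, 64)
	return TraceContext{
		TraceID:      traceID,
		SpanID:       rand.Uint64(), // new span for this server's work
		ParentSpanID: parentID,
	}
}
```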
Dapper allows application developers to inject their own custom annotations into spans. Annotations can be strings or key-value mappings. Dapper has some logic to limit the size of a user's annotations to avoid overzealous annotating.
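The annotation interface itself is small. The sketch below builds on the hypothetical Span type above; the method names and the size cap are stand-ins rather than Dapper's real API, but they illustrate both annotation forms and the guard against overzealous annotating. An application could then write something like `span.AnnotateKV("user_id", id)` next to the code being traced.

```go
package dapper

import "time"

// maxAnnotationBytes is a hypothetical per-span budget standing in for
// Dapper's limit on total annotation volume.
const maxAnnotationBytes = 4096

// Annotate attaches a plain string annotation to the span.
func (s *Span) Annotate(msg string) { s.annotate("", msg) }

// AnnotateKV attaches a key-value annotation to the span.
func (s *Span) AnnotateKV(key, value string) { s.annotate(key, value) }

func (s *Span) annotate(key, value string) {
	used := 0
	for _, a := range s.Annotations {
		used += len(a.Key) + len(a.Value)
	}
	if used+len(key)+len(value) > maxAnnotationBytes {
		return // over budget: drop the annotation, as a size limit would
	}
	s.Annotations = append(s.Annotations, Annotation{
		Timestamp: time.Now(),
		Key:       key,
		Value:     value,
	})
}
```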
Applications that are instrumented with Dapper write their tracing information to local log files. Dapper daemons running on servers then collect the logs and write them into a global BigTable instance in which each row is a trace and each column is a span. The median latency from the time a span is logged to its appearance in the BigTable is 15 seconds, but the delay can be much larger for some applications.
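The resulting storage layout is simple to sketch. The interface below is a generic stand-in for the BigTable client (not its real API), but it shows the row-per-trace, column-per-span layout and roughly what the per-host collection daemon does with each span it reads from the local logs; `encodeSpan` is a hypothetical serializer.

```go
package dapper

import "strconv"

// WideTable is a generic stand-in for the BigTable client API: a sparse
// table addressed by (row key, column key).
type WideTable interface {
	Write(rowKey, columnKey string, value []byte) error
}

// collectSpan is roughly what the per-host Dapper daemon does for each
// span it reads out of the local log files: the trace id names the row
// and the span id names the column, so a whole trace lives in one row.
func collectSpan(t WideTable, s Span) error {
	row := strconv.FormatUint(s.TraceID, 16)
	col := strconv.FormatUint(s.SpanID, 16)
	return t.Write(row, col, encodeSpan(s))
}

// encodeSpan is a hypothetical serializer for span records.
func encodeSpan(s Span) []byte {
	return []byte(s.Name) // placeholder; a real encoding would be structured
}
```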
Alternatively, traces can be collected in-band. That is, Dapper could add tracing information to the responses of RPCs and spans could build up traces as the RPCs execute. This has two main disadvantages. First, in-band trace data can dwarf the application data in RPC responses, especially for traces with large fan-out or fan-in. Second, in-band collection assumes that all RPCs are perfectly nested, which breaks when a service returns to its caller before all of its own downstream calls have completed.
Dapper traces could be more informative if they included application data, but for privacy reasons, including application data is opt-in. Dapper also enables engineers to audit the security of their systems. For example, Dapper can be used to check whether sensitive data from one machine is being sent to an unauthorized service on another machine.
Dapper has been running in production at Google for two years. The authors estimate that nearly every production process at Google supports tracing. Only a handful of applications need manual adjustments for Dapper to trace them correctly, and very few applications use a form of communication (e.g. raw TCP sockets) that Dapper does not trace. 70% of spans and 90% of traces have at least one user annotation.
Dapper aims to introduce as little overhead as possible. This is especially important for latency- or throughput-sensitive services.
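Most of the overhead reduction comes from sampling: a trace is either recorded in full or not at all, and the decision is made once when the root span is created, with every downstream span inheriting it. The helper below is a hedged sketch of that decision; deriving it from the trace id is an assumption, and 1 in 1024 is the kind of rate the paper reports as sufficient for high-throughput services.

```go
package dapper

// samplingModulus gives a uniform sampling rate of roughly 1 in 1024;
// the id-based scheme here is an assumption, not Dapper's actual logic.
const samplingModulus = 1024

// shouldSample is evaluated once, when a trace's root span is created;
// every downstream span inherits the decision, so traces are recorded
// either completely or not at all.
func shouldSample(traceID uint64) bool {
	return traceID%samplingModulus == 0
}
```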
Dapper provides an API, known as the Dapper Depot API (DAPI), to access the global BigTable of all trace data. DAPI can query the trace data by trace id, in bulk using a MapReduce, or via a composite index on (service name, host machine, timestamp). The returned trace objects are navigable trees with Span objects as vertices.
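A hedged sketch of the shape of such an API follows; the interface and method names are hypothetical, but the access patterns (by trace id, in bulk via MapReduce, or via the composite index) and the tree-of-spans result type mirror the description above.

```go
package dapper

import "time"

// TraceNode is one vertex of the navigable trace tree; children are
// linked up using each span's parent span id.
type TraceNode struct {
	Span     Span
	Children []*TraceNode
}

// TraceReader is a hypothetical stand-in for DAPI's read paths.
type TraceReader interface {
	// ByTraceID fetches a single trace (one BigTable row) and assembles
	// its spans into a tree.
	ByTraceID(traceID uint64) (*TraceNode, error)

	// ByIndex consults the composite (service name, host machine,
	// timestamp) index and returns matching trace ids; bulk access for
	// MapReduce jobs would scan the table directly instead.
	ByIndex(service, host string, from, to time.Time) ([]uint64, error)
}
```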
Dapper also provides a web UI that engineers can use to explore trace data and debug applications.