Distributed tracing, also called distributed request tracing, is a monitoring technique for tracking the performance of applications that are composed of microservices. The point of contact with a microservice is an Application Programming Interface (API) – most often a REST API. REST stands for Representational State Transfer. It is an architectural style for communication between computer systems that operate on the Web.
REST APIs are particularly useful when writing apps for mobile devices because they split the software between an interface, which is resident on the device, and the backend processes, which run on a remote server. Those processes are microservices and the REST API manages the communication between them and the user interface.
When a program calls a function, the call usually requires values to be fed into parameters and a local variable to catch the results that come back from the function when it completes. The main difference between functions and microservices is that a function completes its task, returns the result, and then ends, whereas a microservice keeps running all the time, like a daemon.
A microservice can serve many apps. The main intention of microservices is to perform a service, rather than just a single action. For example, a service could involve managing access to a database – storing data and then retrieving it later on-demand.
So, an app calls a series of microservices to perform all of the hard work. However, that handful of microservices might not be linked or even resident on the same server.
An app can call microservice A, which is resident on a server in Amsterdam, receive results, and then feed them to microservice B, which is running on a server in San Francisco. In the meantime, it can deal with another request from the user by interfacing with microservice C, resident on a server in Sydney, and display the results on the screen. When it receives results back from microservice B in San Francisco, the app requests stored reference data from microservice A in Amsterdam, then makes a transaction with microservice D, which is running on a server in London. When microservice D returns a success status, the app sends messages to microservices A and B for them to update their records.
This process can be simplified by the writer of the app to further reduce the amount of processing performed on the device. In this scenario, a user action gets fed through to microservice E, which is resident on a server in Mumbai and deals with the calls to microservices A, B, and D. Microservice C is already pretty straightforward, so it could be left as a direct API call from the app.
A Web page that is accessed by a desktop or laptop Web browser to offer the same service as the companion mobile app can also call on microservices C and E. So, the same interface requests can be dealt with through two different environments – an app and a Web page loaded into a browser.
The complexity of APIs calling other APIs, without the constraint of one server hosting all of those services, can create a headache for performance monitoring. Typically, an application performance monitor (APM) would notice if the API to microservice E is taking a long time to come back with results. However, the APM would not be aware of why that API is slow. It can’t see through the call to the inner workings of the microservice behind the one line of the API. It isn’t even able to detect the existence of microservices A, B, and D, and has no way of detecting that the servers in Amsterdam, London, and San Francisco got involved. The performance monitor can’t supervise the statuses of those hidden servers or even get the opportunity to check connection quality between them and the server in Mumbai.
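The blind spot described above can be illustrated with a small sketch. It is not a real monitoring tool: the function names, the return values, and the sleep durations that stand in for remote calls are all assumptions made for illustration.

```python
import time


def microservice_a():
    time.sleep(0.01)   # stands in for a fast remote call
    return "reference data"


def microservice_b():
    time.sleep(0.05)   # stands in for the slow remote call
    return "processed data"


def microservice_d():
    time.sleep(0.01)   # stands in for another fast remote call
    return "transaction ok"


def microservice_e():
    # The aggregating service: callers only ever see this one entry point.
    return [microservice_a(), microservice_b(), microservice_d()]


def monitor(api_call):
    # An outside monitor can time the call, but it gets a single number back.
    start = time.perf_counter()
    result = api_call()
    return result, time.perf_counter() - start


result, elapsed = monitor(microservice_e)
print(f"microservice E took {elapsed:.3f}s")  # no hint that B was the slow part
```

The monitor learns that the call was slow but has no way to attribute the delay to any one of the services behind it.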
The obscurity of what APIs do is part of their success. The producers of services want to be able to make their assisting software available to the developers of applications without allowing those developers to see and steal the code that implements the service.
The developers are usually glad to find a library of services that removes all of the work of writing large chunks of their application. The producers of apps and Web pages are very happy to offload most of the processing demand onto a remote server.
So, everyone is happy with the microservices architecture, except when things go wrong, and that’s where distributed tracing comes in.
The need for distributed tracing
The way to monitor through APIs and check on the performance of microservices is to implement a trace on all of the distributed processing that a call to the API provokes.
Distributed tracing is a reporting channel that enables a top-level app to get a chain of statuses back from every microservice that gets called in order to complete the task requested by an API. This feedback can also be picked up by a monitoring tool.
Problems with microservice reporting
One problem with a distributed tracing strategy is that much of the detail depends on the amount of log messaging built into an underlying microservice. Even with automated error reporting, the writers of a higher-level microservice can decide to trap those errors, making it impossible for any monitoring tool to see the real reason for a problem. In these cases, the monitor will know that there is a problem with the API that the endpoint called but won’t know that the error occurred in an underlying service or where that service runs.
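A minimal sketch of how trapping an error hides its cause is shown here. The service names, the exception class, and the response fields are hypothetical and only serve to contrast the two behaviors.

```python
class DownstreamError(Exception):
    """Raised by the hypothetical low-level service."""


def inventory_service(item_id: str) -> int:
    # Hypothetical low-level microservice that fails with a specific cause.
    raise DownstreamError(f"inventory database timeout while reading {item_id}")


def order_service_opaque(item_id: str) -> dict:
    # Traps the error and reports only a generic failure: the cause is lost.
    try:
        return {"status": "ok", "stock": inventory_service(item_id)}
    except DownstreamError:
        return {"status": "error", "detail": "order failed"}


def order_service_transparent(item_id: str) -> dict:
    # Still handles the error, but passes the origin and the cause upward.
    try:
        return {"status": "ok", "stock": inventory_service(item_id)}
    except DownstreamError as exc:
        return {
            "status": "error",
            "detail": "order failed",
            "origin": "inventory_service",
            "cause": str(exc),
        }
```

A monitor that reads the first response knows only that the order failed. The second response tells it which underlying service failed and why.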
In many ways, the purpose of an API-based architecture is to hide the inner workings of software and that gives the upper microservices a lot of power over what status messages can be seen by the user-facing app.
Another problem is that reporting is layered, so the statuses and processing information from one microservice won’t be released for examination by higher-level services if that process hangs or crashes.
Distributed tracing strategy
In order to try to get around the control that higher-level microservices have over the reporting mechanisms in lower-level microservices, distributed tracing can operate through the environment rather than through status messages embedded in the code.
Distributed tracing has its own conventions and terminology. There are three important terms that you need to know when investigating distributed tracing:
- Trace – The end-to-end record of one request to a top-level API, made up of the status feedback from every service involved.
- Span – A work unit, which represents a status report from one service. This has a start and end timestamp and associated metadata.
- Tag – An identifier or label for a span.
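The three terms can be pictured as a small data structure. This is a sketch rather than the schema of any particular tracing system, and the field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    trace_id: str                     # shared by every span in one trace
    span_id: str                      # unique within the trace
    name: str                         # the unit of work being reported
    start: float                      # start timestamp, in seconds
    end: float                        # end timestamp, in seconds
    parent_id: Optional[str] = None   # None for the span at the top of the tree
    tags: Dict[str, str] = field(default_factory=dict)  # labels for the span

    @property
    def duration_ms(self) -> float:
        # Elapsed time of this work unit in milliseconds.
        return (self.end - self.start) * 1000.0


# A trace is then the set of spans that share one trace_id.
root = Span("t1", "s1", "checkout", start=10.0, end=10.25,
            tags={"service": "microservice E", "region": "Mumbai"})
child = Span("t1", "s2", "lookup", start=10.05, end=10.10,
             parent_id="s1", tags={"service": "microservice A"})
trace = [root, child]
```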
A typical trace has a tree structure, which reflects the parent/child relationship between microservices and those other services that each calls. The Tag on report messages reflects that structure, enabling a monitor to identify all of the layers involved in servicing an API call and, optionally, draw up a diagram of the group.
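As a rough illustration of how a monitor could rebuild that tree, the sketch below groups span records by their parent identifier and prints an indented outline. The record layout and the database child under microservice A are hypothetical.

```python
from collections import defaultdict

# Hypothetical span records: (span_id, parent_id, service name).
spans = [
    ("s1", None, "microservice E"),
    ("s2", "s1", "microservice A"),
    ("s3", "s2", "database"),
    ("s4", "s1", "microservice B"),
    ("s5", "s1", "microservice D"),
]


def build_tree(spans):
    # Map each parent to its children and remember which span has no parent.
    children = defaultdict(list)
    names = {}
    root = None
    for span_id, parent_id, name in spans:
        names[span_id] = name
        if parent_id is None:
            root = span_id
        else:
            children[parent_id].append(span_id)
    return root, children, names


def render(span_id, children, names, depth=0):
    # Depth-first walk that indents each service under its caller.
    lines = ["  " * depth + names[span_id]]
    for child in children[span_id]:
        lines.extend(render(child, children, names, depth + 1))
    return lines


root, children, names = build_tree(spans)
print("\n".join(render(root, children, names)))
```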
The microservice at the top of the tree offers a “trace root.” When a microservice launches another, that call is known as an “exit span.” This term applies whether the service being called is within the same package or an external service that might be on another server. The opening report of each microservice is called an “entry span.”
All events, including the entry span and the exit span, are grouped together as a “process boundary.” Any event between an entry span and an exit span is termed an “in-process span.”
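The entry and exit terminology can be sketched with a small context manager that records one event per span. This is an illustrative toy, not the API of any tracing library. The names span, events, and the two service functions are assumptions.

```python
import time
from contextlib import contextmanager

events = []  # collected (kind, name, elapsed_seconds) tuples


@contextmanager
def span(kind, name):
    # Record how long the wrapped block took, even if it raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        events.append((kind, name, time.perf_counter() - start))


def microservice_a():
    # Entry span: the opening report of the called service.
    with span("entry", "microservice A"):
        return "reference data"


def microservice_e():
    with span("entry", "microservice E"):
        # Exit span: microservice E calling out to another service.
        with span("exit", "call to microservice A"):
            return microservice_a()
```

Calling microservice_e() leaves three events in the list, recorded in the order in which the spans finish: the entry span of A, the exit span in E, and then the entry span of E.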
Trace context is usually carried between services in Hypertext Transfer Protocol (HTTP) headers, and the W3C Trace Context specification standardizes a traceparent header for this purpose. This builds up a hierarchical series of identifiers. The initial request gets a Trace ID embedded in the HTTP header that calls an API. The microservice (“process”) behind that API adds its own identifier to that and then an identifier for each step (“span”) that it executes. Therefore, the Trace ID is like a session ID and each level of service that contributes to fulfilling the request adds its own ID to the chain.
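One widely used header format is the traceparent header from the W3C Trace Context specification, which has four hyphen-separated fields: a version, a 32-character trace ID, a 16-character parent span ID, and flags. The sketch below shows how a service could keep the Trace ID and substitute its own span ID before calling the next service. The function names are hypothetical, and validation of incoming headers is left out.

```python
import secrets


def new_traceparent() -> str:
    # Start a trace: fresh 16-byte trace ID and 8-byte span ID, sampled flag set.
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: str) -> str:
    # Keep the version, trace ID, and flags; replace the span ID with our own.
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


root_header = new_traceparent()
outgoing_headers = {"traceparent": child_traceparent(root_header)}
```

Every service in the chain sees the same Trace ID, so a collector can group their spans into one trace.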
However, it is possible for the managers of microservices to turn that reporting capability off – so, ultimately, the success of a trace relies on the cooperation of all of the suppliers of microservices.
Categories of distributed tracing messages
There are three ways in which a distributed tracing record can be generated:
- Code trace – A message written into the microservice’s code by the programmer.
- Data trace – A complicated method of verifying the data that gets generated by low-level processes. It is particularly concerned with Critical Data Elements (CDEs). A process doesn’t always return data to a parent process but inserts it into a database for later reference. However, that process could produce incorrect results and data tracing is a way to check data in the meantime. Data tracing produces a lot of processing overhead and isn’t often implemented. An alternative solution is Statistical Process Control (SPC).
- Program trace – Debug information including the language a process was written in, the actions it performed, and the services it used.
Distributed tracing produces masses of data and that could overload a storage-starved mobile device. Storage space on servers is very cheap and plentiful but on mobile devices, storage space is in short supply, and processing capacity is even harder to come by. For this reason, the management of trace data is offloaded to a remote server by the trace root.
The volume of trace messages needs to be reduced, and one automated method of doing that is sampling, which keeps only a proportion of the traces. Another method is log aggregation. This process involves storing trace messages in a database, which can then be queried to extract summaries.
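Sampling can be done in several ways. The sketch below shows one simple approach, which is an assumption rather than the method of any specific product: hash the Trace ID and keep the trace only if the hash falls below a threshold. Because the decision depends only on the Trace ID, every service that uses the same rule keeps or drops the same traces.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    # Map the trace ID to a 64-bit integer and compare it with the threshold
    # that corresponds to the requested sampling rate (0.0 to 1.0).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)


trace_ids = [f"trace-{n}" for n in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```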
Distributed logging is an alternative to centralized logging. It uses communication resources more efficiently because log messages can be stored on the server that hosts the generating process.
Distributed logging has its merits and its drawbacks. Holding log messages on many servers can be a better way of dealing with very large mobile applications, such as online games that involve thousands of live players around the world. However, the system complicates the process of examining trace messages. It also makes consolidation almost impossible unless the initial store is held local to the process and summaries are then extracted from each distributed trace store and amalgamated in a centralized store.
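The consolidation step described above, where each server summarizes its own store and a central store amalgamates the summaries, might look like the following sketch. The record layout (service name and duration in milliseconds) and the function names are assumptions.

```python
from collections import defaultdict


def summarize(local_spans):
    # Reduce one server's local spans to per-service (count, total_ms).
    summary = defaultdict(lambda: [0, 0.0])
    for service, duration_ms in local_spans:
        summary[service][0] += 1
        summary[service][1] += duration_ms
    return {service: tuple(values) for service, values in summary.items()}


def merge(summaries):
    # Combine the per-server summaries into central per-service statistics.
    merged = defaultdict(lambda: [0, 0.0])
    for summary in summaries:
        for service, (count, total_ms) in summary.items():
            merged[service][0] += count
            merged[service][1] += total_ms
    return {
        service: {"count": count, "mean_ms": total_ms / count}
        for service, (count, total_ms) in merged.items()
    }


amsterdam = summarize([("A", 100.0), ("A", 300.0)])
san_francisco = summarize([("B", 50.0), ("A", 200.0)])
central = merge([amsterdam, san_francisco])
```

Only the small summaries cross the network, while the full trace messages stay on the servers that produced them.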
A big advantage of distributed tracing is that it provides a common reporting format for software that is written in different programming languages. However, that commonality exists in theory only because there are a number of competing distributed tracing standards. Fortunately, the field of tracing standards is narrowing.
The “open source” movement is becoming widely adopted in many fields of IT, including connection privacy (OpenVPN) and security monitoring (OSSEC). Distributed tracing had two major open standards, which have now merged. These two protocols were OpenTracing and OpenCensus. Now, they operate as one system, which is called OpenTelemetry.
OpenTracing was a project of the Cloud Native Computing Foundation (CNCF). OpenCensus is a Google product, based on its own in-house distributed tracing service, called Dapper. While software houses preferred OpenTracing, Google got the big tech giants onboard by open-sourcing its system. For example, Microsoft adopted OpenCensus. The combined standard, OpenTelemetry, is managed by the CNCF. However, it is still in development. It isn’t clear whether legacy systems deploying OpenTracing will be able to communicate with existing systems that use OpenCensus.
While the open-source standards for distributed tracing are reducing in number, there are still many other systems in use, both free and proprietary. Kafka is a free-to-use distributed event streaming platform managed by the Apache Software Foundation that is often used to transport trace data – this system needs to be hosted. AWS has its own distributed tracing service, called X-Ray. Some log management and private distributed tracing services add on their own messaging conventions – New Relic is one such service.
Distributed tracing was created in order to simplify the complicated task of monitoring microservices. However, the competing systems that offer this service make the whole field very complicated.
Distributed tracing tools
Distributed tracing log management tools are useful for developers of apps and microservices so that they can see what errors they made in calling a microservice and also try different utilities that a service offers to optimize the speed of their new programs.
The use of a trace management service is not free, however, and it is important to get a service that will allow messages to be filtered and sorted. Such a service would be halfway to a full monitoring tool. A distributed tracing monitor includes trace message consolidation, interpretation, and prioritization.
Some distributed tracing monitor providers don’t like to refer to their source material as log messages. They want to define their field as a separate entity. It isn’t log management and it isn’t debugging – it’s something that encapsulates logging and debug messaging and more besides.
We have a more in-depth post with reviews of the best distributed tracing monitoring tools. Below is a brief summary of some of the best distributed tracing management systems or platforms to consider.
Site24x7’s APM can focus on the activities occurring on a specific server or monitor the activities of a website or app. This service includes microservice monitoring with distributed tracing. This monitoring service shows an overview of each transaction with response times and a graph that indicates which microservices contributed to that processing period. A drill-down feature shows each microservice and its metrics and then the option to look at each message generated by that microservice while working for the transaction.
New Relic is a prominent supplier of distributed tracing monitors. It adds its own message tagging system to standard distributed tracing techniques implemented through HTTP headers. The base protocols for this system are OpenCensus and OpenTelemetry. A series of New Relic Agents add to the statistics that come out of the open-source systems and also extend monitoring capabilities to AWS Lambda and add on browser-based tracing services. New Relic can also process trace data collected by other applications.
Datadog’s application performance monitor includes distributed tracing monitors. Datadog has its own standard for distributed tracing but it is also able to work with OpenTracing and OpenTelemetry standards. It has a specialized interface to collect data from AWS Lambda as well. The Datadog APM interface is able to generate Service Maps, which are very useful visual representations of the relationships between active microservices.
Lightstep specializes in distributed tracing – this is the company’s only product. Lightstep was founded by one of the managers of Google’s Dapper project and takes distributed tracing to the next level with this APM-like system. The Lightstep dashboard presents live microservice relationship visualizations and status graphs. Drill-down investigative tools help you identify the root cause of performance problems.