Best Distributed Tracing Tools

Microservices architecture offloads processing requirements from apps. This is a necessary step when catering to mobile devices where processing and storage space are at a premium. Microservices are accessed through REST APIs.

Each microservice performs a service, such as managing database access, membership services, invoicing functions, and so on. However, it is common for a microservice to employ others in order to complete a task. So, there can be many layers of microservices contributing to a successful transaction.

Here is our list of the five best distributed tracing tools:

  1. Site24x7 APM EDITOR’S CHOICE A package of application monitors that will monitor the functions operating on a server or the applications contributing to a website or app. This monitoring system includes a section for monitoring each app’s performance that illustrates the response times of all the microservices backing it.
  2. New Relic Telemetry Data Platform A thorough microservices monitoring system that is based on distributed tracing. New Relic adds in its own agents to supplement the standard distributed tracing messages with its own, more detailed statistics.
  3. Datadog APM A general application performance monitoring service that has a specialized microservices monitoring feature with extra data gathering services.
  4. Lightstep A microservices monitoring tool that was written by one of the designers of Google’s in-house distributed tracing platform.
  5. Dynatrace An AI-based application performance monitor that specializes in Cloud and Web-based systems and has distributed tracing management built-in.

Abstraction

Every system designer and programmer knows about abstraction. It is a black-box strategy: you split off useful pieces of code so that they can be reused again and again, which optimizes the investment that goes into building software.

Once a piece of code can complete a task successfully, it can be stored individually and represented by its declaration. All anyone needs to know is what it does, what inputs it requires, and what type of data it returns. After that, it is like a command that can be plugged into any other program. Software houses build up libraries of functions and when they make those available to other developers, they are called APIs.

APIs take care of tasks without having to reinvent the wheel with every new piece of software. However, the big black-box advantage is also a curse. Functions that are made available for sale or rent can’t be read. They are usually even hosted by the software house that developed them. So, when a new app gets written with functionality provided by APIs the processing is performed God-knows-where.

That’s great for getting things done, but it is a nightmare for performance monitoring. Monitoring tools need access rights and third-party API providers are not going to let their customers run analyzers through their code.

All a performance monitor can track is the start and finish times of an API call and the results it returned. The monitor can’t see into the operations that the API’s backend code performs or see whether the microservice calls the APIs of other microservices. It is very common for microservices to be built on many layers and each microservice could well be run on a different server located in a different part of the world.
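As a rough illustration of that limit, the sketch below wraps an opaque call and records only what an outside observer can see: the elapsed time and the returned value. The names timed_call and fake_invoice_api are hypothetical, and the sleep merely stands in for hidden work on a remote server.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call an opaque function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_invoice_api(customer_id: int) -> dict:
    """Hypothetical stand-in for a remote API; the caller cannot see inside it."""
    time.sleep(0.01)  # Simulates hidden work, possibly calls to other services.
    return {"customer_id": customer_id, "status": "invoiced"}


result, elapsed = timed_call(fake_invoice_api, 42)
print(f"result={result} elapsed={elapsed:.3f}s")
```

Nothing in the measurement says how the time was divided among any services working behind the call, which is the gap that distributed tracing is meant to fill.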

Distributed tracing

Distributed tracing is an industry method to allow developers to monitor the performance of the APIs that they use without actually being able to analyze the backing microservice’s code. There are many protocols available for distributed tracing, which complicates a service that is intended to simplify a complicated problem.

The IT industry is solidifying around a few distributed tracing open standards.

  • OpenTracing is a product of the Cloud Native Computing Foundation (CNCF).
  • OpenCensus is a Google product, based on its own in-house distributed tracing service, called Dapper.
  • OpenTelemetry is a merger of OpenTracing and OpenCensus that is still under development. It is managed by the CNCF.

These are the main distributed tracing standards. However, there are others.

There are a number of independent, free-to-use distributed tracing platforms available. Among these are Kafka, managed by the Apache Software Foundation; Jaeger, with libraries of tracing functions for C#, Java, Node.js, Python, and Go; and Zipkin (OpenZipkin), which has libraries for more languages, including Java, JavaScript, C, C++, C#, Python, Scala, and Go.

The variety of available standards makes it difficult to track all microservices because the service being traced needs to post tracing messages according to one standard or another. If you have a distributed tracing tool that is based on Jaeger, you will miss out on status messages generated for Zipkin. Complicating matters further, AWS has its own proprietary system, called X-Ray, to monitor its Lambda microservices platform.

What is telemetry?

Many producers of distributed tracing systems use the word “telemetry” in their names and descriptions. “Telemetry” is not a word that was invented for IT – it exists in other areas of life. For example, you will see signs for the Telemetry Department in hospitals. The term comes from two Greek words, “tele,” which means “remote,” and “metron,” which means “measure.” In healthcare, a telemetry unit is a mobile heart monitor. In IT, the term could refer to any remote monitoring system but it has become specifically associated with distributed tracing for the monitoring of microservices.

A characteristic of telemetry is that it uses a parallel channel for status reporting. That means it doesn’t run through the code itself, but works alongside a running process and gathers statistics independently.

The usefulness of that strategy lies in the fact that, ordinarily, a program that falls over doesn’t get to the line of code that says “report a major problem.” Similarly, a process that is still alive but waiting for a resource (hanging) is stuck at a particular line of the program and can’t get out to say “I’m trapped here.” To account for that situation, the program includes routines to report “I’m still working fine.” So, when the monitor stops receiving those messages, it knows that something has gone wrong. A telemetry-based system is able to continue operating even when the process it monitors is in trouble.
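The “I’m still working fine” idea can be sketched as a simple heartbeat check. This is a minimal illustration, not the mechanism of any particular tracing standard: the HeartbeatMonitor class, the service names, and the five-second timeout are all assumptions made for the example, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last time each service reported that it was still working."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, service: str, timestamp: float) -> None:
        """Store the time of the most recent heartbeat from a service."""
        self.last_seen[service] = timestamp

    def silent_services(self, now: float) -> List[str]:
        """Return services whose last heartbeat is older than the timeout."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record("billing", timestamp=100.0)
monitor.record("membership", timestamp=100.0)
monitor.record("billing", timestamp=104.0)  # billing keeps reporting

# membership has been silent for 7 seconds, billing for only 3
print(monitor.silent_services(now=107.0))  # prints ['membership']
```

The monitor never needs the failing process to announce its own failure; the absence of messages is the signal.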

Distributed tracing tools

The best distributed tracing tools are able to detect and interpret messages written to a number of common microservice status reporting standards. There are three types of distributed tracing tools:

  • Trace message collectors
  • Trace message consolidators
  • Distributed tracing monitors

The task of collecting distributed tracing messages is a specialized service. It needs a tool that knows where to look for messages and can recognize their formats.

Trace message consolidation and storage is a service that could be implemented with many general-purpose log file managers, such as Splunk. The producers of distributed tracing tools don’t like the tracking system to be referred to as logs. However, this might be because they don’t want to have to compete with the much larger field of logfile managers.

Distributed tracing monitors are the highest form of distributed tracing tools because they produce a full information service for managing applications based on microservices.

Distributed tracing monitors

Distributed tracing monitors are the ultimate tools for distributed tracing because they include all three elements needed to track the performance of microservices. They will collect, manage, and interpret distributed tracing messages, presenting live statuses. They also store records in a meaningful file structure or in a database so that they can be easily accessed for historical analysis. A good distributed tracing monitor will also include a message viewer and interpreter for root cause analysis.

One problem with distributed tracing is that microservices can generate a lot of distributed tracing messages – most of them are just progress records, logging the time each service starts working for a particular session, and “keep-alive” type messages that let the monitor know that the service is still processing. A good distributed tracing tool will filter out these workaday notifications, or interpret them as graphs.
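The filtering step might look something like the sketch below. The message layout (a dict with "service" and "kind" keys) and the labels in ROUTINE_KINDS are hypothetical; real tracing formats name and structure these fields differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical labels for messages that are routine progress records.
ROUTINE_KINDS = {"start", "keep-alive"}


def split_messages(
    messages: List[Dict[str, str]]
) -> Tuple[List[Dict[str, str]], Counter]:
    """Separate notable messages from routine ones.

    Routine messages are not kept individually; they are reduced to a
    per-service count that could feed a graph.
    """
    notable = [m for m in messages if m["kind"] not in ROUTINE_KINDS]
    routine_counts = Counter(
        m["service"] for m in messages if m["kind"] in ROUTINE_KINDS
    )
    return notable, routine_counts


messages = [
    {"service": "billing", "kind": "start"},
    {"service": "billing", "kind": "keep-alive"},
    {"service": "billing", "kind": "keep-alive"},
    {"service": "membership", "kind": "start"},
    {"service": "membership", "kind": "error"},
]

notable, routine_counts = split_messages(messages)
print(notable)               # only the membership error remains
print(dict(routine_counts))  # routine traffic summarized per service
```

Five messages shrink to one item that needs attention plus two counters, which is the kind of reduction that makes a busy trace stream readable.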

It is possible to use a logfile manager and set up search scripts to filter and group messages. However, this is a lot of work and it is better to let an automated monitoring tool do the work for you.

The best distributed tracing monitoring tools

The producers of the best distributed monitoring systems have been able to exploit many of the built-in features of the trace message format to produce very comprehensive and attractive monitoring services.

1. Site24x7 APM

Site24x7 APM is a cloud-based service that is ideally suited to remote monitoring techniques, including distributed tracing for microservices. The company doesn’t reveal which protocols it is capable of interpreting. However, it is able to monitor services written in Java, .NET, and Node.js.

The APM is structured to monitor applications running on a server, a website’s activities, and apps for mobile devices. The standard APM package will track the activities of three applications that depend on microservices. This number can be bumped up with add-on fees. The same package also gives you the capabilities to monitor 40 websites and servers in any combination – for example, 20 servers and 20 sites or 20 sites and 10 servers, and so on.

The APM Insights service has a distributed tracing section that gives an overall view of an application’s performance. This includes a graphic representation of the total processing time of the application segmented by the time taken by each contributing microservice. You can drill down to observe the performance of each individual microservice and then go further in to see each individual message that the service generated.
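The arithmetic behind a segmented view of this kind can be sketched in a few lines. This is a generic illustration and not Site24x7’s implementation or API: the (service, duration) tuples are a hypothetical input format, and the durations are assumed to be non-overlapping segments of one transaction.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def time_by_service(spans: List[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the recorded durations (in milliseconds) for each service."""
    totals: Dict[str, float] = defaultdict(float)
    for service, duration_ms in spans:
        totals[service] += duration_ms
    return dict(totals)


def percentage_breakdown(totals: Dict[str, float]) -> Dict[str, float]:
    """Express each service's time as a percentage of the overall total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {
        name: round(100.0 * value / grand_total, 1)
        for name, value in totals.items()
    }


# Hypothetical segments of a single transaction.
spans = [
    ("gateway", 20.0),
    ("membership", 30.0),
    ("database", 100.0),
    ("membership", 50.0),
]

totals = time_by_service(spans)
print(totals)                        # gateway 20.0, membership 80.0, database 100.0
print(percentage_breakdown(totals))  # gateway 10.0, membership 40.0, database 50.0
```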

Site24x7 APM is a subscription service and you can access it on a 30-day free trial.

EDITOR’S CHOICE

Site24x7 APM is our top pick for a distributed tracing tool because it includes a wide range of performance monitoring and lag management utilities as well as microservice tracking. The combination of all of the monitoring techniques deployed by the APM means that it is a great tool for supervising apps written for Android and iOS while also examining how those same microservices perform when called from websites.

Get a 30-day free trial: site24x7.com/signup.html

Operating system: Cloud-based

2. New Relic Telemetry Data Platform

The New Relic Telemetry Data Platform is geared towards developers of apps as well as businesses that want to monitor their microservices infrastructure. The company stresses the affordability of its storage space, which shows that it understands that this type of service involves sorting through masses of data.

New Relic centralizes data collection from many sources, including distributed tracing messages generated through OpenTelemetry, OpenTracing, OpenCensus, and Zipkin. Other data sources for the monitoring system include log files from applications and infrastructure devices, a long list of AWS services such as Lambda, and status reports from Azure, Apache, and operating systems.

The New Relic application monitoring service deploys its own agents to add extra insights into web and app performance, driven by microservice actions. These include browser monitors and connection testers.

The New Relic Telemetry Data Platform has a free tier that will process up to 100 GB of data per month.

3. Datadog APM

Datadog APM is a very similar service to that offered by New Relic. This is a cloud-based application performance monitor that includes many types of source data, including distributed tracing messages.

Datadog APM is able to collect and process OpenTracing and OpenTelemetry messages. The service files these messages along with other indicators. Datadog APM also collects its own statistics on microservice performance and application processing environments with agents, in a similar way to the method used by New Relic. In addition to monitoring standard distributed tracing messages, Datadog APM is able to interface with a range of AWS services, including the Lambda microservices platform.

A great feature of Datadog APM is the way it processes the data that it collects. It is able to generate visual representations that show the connections between microservices operating live in a hierarchy to fulfill requests from a given application. This is called the Service Maps system and it is very impressive. The Service Maps act as an index to all operating microservices that can be zoomed in on for greater detail.

Datadog APM is a subscription service and you can get it on a 14-day free trial.

4. Lightstep

The founder of Lightstep was one of the designers of Google’s Dapper in-house distributed tracing platform. Dapper was open-sourced as OpenCensus, becoming the favored distributed tracing tool of many major IT businesses, including Microsoft. Now, OpenCensus is a part-contributor to the new, unified OpenTelemetry system.

Lightstep bases all of its monitoring work on distributed tracing, so it won’t integrate with monitors for other processing activities. However, Lightstep does microservices monitoring very well and that’s why it gets onto our list of the very best distributed tracing tools.

The graphical representations that Lightstep uses to interpret distributed tracing data are the key advantage of this tool. It displays the hierarchy of operating microservices serving an application, which is called the Operations Diagram. This leads to microservice performance data that also shows resource usage graphs and response time visuals.

Lightstep is a subscription service that is delivered from the Cloud. It is available in three editions: Community, Pro, and Enterprise. The Community edition is free to use and the Pro version is available on a 14-day free trial.

5. Dynatrace

Dynatrace is an AI-driven application performance monitor that is delivered from the Cloud. While covering many methods for status detection, Dynatrace applies machine learning and heuristics to identify important information from the large amounts of data that most reporting and logging systems generate.

Among the data sources for the Dynatrace system is the distributed tracing open standard OpenTracing. Dynatrace collects and processes activities that contribute to a given application. It then tracks back, interpreting distributed tracing messages to identify all of the microservices that worked on a session for that application.

The support that Dynatrace gives to system managers and developers enables new microservices to be written efficiently and tested. Systems managers can let Dynatrace monitor application performance because it will raise an alert if it identifies a problem with a microservice.

You can experience the Dynatrace application monitor on a 15-day free trial.

Choosing a distributed tracing monitoring tool

All distributed tracing standards include a priority field in the message structure. This is a great help because it means that distributed tracing monitors can quickly categorize the statuses of all microservices working on a session and generate alerts when high priority messages come through. Alert-based systems allow technical staff to get on with other tasks, assuming that all microservice operations are working well. They only need to pay attention to the monitor when an alert arises.
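A priority-based alert rule can be sketched as follows. The field name "priority", the numeric scale, and the threshold are assumptions made purely for illustration; each standard and each monitoring product defines its own fields and severity levels.

```python
from typing import Dict, List, Union

Message = Dict[str, Union[str, int]]

# Hypothetical scale: 1 = critical, 2 = high, larger numbers = routine.
ALERT_THRESHOLD = 2


def alerts_for_session(
    messages: List[Message], threshold: int = ALERT_THRESHOLD
) -> List[Message]:
    """Return only the messages urgent enough to raise an alert."""
    return [m for m in messages if int(m["priority"]) <= threshold]


session_messages: List[Message] = [
    {"service": "gateway", "priority": 5, "text": "request received"},
    {"service": "invoicing", "priority": 4, "text": "processing"},
    {"service": "database", "priority": 1, "text": "connection refused"},
]

for alert in alerts_for_session(session_messages):
    print(f"ALERT from {alert['service']}: {alert['text']}")
```

With a rule like this in place, staff only see the database failure; the two routine messages never interrupt them.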

The messaging system includes an identifier field that each microservice adds a number to. This identifier gives a trace through the microservice hierarchy and enables monitoring tools to identify the structure that serves an app. Really good monitors produce live maps of this process hierarchy that expose the complexity of all the interactions that are going on.
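One way to picture how an accumulating identifier reveals the hierarchy is the sketch below. It assumes, purely for illustration, that each service appends a number to a dotted path (so "1.2.1" was called by "1.2", which was called by "1"); real standards encode parent and child relationships in their own ways, and the service names here are invented.

```python
from typing import Dict, List, Optional


def parent_of(trace_path: str) -> Optional[str]:
    """Return the caller's path, or None for the top-level entry."""
    head, separator, _ = trace_path.rpartition(".")
    return head if separator else None


def build_hierarchy(messages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each service to the services it called, using path identifiers.

    `messages` maps a dotted trace path to the service that reported it.
    """
    children: Dict[str, List[str]] = {service: [] for service in messages.values()}
    for path, service in sorted(messages.items()):
        parent_path = parent_of(path)
        if parent_path is not None and parent_path in messages:
            children[messages[parent_path]].append(service)
    return children


# Hypothetical identifiers collected for one session.
session = {
    "1": "gateway",
    "1.1": "membership",
    "1.2": "invoicing",
    "1.2.1": "database",
}

hierarchy = build_hierarchy(session)
for service, called in hierarchy.items():
    print(f"{service} -> {called}")
```

The resulting mapping is the raw material for a live map: gateway calls membership and invoicing, and invoicing in turn calls database.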

Alerts, graphs, and analysis tools make supervising microservice activity a lot easier. The best distributed tracing monitors cut through all of the clutter of a mass of messages to present a clear live picture of microservice activity.