Best Distributed Tracing Tools

Microservices architecture offloads processing requirements from apps. This is a necessary step when catering to mobile devices where processing and storage space are at a premium. Microservices are accessed through REST APIs.

Each microservice performs a service, such as managing database access, membership services, invoicing functions, and so on. However, it is common for a microservice to employ others in order to complete a task. So, there can be many layers of microservices contributing to a successful transaction.

Here is our list of the best distributed tracing tools:

  1. Datadog APM EDITOR’S CHOICE A general application performance monitoring service that has a specialized microservices monitoring feature with extra data gathering services. Start a 14-day free trial.
  2. Site24x7 APM A package of application monitors that will monitor the functions operating on a server or the applications contributing to a website or app. This monitoring system includes a section for monitoring each app’s performance that illustrates the response times of all the microservices backing it.
  3. New Relic Telemetry Data Platform A thorough microservices monitoring system that is based on distributed tracing. New Relic adds in its own agents to supplement the standard distributed tracing messages with its own, more detailed statistics.
  4. Lightstep A microservices monitoring tool that was written by one of the designers of Google’s in-house distributed tracing platform.
  5. Dynatrace An AI-based application performance monitor that specializes in Cloud and Web-based systems and has distributed tracing management built-in.

Abstraction

Every system designer and programmer knows about abstraction. It is a black-box strategy that means you can split off useful pieces of code so that they can be reused again and again and optimize the investment that goes into building software.

Once a piece of code can complete a task successfully, it can be stored individually and represented by its declaration. All anyone needs to know is what it does, what inputs it requires, and what type of data it returns. After that, it is like a command that can be plugged into any other program. Software houses build up libraries of functions and when they make those available to other developers, they are called APIs.

APIs take care of tasks without having to reinvent the wheel with every new piece of software. However, the big black-box advantage is also a curse. Functions that are made available for sale or rent can’t be read. They are usually even hosted by the software house that developed them. So, when a new app gets written with functionality provided by APIs the processing is performed God-knows-where.

That’s great for getting things done, but it is a nightmare for performance monitoring. Monitoring tools need access rights and third-party API providers are not going to let their customers run analyzers through their code.

All a performance monitor can track is the start and finish times of an API call and the results it returned. It can’t see into the operations that the API’s backend code performs or see whether the microservice calls APIs to other microservices. It is very common for microservices to be built on many layers and each microservice could well be run on a different server located in a different part of the world.
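As a rough illustration of how little an outside observer can measure, the sketch below times a call to an opaque function and records only the elapsed time and the returned value. The names timed_call and third_party_invoice_total are hypothetical, and the local function merely stands in for a remote API whose code cannot be inspected.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(api: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call an opaque function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = api(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def third_party_invoice_total(items):
    # Hypothetical stand-in for a remote API: the caller sees inputs and
    # outputs, but not what happens inside or which services it calls.
    return sum(price * quantity for price, quantity in items)


result, elapsed = timed_call(third_party_invoice_total, [(10.0, 2), (5.5, 1)])
print(f"result={result} elapsed={elapsed:.6f}s")
```

Everything that happens between the start and finish timestamps stays invisible, which is the gap that distributed tracing is meant to fill.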

Distributed tracing

Distributed tracing is an industry method to allow developers to monitor the performance of the APIs that they use without actually being able to analyze the backing microservice’s code. There are many protocols available for distributed tracing, which complicates a service that is intended to simplify a complicated problem.

The IT industry is solidifying around a few distributed tracing open standards.

  • OpenTracing is a product of the Cloud Native Computing Foundation (CNCF).
  • OpenCensus is a Google product, based on its own in-house distributed tracing service, called Dapper.
  • OpenTelemetry is a merger of OpenTracing and OpenCensus that is still under development and is managed by the CNCF.

These are the main distributed tracing standards. However, there are others.

There are a number of independent, free-to-use distributed tracing platforms available. Among these are Kafka, managed by the Apache Software Foundation; Jaeger, with libraries of tracing functions for C#, Java, Node.js, Python, and Go; and Zipkin (OpenZipkin), which has libraries for more languages, including Java, JavaScript, C, C++, C#, Python, Scala, and Go.

The variety of available standards makes it difficult to track all microservices because the service being traced needs to post tracing messages according to one standard or another. If you have a distributed tracing tool that is based on Jaeger, you will miss out on status messages generated for Zipkin. Complicating matters further, AWS has its own proprietary system, called X-Ray, to monitor its Lambda microservices platform.
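The incompatibility is easy to see at the level of the HTTP headers that carry trace context between services. The sketch below, offered as a simplified illustration rather than a full implementation of either format, parses Zipkin's single "b3" header and Jaeger's "uber-trace-id" header. The function names are hypothetical, header names are assumed to be lower-cased already, and the sample IDs are made-up values.

```python
from typing import Dict, Optional


def parse_b3_single(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Parse Zipkin's single 'b3' header: traceid-spanid[-sampled[-parentid]]."""
    value = headers.get("b3")
    if not value:
        return None
    parts = value.split("-")
    if len(parts) < 2:
        # A lone sampling flag carries no trace or span ID.
        return None
    return {"trace_id": parts[0], "span_id": parts[1]}


def parse_jaeger(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Parse Jaeger's 'uber-trace-id' header: traceid:spanid:parentid:flags."""
    value = headers.get("uber-trace-id")
    if not value:
        return None
    parts = value.split(":")
    if len(parts) != 4:
        return None
    return {"trace_id": parts[0], "span_id": parts[1]}


# A request instrumented for Zipkin (sample IDs are illustrative only).
zipkin_request = {"b3": "80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1"}

print(parse_b3_single(zipkin_request))  # trace and span IDs are recovered
print(parse_jaeger(zipkin_request))     # None: a Jaeger-only reader sees nothing
```

A collector that only understands one of these formats silently loses the trace, which is why multi-standard support matters when choosing a tool.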

What is telemetry?

Many producers of distributed tracing systems use the word “telemetry” in their names and descriptions. “Telemetry” is not a word that was invented for IT – it exists in other areas of life. For example, you will see signs for the Telemetry Department in hospitals. The term comes from two Greek words, “tele,” which means “remote” and “metron,” which means “measure.” In healthcare, a telemetry unit is a mobile heart monitor. In IT, the term could refer to any remote monitoring system but it has become specifically associated with distributed tracing for the monitoring of microservices.

A characteristic of telemetry is that it uses a parallel channel for status reporting. That means it doesn’t run through the code itself, but works alongside a running process and gathers statistics independently.

The usefulness of that strategy lies in the fact that, ordinarily, a program that falls over doesn’t get to the line of code that says “report a major problem.” Similarly, a process that is still alive but waiting for a resource (hanging) is stuck at a particular line of the program and can’t get out to say “I’m trapped here.” To account for that situation, the program includes routines to report “I’m still working fine.” So, when the monitor stops receiving those messages, it knows that something has gone wrong. A telemetry-based system is able to continue operating even when the process it monitors is in trouble.
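The “I’m still working fine” approach can be sketched as a simple heartbeat check. This is a minimal illustration under assumed names: the HeartbeatMonitor class, the service names, and the 30-second timeout are all hypothetical, and timestamps are passed in as plain numbers so the example stays deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per service and reports the silent ones."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, service: str, timestamp: float) -> None:
        # Called whenever a service reports that it is still working.
        self.last_seen[service] = timestamp

    def silent_services(self, now: float) -> List[str]:
        # A service is suspect once its last heartbeat is older than the timeout.
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.record("invoicing", timestamp=100.0)
monitor.record("membership", timestamp=125.0)
print(monitor.silent_services(now=140.0))  # ['invoicing']
```

The monitor never needs the failing process to say anything. The absence of a message is the signal.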

See also: Distributed Tracing Guide

Distributed tracing tools

The best distributed tracing tools are able to detect and interpret messages written to a number of common microservice status reporting standards. There are three types of distributed tracing tools:

  • Trace message collectors
  • Trace message consolidators
  • Distributed tracing monitors

The task of collecting distributed tracing messages is a specialized service. It needs a tool that knows where to look for messages and can recognize their formats.

Trace message consolidation and storage is a service that could be implemented with many general-purpose log file managers, such as Splunk. The producers of distributed tracing tools don’t like the tracking system to be referred to as logs. However, this might be because they don’t want to have to compete with the much larger field of logfile managers.

Distributed tracing monitors are the highest form of distributed tracing tools because they produce a full information service for managing applications based on microservices.

Distributed tracing monitors

Distributed tracing monitors are the ultimate tools for distributed tracing because they include all three elements needed to track the performance of microservices. They will collect, manage, and interpret distributed tracing messages, presenting live statuses. They also store records in a meaningful file structure or in a database so that they can be easily accessed for historical analysis. A good distributed tracing monitor will also include a message viewer and interpreter for root cause analysis.

One problem with distributed tracing is that microservices can generate a lot of distributed tracing messages – most of them are just progress records, logging the time each service starts working for a particular session, and “keep-alive” type messages that let the monitor know that the service is still processing. A good distributed tracing tool will filter out these workaday notifications, or interpret them as graphs.
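The filtering step might look something like the sketch below, which separates routine messages from notable ones and keeps only a per-service count of the routine traffic for graphing. The message layout (a dictionary with "service" and "kind" keys) and the set of routine kinds are assumptions made for illustration, not the format of any particular standard.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical message kinds that are treated as routine noise.
ROUTINE_KINDS = {"keep-alive", "progress"}


def split_messages(messages: List[Dict[str, str]]) -> Tuple[List[Dict[str, str]], Counter]:
    """Return (notable_messages, routine_count_per_service)."""
    routine_counts: Counter = Counter()
    notable = []
    for message in messages:
        if message["kind"] in ROUTINE_KINDS:
            # Summarize routine traffic instead of showing each record.
            routine_counts[message["service"]] += 1
        else:
            notable.append(message)
    return notable, routine_counts


messages = [
    {"service": "invoicing", "kind": "keep-alive"},
    {"service": "invoicing", "kind": "progress"},
    {"service": "membership", "kind": "keep-alive"},
    {"service": "invoicing", "kind": "error"},
]

notable, routine_counts = split_messages(messages)
print(notable)               # only the error message remains
print(dict(routine_counts))  # {'invoicing': 2, 'membership': 1}
```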

It is possible to use a logfile manager and set up search scripts to filter and group messages. However, this is a lot of work and it is better to let an automated monitoring tool do the work for you.

The best distributed tracing monitoring tools

The producers of the best distributed monitoring systems have been able to exploit many of the built-in features of the trace message format to produce very comprehensive and attractive monitoring services.

Our methodology for selecting a distributed tracing system

We reviewed the market for distributed tracing tools and analyzed the solutions based on the following criteria:

  • The ability to interact with serverless providers
  • Application dependency mapping
  • OpenTelemetry, OpenTracing, or OpenCensus compatibility
  • Data retention for historical analysis
  • Alerts and notifications for performance problems
  • A free trial or a demo system that offers an obligation-free testing opportunity
  • Value for money from a monitoring tool that competently tracks microservices at a fair price

Using this set of criteria, we looked for distributed tracing systems that can provide full activity and dependency maps for microservices.

1. Datadog APM (FREE TRIAL)

Datadog APM is a very similar service to that offered by New Relic. This is a cloud-based application performance monitor that includes many types of source data, including distributed tracing messages.

Key Features:

  • OpenTracing and OpenTelemetry
  • Interfaces to serverless platforms
  • Application dependency mapping
  • Performance alerts

Why do we recommend it?

The Datadog APM has evolved from its original function of monitoring all applications to being specifically geared towards monitoring Web applications and microservices – all other application monitoring has been shifted into the Infrastructure Monitoring unit. Distributed tracing is central to the operations of this module.

Datadog APM is able to collect and process OpenTracing and OpenTelemetry messages. The service files these messages along with other indicators. Datadog APM also collects its own statistics on microservice performance and application processing environments with agents, in a similar way to the method used by New Relic. In addition to monitoring standard distributed tracing messages, Datadog APM is able to interface with a range of AWS services, including the Lambda microservices platform.

A great feature of Datadog APM is the way it processes the data that it collects. It is able to generate visual representations that show the connections between microservices operating live in a hierarchy to fulfill requests from a given application. This is called the Service Maps system and it is very impressive. The Service Maps act as an index to all operating microservices that can be zoomed in on for greater detail.

Who is it recommended for?

This package is suitable for use by businesses that use serverless functions or APIs that are provided by others. If you are monitoring your own systems, you don’t need distributed tracing so much because you can insert log message generation to provide performance metrics. The highest plan gives you a code profiler as well.

Pros:

  • Offers a simple way to collect and process OpenTracing and OpenTelemetry messages
  • Supports monitoring across various environments – including microservices
  • Highly customizable dashboards, great for NOC teams
  • Cloud-based monitoring, can be accessed from anywhere
  • 400+ integrations can support nearly any deployment

Cons:

  • Would like to see a longer trial period

Datadog APM is a subscription service and you can get it on a 14-day free trial.

EDITOR'S CHOICE

Datadog APM is our first choice as it offers end-to-end distributed tracing for seamless front-end to back-end data monitoring. A fully scalable solution with code-level visibility.

Official Site: datadoghq.com/product/apm/

OS: Cloud-based

2. Site24x7 APM

Site24x7 APM is a cloud-based service that is ideally suited to remote monitoring techniques, including distributed tracing for microservices. The company doesn’t reveal which protocols it is capable of interpreting. However, it is able to monitor services written in Java, .NET, and Node.js.

Key Features:

  • Includes mobile app performance
  • Application dependency mapping
  • Resource monitoring

Why do we recommend it?

Site24x7 APM is similar to the Datadog system in that it focuses on Web applications and leaves all other applications to its Infrastructure plan. As well as distributed tracing, you get real-user monitoring and statistics on the activities of cloud platforms, containers, virtualizations, and networks.

The APM is structured to monitor applications running on a server, a website’s activities, and apps for mobile devices. The standard APM package will track the activities of three applications that depend on microservices. This number can be bumped up with add-on fees. The same package also gives you the capabilities to monitor 40 websites and servers in any combination – for example, 20 servers and 20 sites or 20 sites and 10 servers, and so on.

The APM Insights service has a distributed tracing section that gives an overall view of an application’s performance. This includes a graphic representation of the total processing time of the application segmented by the time taken by each contributing microservice. You can drill-down to observe the performance of each individual microservice and then get further in to see each individual message that service generated.

Who is it recommended for?

Site24x7 provides a bundle of services with its APM. On the Datadog platform, these other tools require the purchase of separate modules. This bundling makes the APM package very affordable, but it is still considerably more expensive than the Infrastructure plan. So, be sure that you really need distributed tracing before choosing this plan.

Pros:

  • Supports microservice monitoring via Java, .NET, and Node.js
  • Offers a host of out-of-box monitoring options and dashboard templates
  • Allows administrators to view dependencies within the application stack, good for building SLAs and optimizing uptime
  • Offers root cause analysis enhanced by AI to fix technical issues faster

Cons:

  • Site24x7 is a feature-rich platform with options that extend beyond distributed tracing tools, so it may require time to learn all options and features

Site24x7 APM is a subscription service and you can access it on a 30-day free trial.

Site24x7 APM is a great distributed tracing tool because it includes a wide range of performance monitoring and lag management utilities as well as microservice tracking. The combination of all of the monitoring techniques deployed by the APM means that it is a great tool for supervising apps written for Android and iOS while also examining how those same microservices perform when called from websites.

Get a 30-day free trial: site24x7.com/signup.html

Operating system: Cloud-based

3. New Relic Telemetry Data Platform

The New Relic Telemetry Data Platform is geared towards developers of apps as well as businesses that want to monitor their microservices infrastructure. The company stresses the affordability of its storage space, which shows that it understands that this type of service involves sorting through masses of data.

Key Features:

  • OpenTelemetry, OpenTracing, OpenCensus, and Zipkin
  • AWS and Azure
  • Adds on-site resource mapping

Why do we recommend it?

The New Relic Telemetry Data Platform is a leader in the field of distributed tracing. This company pioneered the concept of the APM and their move to focus on monitoring serverless systems with distributed tracing was followed by everyone else in the industry. Monitored subjects can be hosted on AWS or Azure.

New Relic centralizes data collection from many sources, including distributed tracing messages generated through OpenTelemetry, OpenTracing, OpenCensus, and Zipkin. Other data sources for the monitoring system include log files from applications and infrastructure devices plus a long list of AWS services, such as Lambda, and Azure, Apache, and operating system status reports.

The New Relic application monitoring service deploys its own agents to add extra insights into web and app performance, driven by microservice actions. These include browser monitors and connection testers.

Who is it recommended for?

There are several different standards for distributed tracing and they are not compatible. If you are monitoring third-party systems and hope to use distributed tracing, you would be wasting your money if your monitoring package is using a different standard to the functions that you want to track. New Relic covers all known telemetry standards.

Pros:

  • Designed for developers and technical users – offering lots of customization
  • Supports distributed tracing via OpenTelemetry, OpenTracing, OpenCensus, and Zipkin
  • Can monitor other systems including cloud, on-premise, and hybrid environments
  • Supports a free tier – great for in-depth trials and smaller projects

Cons:

  • Must contact sales for pricing on Pro and Enterprise plans

New Relic Telemetry Data Platform has a free tier that will process up to 100 GB of data per month.

4. Lightstep

The founder of Lightstep was one of the designers of Google’s Dapper in-house distributed tracing platform. Dapper was open-sourced as OpenCensus, becoming the favored distributed tracing tool of many major IT businesses, including Microsoft. Now, OpenCensus is a part-contributor to the new, unified OpenTelemetry system.

Key Features:

  • OpenTelemetry and OpenCensus
  • Hierarchy visualization
  • Free version

Why do we recommend it?

Lightstep is a competent distributed tracing service and it is one of the best of its kind. However, while rival systems, such as Datadog and Site24x7 also offer opportunities for monitoring networks and other applications, Lightstep only monitors Web applications. It provides the two most widely used protocols for distributed tracing.

Lightstep bases all of its monitoring work on distributed tracing, so it won’t integrate with monitors for other processing activities. However, Lightstep does microservices monitoring very well and that’s why it gets onto our list of the very best distributed tracing tools.

The graphical representations used by Lightstep to interpret distributed tracing data are the key advantage of this tool. It displays the hierarchy of operating microservices serving an application, which is called the Operations Diagram. This leads to microservice performance data that also shows resource usage graphs and response time visuals.

Who is it recommended for?

Lightstep is a specialized distributed tracing tool, so it won’t integrate with your entire monitoring system. Therefore, you would need to be very intensively using third-party APIs and functions to justify using this tool. This would be suitable for developers and providers that rely on third-party components.

Pros:

  • Provides a simple yet informative look at your distributed traces
  • Sysadmins can easily sort through traces and filter them by operation, date, and source
  • Can display microservice dependencies for each application
  • Offers a free Community Edition

Cons:

  • Could benefit from a longer 30-day trial
  • Focuses solely on monitoring distributed traces – doesn’t support network/infrastructure monitoring

Lightstep is a subscription service that is delivered from the Cloud. It is available in three editions: Community, Pro, and Enterprise. The Community edition is free to use and the Pro version is available on a 14-day free trial.

5. Dynatrace

Dynatrace is an AI-driven application performance monitor that is delivered from the Cloud. While covering many methods for status detection, Dynatrace applies machine learning and heuristics to identify important information from the large amounts of data that most reporting and logging systems generate.

Key Features:

  • OpenTracing
  • Development support
  • AI-based

Why do we recommend it?

Dynatrace provides Infrastructure Monitoring and a higher plan that gives you an APM as well. This APM relies on distributed tracing for Web application monitoring and for all other assets, you would rely on the Infrastructure system – the combined package is called Full Stack Monitoring.

Among the data sources for the Dynatrace system is the distributed tracing open standard OpenTracing. Dynatrace collects and processes activities that contribute to a given application. It then tracks back, by interpreting distributed tracing messages to identify all of the microservices that worked on a session for that application.

The support that Dynatrace gives to system managers and developers enables new microservices to be written efficiently and tested. Systems managers can let Dynatrace monitor application performance because it will raise an alert if it identifies a problem with a microservice.

Who is it recommended for?

Dynatrace is a multipurpose system that can monitor on-premises systems and cloud services, so you would be able to use it for all of your monitoring needs, not just Web application tracking. Its distributed tracing function uses OpenTracing and its successor OpenTelemetry, so if the functions you are tracing integrate some other standard, you will be in trouble.

Pros:

  • Tracks distributed traces using OpenTracing
  • Highly visual and customizable dashboards, excellent for enterprise NOCs
  • Operates in the cloud, allowing it to be platform-independent
  • Can monitor application uptime as well as the supporting infrastructure and user experience

Cons:

  • Designed specifically for large networks, smaller organizations may find the product overwhelming

You can experience the Dynatrace application monitor on a 15-day free trial.

Choosing a distributed tracing monitoring tool

All distributed tracing standards include a priority field in the message structure. This is a great help because it means that distributed tracing monitors can quickly categorize the statuses of all microservices working on a session and generate alerts when high-priority messages come through. Alert-based systems allow technical staff to get on with other tasks, assuming that all microservice operations are working well. They only need to pay attention to the monitor when an alert arises.
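A priority-driven alert check could be sketched as follows. The TraceMessage fields, the numeric priority scale (higher meaning more urgent), and the threshold of 3 are all assumptions chosen for the example. Real standards name and encode this kind of field differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceMessage:
    service: str
    priority: int  # hypothetical scale: a higher number is more urgent
    text: str


def high_priority(messages: List[TraceMessage], threshold: int = 3) -> List[TraceMessage]:
    """Return only the messages urgent enough to raise an alert."""
    return [message for message in messages if message.priority >= threshold]


session_messages = [
    TraceMessage("membership", 1, "session started"),
    TraceMessage("invoicing", 4, "database timeout"),
    TraceMessage("database-access", 2, "slow query"),
]

for alert in high_priority(session_messages):
    print(f"ALERT [{alert.service}] {alert.text}")
# prints: ALERT [invoicing] database timeout
```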

The messaging system includes an identifier field that each microservice adds a number to. This identifier gives a trace through the microservice hierarchy and enables monitoring tools to identify the structure that serves an app. Really good monitors produce live maps of this process hierarchy that expose the complexity of all the interactions that are going on.

Alerts, graphs, and analysis tools make supervising microservice activity a lot easier. The best distributed tracing monitors cut through all of the clutter of a mass of messages to present a clear live picture of microservice activity.
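To show how identifiers let a monitor rebuild the hierarchy, here is a minimal sketch that assumes each record carries its own ID plus the ID of the service call that triggered it. That parent-reference layout is an assumption for illustration and is only one way of encoding the trace. The service names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Each span is (span_id, parent_id, service); parent_id is None for the root.
Span = Tuple[str, Optional[str], str]


def build_children(spans: List[Span]) -> Dict[Optional[str], List[Tuple[str, str]]]:
    """Group spans by the ID of the span that called them."""
    children: Dict[Optional[str], List[Tuple[str, str]]] = defaultdict(list)
    for span_id, parent_id, service in spans:
        children[parent_id].append((span_id, service))
    return children


def render(children, parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Walk the hierarchy depth-first and indent each service by its level."""
    lines: List[str] = []
    for span_id, service in children.get(parent_id, []):
        lines.append("  " * depth + service)
        lines.extend(render(children, span_id, depth + 1))
    return lines


spans = [
    ("a", None, "mobile-app-gateway"),
    ("b", "a", "membership"),
    ("c", "a", "invoicing"),
    ("d", "c", "database-access"),
]

tree_lines = render(build_children(spans))
print("\n".join(tree_lines))
```

The printed outline is a text version of the live maps that the better monitors draw graphically.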

Distributed tracing tools FAQs

What are distributed tracing tools?

A distributed tracing tool is an application monitoring package that is able to run alongside microservices and record performance data. Microservices are hosted on “serverless” cloud accounts. These do not leave any space on which to install the monitoring software or even a data collection agent. Thus, traditional monitoring services can’t be used because they rely on being able to run on the same server in order to gather statistics. Instead, distributed tracing uses a method called telemetry.

What is distributed tracing?

Distributed tracing is the practice of identifying all of the modules that contribute to the operations of a Web application or mobile app. Once a frontend application is run, it will call in many other backend functions, which might be hosted elsewhere. The distributed tracing system has to follow that execution hierarchy in order to identify the precise source of performance errors or security weaknesses.

What is tracing in DevOps?

DevOps environments merge the development team and operations team into one group. These people both create and manage applications and this strategy is particularly applied to the provision of Web applications and mobile apps. During development, tracing is used to verify APIs, frameworks, and function libraries that are going to be used to build the application, then it is used during system testing. When the application goes live, operations monitoring requires tracing to ensure the efficient and secure execution of the application.