AIOps platforms combine IT operations tools with advanced data management systems.
Not every business is the same, so off-the-shelf system management tools often need to be tailored before they work properly. Artificial intelligence can make those adjustments automatically, making it easier for monitoring and management suppliers to deliver useful services.
Here is our list of the five best AIOps platforms:
- Datadog APM: A package of IT system monitoring and management tools that includes AI-driven system mapping and root cause analysis modules. This is a cloud-based service.
- LogicMonitor: A cloud-based monitoring system that maps all hardware and logs all software, creating dependency maps to facilitate AI-based root cause analysis.
- Dynatrace: This cloud-based AI-driven monitoring platform is particularly strong at identifying the services that underpin websites and Web services.
- AppDynamics: A cloud-based monitoring platform that tracks activity within a system in order to spot the ripple effect of one problem impacting other system areas.
- New Relic One: This platform deploys AI to create a system monitoring service that is particularly adept at viewing the application stack. This is a cloud-based service.
AI techniques are particularly useful in two areas of IT operations:
- Systems management
- Root cause analysis
The interconnections between different IT services can be difficult to track. The knock-on effect of an issue in one area of operations may become apparent only when performance problems arise in a different part of the system.
System management and root cause analysis present the same issues from different perspectives. System management involves spotting system issues when they first arise, preventing them from propagating through to performance problems. Root cause analysis starts at the other end of the pipeline, beginning with a performance problem and chaining through the app stack to identify the true issue.
AIOps for system management not only prevents problems from arising but also maps systems from software through to hardware components. This lays down paths for investigation. Properly documented systems are much easier to analyze. System mapping services that form part of an AIOps platform speed up problem resolution because the root cause analysis module already has system information available. Thus, it can cut out the first phase of the investigation, which is working out how the system fits together.
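As a rough illustration of why a stored system map shortens root cause analysis, here is a minimal Python sketch. The component names, the `DEPENDS_ON` map, and the `find_root_causes` function are all hypothetical and are not taken from any of the products discussed here. The sketch assumes that monitoring has already flagged which components are unhealthy.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists the components it relies on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web_frontend": ["checkout_service"],
    "checkout_service": ["orders_db", "payment_api"],
    "orders_db": ["storage_array"],
    "payment_api": [],
    "storage_array": [],
}


def find_root_causes(symptom: str, unhealthy: Set[str]) -> List[str]:
    """Walk down the stack from the component showing the symptom.

    A component is treated as a root cause if it is unhealthy and none of
    its direct dependencies are unhealthy.
    """
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [symptom]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        deps = DEPENDS_ON.get(node, [])
        if node in unhealthy and not any(d in unhealthy for d in deps):
            roots.append(node)
        stack.extend(deps)
    return sorted(roots)


if __name__ == "__main__":
    flagged = {"web_frontend", "checkout_service", "orders_db", "storage_array"}
    print(find_root_causes("web_frontend", flagged))
```

In this example the user-facing front end is where the complaint starts, but the walk ends at the storage array, which is the only flagged component with no flagged dependency beneath it. Because the map already exists, the search is a simple graph traversal rather than a discovery exercise.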
Root cause analysis
Spotting the real cause of a problem can be a very time-consuming task in modern systems that include so many layers of services. Often, the component that experiences a problem isn’t the cause of the problem.
As they are user-facing, software packages will always provoke more complaints about poor performance than underlying services. So, initial reports of performance problems are only starting points for investigation.
Drilling through manually to the real cause of problems is difficult because it requires the full range of system management skills. Troubleshooting ends up being a team effort, requiring skills from many different technical specialists. Unfortunately, highly skilled staff are in short supply and are also very highly paid.
AIOps platforms distill expertise and store possible solutions to a myriad of problems. Many AI programs are built on heuristics that are based on probability. Essentially, this approach involves exploring decision trees: if A is happening, the cause could be X (50 percent), Y (40 percent), Z (5 percent), or as yet unknown (5 percent). That method of operation closely mirrors the question-and-answer process that IT operations technicians go through when investigating problems.
The “as yet unknown” option is a key characteristic of AI systems because it accounts for all situations. The system isn’t useless if it encounters a problem that has never arisen before. It just identifies an area that requires further research. If a system can provide a solution nine times out of ten, it has saved a lot of time and effort. Those unusual events will need human intervention, but the results of that investigation can be fed into the AI system so that human involvement won’t be needed if that rare occurrence happens again.
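The probability-weighted lookup and the feedback loop described above can be sketched in a few lines of Python. This is only an illustration: the `CauseRanker` class is hypothetical, and the rule of reserving one extra "slot" of probability for the unknown case is an assumption made for this example, not a description of how any particular AIOps product works.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

UNKNOWN = "as yet unknown"


class CauseRanker:
    """Ranks the likely causes of a symptom from past resolved incidents."""

    def __init__(self) -> None:
        # symptom -> how many times each cause was confirmed for it
        self._history: Dict[str, Counter] = defaultdict(Counter)

    def record(self, symptom: str, cause: str, times: int = 1) -> None:
        """Feed the outcome of an investigation back into the knowledge base."""
        self._history[symptom][cause] += times

    def rank(self, symptom: str) -> List[Tuple[str, float]]:
        """Return (cause, probability) pairs, most likely first.

        Assumption: one extra slot is always reserved for a cause that has
        not been seen yet, so the probabilities sum to 1.
        """
        counts = self._history.get(symptom, Counter())
        total = sum(counts.values())
        ranked = [(cause, n / (total + 1)) for cause, n in counts.most_common()]
        ranked.append((UNKNOWN, 1 / (total + 1)))
        return ranked


if __name__ == "__main__":
    ranker = CauseRanker()
    ranker.record("A", "X", times=10)
    ranker.record("A", "Y", times=8)
    ranker.record("A", "Z", times=1)
    for cause, probability in ranker.rank("A"):
        print(f"{cause}: {probability:.0%}")
```

With 19 recorded incidents split 10, 8, and 1, the ranking reproduces the 50, 40, 5, and 5 percent split used in the example above. A symptom that has never been recorded returns only the unknown entry, which is the signal that a human investigation is needed. Calling `record` with the result of that investigation is the step that removes the need for human involvement the next time.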
So, AIOps platforms are very beneficial in incident management scenarios because they provide the knowledge base of many specialists to one operator, who doesn’t necessarily need deep technical knowledge in order to solve problems. They also limit the involvement of specialists to rare incidents and record the solution for next time.
Taking that provision of expertise one step further, it isn’t even necessary to have that operator there to use an AI-guided analysis service. Proactive monitoring tools, working constantly, can be triggered automatically by performance thresholds and produce recommendations for technician intervention where required. Thus, system management and root cause analysis features in an AIOps platform work together to save time and money.
Characteristics of AIOps platforms
The term “platform” implies more than just a piece of software. A platform is a suite of tools that interact with each other. In many ways, a platform is much like an operating system because many of those tools act as services that aren’t directly accessible. Other tools are interfaces that select services according to user requests and actions.
An AIOps platform has AI processes running all the way through its stack. An AI-driven interface interacts with AI-based services to provide a solution. Some AIOps platforms provide APIs and plugins for other systems, so the user doesn’t necessarily need to log in to the platform in order to benefit from its services.
There are a number of characteristics that identify AIOps platforms:
- A suite of tools
- Machine learning capabilities
- Shortcuts to processing large volumes of data
- Stored solutions
- Data access interface
- Decision trees
- Natural language querying
- Interfaces to third-party systems
The best AIOps platforms
There is now a range of delivery options available for AIOps platforms and their capabilities are varied. There are many cloud platforms available now, so you don’t necessarily need to wonder about which system will run on the operating system of the servers that you have on your site. In fact, if you run an entire virtual system, you might not have your own servers on-site.
AIOps systems also need to be able to explore cloud-based resources. You might have wireless networks on-site and you might also need to include performance monitoring for more remote sites or devices used by remote workers.
These services are all very similar and so you will need to try them out for yourself in order to decide which is right for your system. We have narrowed down the candidates to a very short list in order to save you assessment time. You can read more about each of these AIOps platforms in the following sections.
Datadog is a cloud-based system monitoring and management platform that has AI processes threaded through it with its Watchdog module. Watchdog operates both as a system monitoring assistant and a root cause analysis tool. In order to fully benefit from the AI services of Datadog, you could add in other plans, such as infrastructure, which monitors networks and servers.
The APM plan of Datadog includes application, cloud services, and website performance monitoring. The AI functions of the Datadog service apply to all of these individual systems. It links front-end interface performance to app stack services, and then down through server services and hardware capacity to network device performance and traffic patterns.
The Watchdog system is able to thread together application and service dependencies, creating a service map that prepares the monitoring service for automated root cause analysis when issues arise. The artificial intelligence service informs performance monitoring by applying machine learning to baselining, anomaly detection, and outlier detection in ongoing system monitoring. The application stack map is also available for viewing to support manual system exploration.
The full complement of Datadog services adds up to an AIOps platform. The system includes a performance threshold service with alerts that identify potential problems. These thresholds apply to all components of an IT system and they can be set to trigger notifications. Those notifications can be sent by email, SMS, or Slack post.
Datadog APM is a subscription package with a rate per month or per year. You can get the APM on a 14-day free trial.
The Datadog APM package is our top pick for an AIOps platform because it includes monitoring and management tools for websites and applications that can drill down to the supporting server. The Watchdog element in the platform provides constant performance anomaly detection based on machine learning. This provides AI processes for system maintenance. Watchdog is able to perform root-cause analysis on application performance, relying on system maps laid down by the constant system monitoring services of other Datadog modules. Datadog is able to monitor cloud services and integrate system management for multiple sites.
Get a 14-day free trial: datadoghq.com/free-datadog-trial/
Operating system: Cloud-based.
LogicMonitor operates from the cloud and it can monitor your on-premises, remote, and cloud-based infrastructure. The aim of LogicMonitor is to provide as much process automation as possible and it deploys AI in its operations management services to achieve this aim. This makes LogicMonitor an AIOps platform.
LogicMonitor is packaged as Core monitoring and Website monitoring services. The Core platform is available in two editions: Pro and Enterprise. The AI features of LogicMonitor are included in the Enterprise plan.
The Website monitoring package is centered on site testing tools. The Core package also includes Website monitoring services, so the Enterprise plan does include systems to help you manage Web services and websites. The Enterprise edition also has cloud monitoring capabilities. The platform includes network and server monitoring, traffic analysis, storage device monitors, and application monitoring services.
The AIOps Early Warning System is one of the AI-driven advantages that the Enterprise plan has over the Pro edition. This is an anomaly detection service that adjusts its behavior baseline with a machine learning process. The anomaly detection that operates on top of this baseline also includes AI services.
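To make the idea of a self-adjusting baseline concrete, here is a minimal Python sketch. It is not LogicMonitor's algorithm, which is not described here. It swaps in a simple, well-known technique: a rolling mean and standard deviation, with readings flagged when they fall more than a set number of standard deviations from the recent average. The `RollingBaseline` class, the window size, the warm-up length, and the threshold are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flags readings that stray too far from a moving baseline."""

    def __init__(self, window: int = 20, threshold: float = 3.0, warmup: int = 5) -> None:
        self.values = deque(maxlen=window)  # only the most recent readings are kept
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.warmup = warmup                # readings needed before judging anything

    def observe(self, value: float) -> bool:
        """Return True if the reading is anomalous, then fold it into the baseline."""
        anomalous = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        self.values.append(value)
        return anomalous


if __name__ == "__main__":
    baseline = RollingBaseline()
    for reading in [100, 102, 98, 101, 99, 100, 250]:
        print(reading, baseline.observe(reading))
```

Because old readings drop out of the window, the baseline drifts with normal changes in behavior instead of relying on a fixed threshold set by hand. A production service would typically go further, for example by accounting for daily and weekly cycles.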
LogicMonitor provides device discovery, application dependency mapping, root cause analysis, and AI-adjusted baselining. This is a thorough AIOps platform for day-to-day system management, demand forecasting, capacity planning, and incident response. You can check it out for yourself on a 14-day free trial.
Dynatrace is a cloud-based platform that offers infrastructure and application monitoring for on-premises and cloud infrastructure. This service is an AIOps platform that includes application security, performance testing, and business analytics tools as well as everyday system monitoring. The tool uses AI processes to improve anomaly detection and mean time to respond when problems arise.
The Dynatrace platform includes an AI engine called Davis. This service starts by mapping all hardware and software, creating a dependency topology map for all resources. This readies the system for instant root cause analysis when performance issues arise. The performance tracking capabilities of this service do not segment your system by location, so all resources, no matter where they are, are integrated into a hybrid landscape.
Davis watches for anomalies in performance and system usage, spotting shortfalls in capacity and warning of issues before they arise. The system discovery process functions constantly and automatically, so if a new service is added, the dependency map gets updated. This tracking extends through APIs to microservices, so capacity issues can be headed off, even when they are caused by third-party processes.
The Dynatrace service also extends to log management and user experience monitoring. Its services are useful for DevOps environments because they can assist developers and testers in working out whether new code is faulty or just badly served by supporting modules and infrastructure. Operations staff can continue to benefit from Dynatrace AI-based monitoring once that new code goes live.
As a SaaS system, Dynatrace is accessed through a cloud-based console that is available through any standard browser. AI features are also active in the self-installation and automatic setup of the Dynatrace service. Dynatrace offers a free trial of its AIOps platform.
AppDynamics is a division of Cisco Systems and it organizes its IT operations support platform as a series of specialized modules which can be subscribed to in bundles. Those modules are Infrastructure Monitoring, Application Performance Monitoring, Database Monitoring, Business Performance Monitoring, and End User Monitoring.
Infrastructure Monitoring is offered as a standalone module. AppDynamics also offers a bundle of infrastructure, application, and database monitoring services called the Premium edition. The Enterprise edition includes all AppDynamics modules.
Both the Infrastructure Monitoring and Application Performance Monitoring modules map resource dependencies that prepare the operations monitoring system for root cause analysis should things go wrong. Although these monitoring services are delivered from the cloud, they will track the performance of all infrastructure whether it is on your premises or based in the cloud.
The AI component of AppDynamics is called Cognition Engine. This adjusts performance expectation thresholds with machine learning. It recognizes that the performance of one resource is dependent on others and can calculate the minimum level of service that a supporting service must deliver to enable a service higher up the stack to meet its performance targets.
The Cognition Engine watches for bottlenecks and raises alerts when capacity issues lower down the stack are building into noticeable performance impairment.
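The idea of deriving a minimum service requirement for a supporting service from a target higher up the stack can be shown with simple arithmetic. The Python sketch below is a hypothetical illustration and does not represent how Cognition Engine performs its calculation. It assumes that the upper service calls the supporting service sequentially, and the function name and all of the figures are invented for the example.

```python
def max_dependency_latency_ms(
    target_ms: float,
    own_processing_ms: float,
    calls_per_request: int = 1,
) -> float:
    """Slowest response the supporting service can give, per call, while the
    service above it still meets its own response time target.

    Assumes the calls are made one after another, not in parallel.
    """
    if calls_per_request < 1:
        raise ValueError("calls_per_request must be at least 1")
    budget = target_ms - own_processing_ms
    if budget <= 0:
        raise ValueError("the upper service already exceeds its target on its own")
    return budget / calls_per_request


if __name__ == "__main__":
    # A page with a 500 ms target spends 200 ms on its own work and makes
    # three sequential database calls per request.
    print(max_dependency_latency_ms(500, 200, calls_per_request=3))  # 100.0
```

In this example, each database call has to complete within 100 ms. If monitoring shows the database trending toward that limit, an alert can be raised before users notice a slow page.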
AppDynamics is a subscription service with a rate per month for each edition. The service is available for a free trial.
New Relic One is a system monitoring service that has AI and machine learning processes built into it for performance threshold setting. This service is available in three editions and the lowest of these, called Standard, allows one user account for free. You get charged for additional users.
This is a cloud-based AIOps platform that covers infrastructure and applications and it also offers digital experience monitoring services for websites. The tool is able to track all services, even those that lie behind APIs and are hosted on third-party servers. It is able to map resource dependencies and offers constant monitoring while also providing a support system for root cause analysis.
Higher plans of the New Relic One platform add on bigger system test allowances for websites and SLA guarantees. These packages are called Pro and Enterprise. The top plan, Enterprise, includes User Management features.
The console for New Relic One is hosted in the cloud and accessed through any standard Web browser. The dashboard features live performance statistics and data visualizations, which include maps of service dependencies and hardware topologies. Although the monitor does examine physical systems, it is much more geared towards software monitoring. It is particularly strong at tracking the performance of virtualizations, which inhabit a world that spans both physical infrastructure and applications.
New Relic One is able to spot potential problems caused by capacity issues and chain back down the stack to identify root causes when the software fails.