A Faster, Easier Way to Troubleshoot Microservices Environments


As microservices environments continue to grow in size and complexity, they create bigger challenges for DevOps teams working to monitor the health and behavior of their systems. New Relic recently announced the general availability of a new feature that addresses this challenge: anomaly detection for distributed tracing.

Anomaly detection automatically surfaces the anomalous parts of a trace, which allows customers to find and focus directly on sources of latency. It’s one in a series of New Relic releases devoted to distributed tracing improvements, including condensed trace views, deployment markers, and trace groupings, all of which help DevOps teams understand and troubleshoot microservices environments more quickly.

The challenges of using trace data to resolve performance bottlenecks

Distributed tracing helps you understand how requests flow through your microservices environment, which makes it useful for finding sources of latency and errors. By tracing a request as it goes from one service to another, and timing the duration of important operations within each service, you get a complete picture of any performance issues or bottlenecks that affect the request within a distributed system.
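To make the idea concrete, here is a minimal sketch of how a request can be traced across services and how each operation can be timed. It is not New Relic's agent or API; the `Span`, `Trace`, and `timed_span` names, the service names, and the `handle_checkout` request are all hypothetical, and the "services" are simulated with short sleeps inside one process.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One timed operation inside a trace (hypothetical structure)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    duration_ms: float = 0.0


@dataclass
class Trace:
    """All spans that belong to a single request."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)


@contextmanager
def timed_span(trace: Trace, service: str, name: str, parent: Optional[Span] = None):
    """Time the enclosed block and record it as a span on the trace."""
    span = Span(
        trace_id=trace.trace_id,
        span_id=uuid.uuid4().hex,
        parent_id=parent.span_id if parent else None,
        service=service,
        name=name,
    )
    start = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_ms = (time.perf_counter() - start) * 1000.0
        trace.spans.append(span)


def handle_checkout() -> Trace:
    """Simulate one request that passes through three services."""
    trace = Trace()
    with timed_span(trace, "web", "POST /checkout") as root:
        with timed_span(trace, "inventory", "reserve_items", parent=root):
            time.sleep(0.01)  # stand-in for a call to another service
        with timed_span(trace, "payments", "charge_card", parent=root):
            time.sleep(0.02)  # stand-in for a slower downstream call
    return trace
```

Calling `handle_checkout()` returns a trace whose spans share one `trace_id` and point to their parent through `parent_id`, which is the information a trace view needs to reconstruct the request flow and show where the time went.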

However, digging through all of this trace data to find actionable insights can be time-consuming—requiring you to filter down to relevant traces that will actually help you spot the problem. As distributed systems continue to scale and get more complex, this process of finding the relevant traces and spans required to solve a performance problem will just keep getting harder.

In addition, even after you find a relevant trace, it’s often difficult to understand the flow of the trace—and it can be impossible to tell whether a specific span within the trace was performing normally. To make matters worse, many distributed tracing solutions provide only high-level, service-to-service details. These solutions fail to drill down into the details that indicate what’s actually happening inside the service—yet these highly specific, in-process details are precisely what a developer needs to see when the source of a performance problem lies within the service itself.

Highlighting performance bottlenecks with applied intelligence

New Relic has introduced anomaly detection to automatically highlight unusually slow spans within a trace, making it easier for you to find and focus on these anomalous sources of latency. New Relic Applied Intelligence (a set of services that includes artificial intelligence, machine learning, and advanced statistical analysis) plays a key role in this capability by making connections and uncovering actionable insights within even the largest trace data sets.

When New Relic views a trace, it uses anomaly detection to compare the spans within that trace to the spans of other, similar traces. It then highlights spans with longer-than-normal latency based on this comparison. For each anomalous span, New Relic displays a summary that shows why it flagged the span as anomalous. New Relic also creates histogram charts showing the duration distribution of similar spans and how the anomalous span compares to its peers over the past six hours. These histograms are particularly useful for understanding how much of an outlier a particular span represents.
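The article does not describe the statistics New Relic uses, so the following is only an illustrative sketch of the general idea: compare each span's duration to the durations of similar spans and flag the ones that are much slower than their peers. The choice of median plus a multiple of the median absolute deviation, the `k` and `min_peers` parameters, the dictionary field names, and both function names are assumptions made for this example. The caller is assumed to pass in peer spans from a recent window, such as the past six hours.

```python
from collections import defaultdict
from statistics import median


def flag_anomalous_spans(trace_spans, history, k=5.0, min_peers=5):
    """Flag spans that are much slower than similar spans in `history`.

    Spans are dicts with "service", "name", and "duration_ms" keys
    (hypothetical field names). Two spans count as "similar" when they
    share the same service and name, which is an assumption of this sketch.
    """
    peers = defaultdict(list)
    for span in history:
        peers[(span["service"], span["name"])].append(span["duration_ms"])

    flagged = []
    for span in trace_spans:
        durations = peers.get((span["service"], span["name"]), [])
        if len(durations) < min_peers:
            continue  # too few peers to say what "normal" looks like

        med = median(durations)
        # Median absolute deviation: a spread measure that is robust to outliers.
        mad = median(abs(d - med) for d in durations)
        threshold = med + k * max(mad, 1e-9)

        if span["duration_ms"] > threshold:
            flagged.append(
                {**span, "peer_median_ms": med, "threshold_ms": threshold}
            )
    return flagged


def duration_histogram(durations, bucket_ms=10.0):
    """Count durations per fixed-width bucket, keyed by bucket lower bound."""
    counts = defaultdict(int)
    for d in durations:
        counts[int(d // bucket_ms) * bucket_ms] += 1
    return dict(sorted(counts.items()))
```

Each flagged span carries the peer median and the threshold it exceeded, which mirrors the kind of summary that explains why a span was flagged, and `duration_histogram` produces the bucket counts a duration-distribution chart would plot.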

These capabilities also help teams in microservices environments, who often look at traces that involve their own services but may also look at traces involving services that they don’t own and don’t understand deeply. For these teams, it’s often difficult to understand what’s “normal” in a trace. With New Relic’s anomaly detection, these teams now get a fast and highly accurate sense of what’s normal in a given situation versus what’s unusual and needs attention.

Anomalous spans flagged for review with performance benchmarking against similar spans.


Faster root-cause identification with improved trace navigation and context

We’ve also released a number of improvements that make it easier to pinpoint underlying problems using trace data.

Trace groupings: Reproducing issues often requires identifying patterns: Are traces that originate from a specific service slower than others? Is a specific entry span slower or faster than others? Are errors clustered during a specific time period? Trace groupings let you organize traces by errors, root service, root entry span, and service entry span, giving you more ways to zero in on traces that are relevant to the problem you’re trying to solve.
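As a rough illustration of what grouping buys you, the sketch below buckets trace summaries by one attribute and reports the count, average duration, and error count per group. The `group_traces` function and the field names (`root_service`, `root_entry_span`, `duration_ms`, `has_error`) are hypothetical and are not New Relic's data model.

```python
from collections import defaultdict
from statistics import mean


def group_traces(traces, key):
    """Group trace summaries by `key` and compute simple per-group stats.

    `traces` is a list of dicts; `key` names the attribute to group by,
    for example "root_service" or "root_entry_span" (hypothetical names).
    """
    groups = defaultdict(list)
    for trace in traces:
        groups[trace[key]].append(trace)

    return {
        value: {
            "count": len(members),
            "avg_duration_ms": mean(m["duration_ms"] for m in members),
            "error_count": sum(1 for m in members if m["has_error"]),
        }
        for value, members in groups.items()
    }
```

Grouping the same traces first by root service and then by root entry span answers two of the questions above: whether traces from one service are slower than others, and whether one entry point stands out.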

Deployment markers: Deployments and other changes often cause performance problems. In these cases, correlating a difference in trace behavior with a particular deployment to an upstream or downstream service lets you identify the offending service more quickly. Embedded deployment markers provide this…