Working with complex and distributed technologies in modern software environments can be challenging. Our Complexity in Context video series is designed to give you a clear, hands-on understanding of how to use New Relic to cut through that complexity.
This inaugural episode delves into how observability has changed as modern software has evolved from monolithic applications to microservices-based environments. We also walk through an example of how to use New Relic to troubleshoot performance issues in microservices environments in four steps:
- Get an overall view of application performance
- Use distributed tracing to drill down into performance issues
- Use benchmarking to identify anomalous spans
- Gather detailed data to fix performance problems
You can read through the steps, and then watch the entire 4-minute video embedded at the end of this post.
The evolution of observability
To begin, it’s helpful to review the key differences between a modern microservices application architecture and a legacy monolithic application, and especially how these differences affect the way you identify and troubleshoot application performance issues.
When working with a monolithic application, it’s important to look deep into a system, down to the code level. That’s the only way for developers to understand the application’s internals.
In contrast, in a microservices environment, you may be dealing with dozens or even hundreds of services. There are countless ways for these services to call one another and for requests to flow through them. Seeing how those services connect and how your requests flow through a specific combination of services, as illustrated in the following image, matters much more in microservices environments:
Understanding how requests flow through these complex, distributed environments can be challenging. But you need to untangle this complexity in order to diagnose and fix performance issues quickly and accurately.
In addition, in many cases, different teams own and maintain different services within a distributed application. Unless you know which services are involved in a performance issue, it can be very difficult to identify which team should take the lead in fixing it.
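To make the ownership problem concrete, here is a minimal Python sketch, not anything New Relic itself exposes. It models a call graph as a dictionary, lists every path a request could take from an entry service, and looks up which teams own the services on each path. The service names echo the example used later in this post, but the call edges, the team names, and the helpers `request_paths` and `teams_for_path` are all assumptions made up for illustration.

```python
from typing import Dict, List

# Hypothetical call graph: each service maps to the services it calls.
CALLS: Dict[str, List[str]] = {
    "web-portal": ["login", "promo", "fulfillment"],
    "promo": ["inventory"],
    "login": [],
    "fulfillment": [],
    "inventory": [],
}

# Hypothetical ownership table: which team maintains which service.
OWNERS: Dict[str, str] = {
    "web-portal": "portal-team",
    "login": "identity-team",
    "promo": "marketing-team",
    "fulfillment": "logistics-team",
    "inventory": "supply-team",
}


def request_paths(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every call path from `start` down to a service that calls nothing.

    Assumes the graph has no cycles; a cyclic graph would recurse forever.
    """
    callees = graph.get(start, [])
    if not callees:
        return [[start]]
    paths: List[List[str]] = []
    for callee in callees:
        for tail in request_paths(graph, callee):
            paths.append([start] + tail)
    return paths


def teams_for_path(path: List[str], owners: Dict[str, str]) -> List[str]:
    """Return the owning teams along a path, in call order, without duplicates."""
    teams: List[str] = []
    for service in path:
        team = owners.get(service, "unknown")
        if team not in teams:
            teams.append(team)
    return teams


if __name__ == "__main__":
    for path in request_paths(CALLS, "web-portal"):
        print(" -> ".join(path), "| teams:", ", ".join(teams_for_path(path, OWNERS)))
```

Even in this five-service toy, a single request can involve three different teams. With hundreds of services, maintaining such a map by hand quickly becomes impractical.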
New Relic can help cut through this complexity, so you can understand the relationships between services and how they’re performing, and uncover the source of performance issues.
Let’s look at exactly how it’s done.
Step One: Get an overall view of application performance
For this example, start with a hypothetical scenario: You’ve just joined the team working on a web portal service at a telecom company. The support team informs you that customers are complaining about intermittent slowness on the web portal service—specifically, on the page where customers select a mobile phone for their service.
The first step is to look at the big picture of how the web portal service works with all the other services in the environment. The New Relic APM service map shows you a comprehensive topology of how everything fits together:
The service map shows that the web portal service connects to a number of different services, including a login service and a fulfillment service. It also reveals a polyglot environment: The web portal service is in Java, for example, while the login service is in Go, and the fulfillment service is in Ruby.
In addition, selecting any of the services visible on the service map presents a quick overview of its response time, Apdex, and other performance metrics:
Once you understand which services impact the web portal service and the connections between these services, your next step is to drill down into the environment and begin your search for the problem that’s behind the customer complaints.
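For readers unfamiliar with Apdex, one of the metrics the service map summarizes, the sketch below computes the score from a list of response times using the standard Apdex definition. Requests at or under a threshold T count as satisfied, those between T and 4T count as tolerating, and the rest count as frustrated. The function name `apdex`, the sample values, and the 0.5-second threshold are illustrative assumptions. The sketch also ignores errors, which a real monitoring tool may treat differently.

```python
from typing import Iterable


def apdex(response_times_s: Iterable[float], threshold_s: float) -> float:
    """Standard Apdex score: (satisfied + tolerating / 2) / total.

    satisfied:  response time <= T
    tolerating: T < response time <= 4T
    frustrated: response time > 4T (contributes nothing to the score)
    """
    times = list(response_times_s)
    if not times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in times if t <= threshold_s)
    tolerating = sum(1 for t in times if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(times)


if __name__ == "__main__":
    # Hypothetical response times, in seconds, for one service.
    samples = [0.2, 0.3, 0.4, 0.9, 1.5, 2.5]
    print(round(apdex(samples, threshold_s=0.5), 2))  # prints 0.67
```

With these samples, three requests are satisfied, two are tolerating, and one is frustrated, so the score is (3 + 2/2) / 6, or about 0.67.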
Step Two: Use distributed tracing to drill down into performance issues
Begin the troubleshooting process by looking at some of the requests that go through the “phones” page, and find a request that looks unusually slow.
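As a rough illustration of what "unusually slow" can mean, the following sketch flags requests on one page whose duration exceeds a multiple of that page's median duration. This is a simple heuristic chosen for the example, not a description of how New Relic selects or flags traces. The `Trace` record, the `slow_traces` helper, the factor of 3, and the sample data are all hypothetical.

```python
from statistics import median
from typing import List, NamedTuple


class Trace(NamedTuple):
    """Hypothetical, simplified record of one request."""
    trace_id: str
    page: str
    duration_ms: float


def slow_traces(traces: List[Trace], page: str, factor: float = 3.0) -> List[Trace]:
    """Return traces for `page` slower than `factor` times the page's median, slowest first."""
    on_page = [t for t in traces if t.page == page]
    if not on_page:
        return []
    baseline = median(t.duration_ms for t in on_page)
    slow = [t for t in on_page if t.duration_ms > factor * baseline]
    return sorted(slow, key=lambda t: t.duration_ms, reverse=True)


if __name__ == "__main__":
    # Made-up sample data: most "phones" requests are fast, two are not.
    sample = [
        Trace("t1", "phones", 120.0),
        Trace("t2", "phones", 130.0),
        Trace("t3", "phones", 110.0),
        Trace("t4", "phones", 140.0),
        Trace("t5", "phones", 2400.0),
        Trace("t6", "phones", 1900.0),
        Trace("t7", "plans", 90.0),
    ]
    for trace in slow_traces(sample, "phones"):
        print(trace.trace_id, trace.duration_ms)
```

The median is used as the baseline rather than the mean so that the slow outliers themselves do not drag the baseline upward and hide the problem. It also shows how a page can look healthy on average while a few requests are very slow.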
First, use New Relic APM to see a list of all the transactions flowing through this page:
In general, this data indicates that the “phones” page on the web portal is performing well, but it’s worth taking a closer look and reviewing some traces that may reveal slow requests that require troubleshooting:
Click into the distributed traces for this page, and you’ll see that New Relic has already filtered the traces to show just the set that’s relevant to the “phones” page on the web portal:
Bingo! You’ve found some unusually slow requests. Select one of them, as shown here, to look into what might be happening:
When you get a detailed view of the selected request, as shown below, you can see that the request started in the web portal service, then went to the promo service, and finally to the inventory service. Note the top right corner of this screenshot, where New Relic APM has flagged this…