API Monitoring: What should you measure?
When it comes to monitoring third-party APIs and web services, what you monitor is as important as how you monitor. Data is useful, but actionable data is where the true value is. Below we've listed the most common and valuable metrics to monitor when relying on third-party API integrations and web services. Accurate monitoring and alerting can provide your business with the data it needs to make decisions about which APIs to use, how to build resilient applications, and where to focus your engineering efforts.
Here are the metrics that we recommend when you start to monitor an API or web service:
Latency is the time a message spends “on the wire”. Here, the lower the number, the better. Latency comes from the connection between your server and the API server, and from any delays along that path. These delays may be the result of network traffic or resource overload, in which case throttling requests can help accommodate the heavy load.
To monitor latency, your application needs to record timestamps for each outgoing request and incoming response, then compare the elapsed time against earlier measurements over a given period. This can still be tricky, as the measured time also includes the server's response time. If available, pinging an endpoint or calling a health-check endpoint can be the best way to receive an accurate latency estimate.
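As a minimal sketch of the timestamp approach, the snippet below times a probe several times and summarizes the samples. The names `sample_latency_ms` and `summarize` are hypothetical, and `probe` stands in for whatever lightweight call you have available, such as a request to a health-check endpoint. The result still includes a small amount of server processing time, so treat it as an estimate.

```python
import statistics
import time
from typing import Callable, Dict, List


def sample_latency_ms(probe: Callable[[], object], samples: int = 5) -> List[float]:
    """Time `probe` several times and return each round trip in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()  # monotonic clock, suited to measuring intervals
        probe()                      # hypothetical: e.g. a GET to a health-check endpoint
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Reduce raw samples to values that are easy to chart and compare over time."""
    return {
        "min_ms": min(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }
```

Storing the summary for each time window makes it possible to compare the current window against previous ones and alert on a sustained increase.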
This evaluation can be useful when positioning servers geographically. By determining which option offers the lowest latency, your business can decide which provider to select. You can also select specific regional provider services if latency turns out to be the true cause of delayed responses, or select different providers if the response time of their resources is the problem. In actual practice, latency and response time will often be combined as a single value.
Where latency takes into account the delays in the network itself, response time is the time it takes a service to respond to a request. This can be harder to track with third-party APIs and web services, as the latency to send and receive data is part of the measured time. You can estimate response time by comparing the total request time across multiple resources on a given API. From this, you can estimate the latency shared between the API's servers and your servers, and subtract it to approximate the true response time.
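One rough way to do this comparison is sketched below. It rests on a stated assumption: the fastest resource on the API does almost no work, so its total time approximates the network latency shared by every resource. The function name `split_latency` and the example paths are hypothetical.

```python
from typing import Dict, Tuple


def split_latency(total_ms: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Split total round-trip times into shared latency and per-resource time.

    Assumption: the fastest resource approximates pure network latency.
    The per-resource values are therefore estimates, not measurements.
    """
    shared_latency = min(total_ms.values())
    processing = {name: total - shared_latency for name, total in total_ms.items()}
    return shared_latency, processing


# Hypothetical median round-trip times, in milliseconds, for three resources.
observed = {"/health": 40.0, "/users": 95.0, "/reports": 340.0}
shared, estimated = split_latency(observed)
```

If one resource stands far above the others after the shared portion is removed, the provider's handling of that resource is the more likely bottleneck than the network.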
The response time has a direct effect on your application's performance. Delays in the response of an API will result in slower interactions for your users. You can avoid this by ensuring your chosen API providers have response time guarantees, or by implementing a solution that uses a fallback API or cached resources when spikes are detected.
The availability of an API can be described as either downtime or uptime. Both are based on the same data, but can tell a different story depending on the context.
Availability is perhaps the easiest metric to keep track of. Downtime errors are recognizable and sometimes expected as an API provider will announce scheduled outages. However, even the most reliable APIs experience unforeseen downtimes. Downtime can be presented as individual events, or as an overall average across a given period. While downtime quotas and assurances such as "99.999% uptime" can be valuable when assessing an API provider, even the smallest downtime can have large impacts on your application.
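To put an uptime guarantee in concrete terms, it helps to convert the percentage into a downtime budget. The sketch below does that arithmetic and also derives a measured uptime from periodic health checks. The function names are hypothetical, and a one-check-per-minute schedule is assumed only for the example.

```python
def allowed_downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Downtime budget implied by an uptime guarantee over a period."""
    return period_seconds * (100.0 - uptime_percent) / 100.0


def measured_uptime_percent(successful_checks: int, total_checks: int) -> float:
    """Uptime as the share of periodic health checks that succeeded."""
    if total_checks == 0:
        raise ValueError("no checks recorded")
    return 100.0 * successful_checks / total_checks


YEAR_SECONDS = 365 * 24 * 60 * 60

# "Five nines" leaves a budget of a little over five minutes per year.
budget = allowed_downtime_seconds(99.999, YEAR_SECONDS)

# One failed check out of 1440 (one per minute for a day) already drops below 99.99%.
daily = measured_uptime_percent(1439, 1440)
```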
Many APIs rely on external providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. As a result, the availability of an individual web service provider also depends on a third party that your application does not directly do business with. Even if the API provider's services are running as expected, the third party's may not be. When large downtimes occur, you will want to have a fallback in place that does not rely on the same underlying provider as the original API.
While similar to downtime in the way it is measured, the uptime of an API can provide insight into business decisions. If you know that one API has better uptime during key business hours for your customers, you can use this metric to move between providers.
Where some stakeholders may respond to downtime when selecting which API provider to drop, others may be more likely to respond to uptime when considering which to select. These values are linked, but they can tell different stories.
It can be easy to forget usage, or consumption, when monitoring APIs. Internal APIs may not require a usage metric, but telemetry into third-party API consumption can aid in making business decisions. Estimating costs when consuming a web service can be difficult without the appropriate data. Consumption can be evaluated as a whole, or in bursts. Some API providers bill on a monthly scale, but some may have rate limits on their pricing tiers that also watch for usage over a smaller time window.
By keeping track of consumption and setting alerts for high usage, you can avoid unnecessary costs. Additionally, recognizing when APIs are not being used can also be beneficial. A lack of consumption is a sign that an API is still part of your codebase, but may not be vital to your application. In this case, you can adjust feature priority and gain insight into the usage of your application.
Consumption is best viewed as a running value that is filterable by a time window. This allows dashboards to provide an overview, as well as granular details about when an API is being used.
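A running total combined with a sliding time window could be tracked along these lines. This is a sketch, not a production counter: the class name `UsageCounter`, the 80% alert ratio, and the limit values are all assumptions chosen for illustration.

```python
import time
from collections import deque
from typing import Deque, Optional


class UsageCounter:
    """Track total API calls plus the number made inside a sliding window."""

    def __init__(self, window_seconds: float, limit: int) -> None:
        self.window_seconds = window_seconds
        self.limit = limit              # hypothetical per-window rate limit
        self.total = 0                  # running value since the counter started
        self._recent: Deque[float] = deque()

    def record(self, now: Optional[float] = None) -> None:
        """Register one outgoing call at time `now` (defaults to the monotonic clock)."""
        now = time.monotonic() if now is None else now
        self.total += 1
        self._recent.append(now)
        self._trim(now)

    def in_window(self, now: Optional[float] = None) -> int:
        """Number of calls made within the last `window_seconds`."""
        now = time.monotonic() if now is None else now
        self._trim(now)
        return len(self._recent)

    def near_limit(self, now: Optional[float] = None, ratio: float = 0.8) -> bool:
        """True once usage in the window reaches `ratio` of the limit."""
        return self.in_window(now) >= self.limit * ratio

    def _trim(self, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._recent and self._recent[0] <= cutoff:
            self._recent.popleft()
```

Calling `record()` next to each outgoing request and checking `near_limit()` before the next one gives an early warning ahead of a provider's rate limit, while `total` feeds the longer-term billing view.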
There are a variety of reasons a request will fail. When a request to a third-party API or web service fails, it may be from user error, API downtime, rate limiting, or a variety of network-related issues. While API failures can sometimes be caused by your application, when it comes to tracking third-party APIs you want to focus primarily on failure rates out of your control.
Tracking failures and determining failure rates can aid in:
- Reporting problems to the API provider
- Deciding between multiple API providers
- Making informed decisions related to fallback scenarios
- Building resiliency around certain resources
Some errors may come from invalid requests. These usually appear as status codes in the 400 range and can tell you that your application needs to adjust internal validation before making a request. Errors that come from server-related issues, like status codes in the 500 range, are a sign that the problem is likely with the API or web service provider.
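To separate failures you can fix from those outside your control, one option is to bucket each outcome by its HTTP status code. The sketch below assumes standard HTTP semantics, where 4xx codes point at the request and 5xx codes point at the server. The function names and category labels are hypothetical, and `None` is used here to mean that no HTTP response arrived at all.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def classify(status: Optional[int]) -> str:
    """Map one request outcome to a failure category."""
    if status is None:
        return "network"        # timeout, DNS failure, connection reset
    if status == 429:
        return "rate_limited"   # Too Many Requests
    if 400 <= status < 500:
        return "client"         # likely a problem with the request itself
    if 500 <= status < 600:
        return "provider"       # likely a problem on the provider's side
    return "ok"                 # 1xx-3xx are treated as success in this sketch


def failure_rates(statuses: Iterable[Optional[int]]) -> Dict[str, float]:
    """Share of all requests that fell into each failure category."""
    outcomes = [classify(status) for status in statuses]
    if not outcomes:
        return {}
    counts = Counter(outcomes)
    return {kind: n / len(outcomes) for kind, n in counts.items() if kind != "ok"}


rates = failure_rates([200, 200, 404, 500, 503, None, 429, 200, 200, 200])
```

The `provider`, `network`, and `rate_limited` buckets are the ones worth reporting to a provider or using to compare providers, while a rising `client` rate suggests a validation problem in your own application.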
Tracking HTTP responses can give you granular details about an individual API, but tracking specific status codes can give you better insight into the type of problems. For example, some API providers will respond with a 200 OK status, even when an error occurred. This false metric may lead you to believe that everything is working as expected, but users may experience problems and your internal logging may tell a different story.
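One way to catch these masked errors is to inspect the response body in addition to the status code. The sketch below assumes the provider returns JSON and signals problems through a top-level `error` or `errors` field. That convention varies between providers, so both the key names and the function name `is_real_success` are assumptions to adapt.

```python
import json
from typing import Tuple


def is_real_success(
    status: int,
    body: str,
    error_keys: Tuple[str, ...] = ("error", "errors"),  # assumed provider convention
) -> bool:
    """Return True only if the status is 2xx and the body reports no error."""
    if not 200 <= status < 300:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # non-JSON body: nothing to inspect, so trust the status code
    if isinstance(payload, dict):
        # A present, non-empty error field counts as a failure despite the 2xx status.
        return not any(payload.get(key) for key in error_keys)
    return True
```

Counting responses where the status is 2xx but this check fails gives a separate metric for masked errors, which can then be compared with your internal error logs.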
Comparing status code metrics from API providers with internal error logs can provide additional insight into the true error rates of the third-party web services your application relies on.
With these metrics in mind, your applications can better handle the unavoidable issues that will arise when relying on third-party integrations.
Measuring all these metrics may sound like a daunting task. Fortunately, some developer tools, like Bearer, can help both with monitoring many of these metrics and with automatically reacting to problems that arise.