How to Recover from API Downtimes and Errors
APIs are stable, until they aren’t. We talk about that often at Bearer. If you control the APIs, it gets easier, but with third-party APIs and integrations it can be more difficult to predict when an outage or incident is about to happen.
As developers, our first instinct when it comes to error handling looks something like this:
- Catch the error
- Log the error
- Display an informative message or status to the code consumer (user, service, etc.)
This handles common use cases, especially when there is a user interface. It does not, however, handle instances where that failure can cascade. Cascading failures are those that cause multiple features, (micro)services, or parts of our infrastructure or app to fail as a result of the original failure. By using existing patterns, like circuit breakers, and by developing a strategy to handle API service interruptions, we can avoid cascading failures and keep our applications running. The focus shifts from handling errors to recovering from errors.
Identifying points of failure with monitoring
Monitoring should be the starting point when identifying failures and service interruptions. We’ve written about what to monitor in the past, and the metrics are more than just error rates. Areas to look for when assessing our API calls include performance metrics such as:
- Unexpected latency spikes
- Failure rate
- Failure status codes
It’s important to make sure our monitoring solution can cover these cases. By setting individual performance baselines for normal operation, we can put together a better view of what constitutes abnormal performance. Viewing this data over a span of time will give us insight into which APIs, or even specific endpoints, to focus on. An engineer on the team should be able to look at the trends, identify the problem, and start working toward a strategy to solve it.
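As a rough illustration of comparing live traffic against a baseline, here is a minimal Python sketch. The function names, the three-standard-deviation threshold, and the choice to count 5xx and 429 responses as failures are all assumptions made for this example, not part of any particular monitoring tool.

```python
from statistics import mean, stdev


def build_baseline(latencies_ms):
    """Summarize normal-operation latency samples as (mean, standard deviation)."""
    return mean(latencies_ms), stdev(latencies_ms)


def is_latency_anomaly(latency_ms, baseline, threshold=3.0):
    """Flag a sample more than `threshold` standard deviations above the mean.

    The default threshold of 3.0 is an assumption; tune it per API or endpoint.
    """
    average, deviation = baseline
    return latency_ms > average + threshold * deviation


def failure_rate(status_codes):
    """Share of responses that are 5xx or 429 (an assumed definition of failure)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if code >= 500 or code == 429)
    return failures / len(status_codes)
```

Keeping one baseline per API, or per endpoint, lets each dependency be judged against its own normal behavior rather than a global average.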
APIs that are more mission-critical, and those that have a higher anomaly rate, should be the first targets of a failure recovery strategy.
Implementing a failure recovery strategy
Let’s consider a scenario where we rely on shipping address validation in our checkout process. We are using one of the various address validation APIs, such as the USPS API or SmartyStreets, to ensure that our customers enter an address we can actually deliver to. Now let’s assume that this service occasionally fails. Maybe it is down intermittently. Maybe it has spikes in response time.
What can we do about it? Start by considering options and their trade-offs:
- Is the feature necessary for our app to continue working? Can we bypass it and provide limited functionality instead?
- Can we delay the interaction and retry the request?
- Can we make use of an alternative API?
Depending on the value this API dependency provides, it may be best to allow the user to fill out their address without auto-completion and validation. In a pinch, such as when we catch this for the first time in production, the best immediate option would be to remove the service. That isn’t a long-term solution.
Remediation is the act of assessing and fixing a problem to prevent further damage. This style of failure recovery and error management acknowledges that problems will happen, and takes the approach of causing the least amount of damage to the overall ecosystem. For an API call, this means recognizing a problem (or anomaly, as we call them at Bearer) and creating a strategy to handle those types of anomalies whenever possible.
Let’s explore some of the approaches we can take to react to inconsistencies in third-party APIs.
Retry failed requests
Implementing automatic retries in a request library is fairly straightforward. This remediation approach works well within a strict set of rules. We need to consider:
- What kinds of errors warrant a retry? For example, we wouldn’t want to retry a poorly formed client request (such as a 400), but we would want to retry gateway errors (such as a 502, 503, or 504) in the event that there was a brief outage. Checking the HTTP status code is a good place to start.
- What kind of backoff strategy would make sense for the interaction? In the address validation example, setting a long retry wait period directly impacts the user experience.
- Are the failures coming from rate-limiting? If so, we should check the headers to ensure an appropriate waiting period before retrying.
Retries are a great low-impact option for minor problems. They can be the first remediation strategy we use before moving on to a more complex or drastic option.
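The considerations above can be sketched in Python as follows. This is only an illustration: `send` is a hypothetical callable that performs the request and returns a `(status_code, headers, body)` tuple, and the set of retryable status codes, the attempt limit, and the delays are assumptions to adjust for each API.

```python
import random
import time

# Assumed set of statuses worth retrying: rate limiting and gateway errors.
RETRYABLE_STATUSES = {429, 502, 503, 504}


def call_with_retries(send, max_attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, headers, body
        # Honor Retry-After when it is given in seconds. The header may also
        # hold an HTTP date, which this sketch does not parse.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = int(retry_after)
        else:
            # Exponential backoff with a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
        sleep(delay)
```

For a user-facing call like address validation, a small `max_attempts` and `base_delay` keep the total wait short. A background job could afford longer waits.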
Custom timeouts and immediate cancellations
Another option is to implement stricter timeouts for API requests that are known to have problems that result in high latency. Some APIs have higher-than-average latency all the time, but for others, a delayed response time can be a precursor to larger problems. By limiting the timeout of a request, we can stop calls from going through before the delay begins to affect the quality of our service.
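A strict timeout might look like the following Python sketch, which uses the standard library's `urllib.request.urlopen`. The function name, the one-second default, and the choice to return `None` on a timeout are assumptions for illustration. Note that this timeout applies to individual blocking socket operations, so it is not a hard limit on the total request time.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout_s=1.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the request timed out.

    `opener` is injectable so the behavior can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            return response.read()
    except socket.timeout:
        return None  # timed out while reading the response
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.timeout):
            return None  # timed out while connecting
        raise  # other failures are not hidden by this sketch
```

Returning `None` lets the caller decide what a missing answer means, for example skipping validation and letting the user continue.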
We can also choose to immediately cancel requests that we know will fail. Ideally, we would re-route these requests to a fallback API as mentioned later in this article, but when that isn’t an option, immediate cancellation can be valuable. Immediately canceling requests we know will fail is a core component of basic circuit breakers.
How does intentionally causing a request to fail provide value? In our address autocompletion example, the delay could hurt the user experience by interrupting the flow of the form, by drawing the user’s attention away from their task, or by filling in information well after the user has moved on to a new portion of the page. As we mentioned at the beginning of this article, offering the best version of the user experience we can, even if limited, is better than offering a broken one.
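To make immediate cancellation concrete, here is a minimal circuit breaker sketch in Python. The class names, the failure threshold, and the waiting period are assumptions chosen for the example. A production breaker would usually also need thread safety and a finer-grained idea of what counts as a failure.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is cancelled because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Cancel immediately instead of sending a request we expect to fail.
                raise CircuitOpenError("request cancelled: circuit is open")
            # Waiting period over: let one trial request through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial request, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller can catch `CircuitOpenError` and degrade gracefully, for example by accepting the address without validation.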
The best option, in terms of continuing the service uninterrupted, is to implement a fallback for the request. This could be an alternative API that provides the same or similar solution, or it could be cached data. When implementing a fallback solution, consider the following:
- What is the likelihood that the fallback API will have problems at the same time that the original API is down? Consider choosing providers that run on different infrastructures. This reduces the risk that a single event, such as an AWS outage, takes down both.
- Can we provide a decent approximation of the service with cached data? This doesn’t work for our address validation example, but it can for many data-source APIs.
- How long can we reasonably wait before swapping back to the original source? Having a good circuit breaker-style strategy in place can help make this decision.
Fallback APIs are a great option. They can be the second-choice API that might have been more expensive or less reliable than the primary API. If the cost of keeping them around (by usage vs. flat rate per month) is reasonable, we can write some logic to handle any format differences, and swap the requests when an anomaly with the primary API is detected. If we are using a circuit breaker, we can send requests to the fallback while the breaker is in the open state, rather than simply waiting for the waiting period to end.
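Here is a minimal Python sketch of swapping to a fallback and smoothing over format differences. Everything in it is hypothetical: `primary` and `fallback` stand in for wrappers around two address validation providers, and the response fields `is_valid` and `status` are invented for the example, not taken from any real API.

```python
def validate_address(address, primary, fallback):
    """Try the primary validator, then the fallback; return one common shape.

    `primary` and `fallback` are hypothetical callables wrapping two providers.
    """
    try:
        raw = primary(address)
        # Assumed primary format: a boolean field named "is_valid".
        return {"valid": raw["is_valid"], "source": "primary"}
    except Exception:
        raw = fallback(address)
        # Assumed fallback format: a status string instead of a boolean.
        return {"valid": raw["status"] == "deliverable", "source": "fallback"}
```

Normalizing both responses into one shape keeps the rest of the checkout code unaware of which provider answered.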
While not a direct remediation strategy, notifications should be in our error handling strategy. When a configured remediation occurs, logging the event is often enough. In that case, everything failed as expected.
When the remediation itself fails, that is when a notification needs to go out to the necessary stakeholders. This notification does two things:
- It makes sure the issue gets on the radar of the development team.
- It issues notice that our remediation strategy is missing something.
How we respond to this is valuable. The entire approach to remediating anomalies is based on data and assumptions. We assume a failure will occur with one of our APIs. Then, we plan workarounds that take over when that failure occurs.
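The split between logging and notifying can be sketched in Python like this. The names are assumptions: `primary` and `remediation` are hypothetical callables, and `notify` stands in for whatever alerting channel the team uses.

```python
import logging

logger = logging.getLogger("remediation")


def run_with_remediation(primary, remediation, notify):
    """Run `primary`; on failure run `remediation`; alert only if that fails too."""
    try:
        return primary()
    except Exception as primary_error:
        try:
            result = remediation()
        except Exception as remediation_error:
            # The remediation itself failed: this needs human attention.
            notify(f"remediation failed: {remediation_error!r} (after {primary_error!r})")
            raise
        # Everything failed as expected: a log entry is enough.
        logger.warning("primary failed (%r); remediation succeeded", primary_error)
        return result
```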
Double down on remediations when needed
A notification like this tells us where our layer of coverage needs more work. While the likelihood of both our primary API and our remediation strategy failing is low, there may be instances where another level of remediation is useful. In the fallback portion, we mentioned adding fallbacks as part of the “open” state. This is an example of a multi-layered remediation approach. Not only does it stop failing requests to an API before they are sent, but it also implements a second option to handle new requests.
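As a sketch of this layered approach, the Python class below sends requests to a fallback whenever the primary's circuit is open. The class name, threshold, and waiting period are assumptions for illustration, and `primary` and `fallback` are hypothetical callables.

```python
import time


class FallbackBreaker:
    """Routes calls to `fallback` while the primary's circuit is open."""

    def __init__(self, primary, fallback, failure_threshold=3,
                 reset_after_s=30.0, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def _is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Waiting period over: try the primary again, and reopen the
            # circuit on the very next failure.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def call(self, *args):
        if self._is_open():
            # Layer one: never send a request we expect to fail.
            return self.fallback(*args)
        try:
            result = self.primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            # Layer two: still answer this request from the fallback.
            return self.fallback(*args)
        self.failures = 0
        return result
```

If the fallback can also fail, its exception propagates to the caller, which is the point where a notification would be appropriate.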