How Rust Lets Us Monitor 30k API calls/min
At Bearer, we are a polyglot engineering team. Both in spoken languages and programming languages. Our stack is made up of services written in Node.js, Ruby, Elixir, and a handful of others in addition to all the languages our agent library supports. Like most teams, we balance using the right tool for the job with using the right tool for the time. Recently, we reached a limitation in one of our services that led us to transition that service from Node.js to Rust. This post goes into some of the details that caused the need to change languages, as well as some of the decisions we made along the way.
A bit of context
We are building a solution to help developers monitor their APIs. Every time a customer’s application calls an API, a log gets sent to us where we monitor and analyze it.
At the time of the issue, we were processing an average of 30k API calls per minute. That's a lot of API calls made across all our customers. We split the process into two key parts: Log ingestion and log processing.
We originally built the ingestion service in Node.js. It would receive the logs, communicate with an elixir service to check customer access rights, check rate limits using Redis, and then send the log to CloudWatch. There, it would trigger an event to tell our processing worker to take over.
We capture information about the API call, including the payloads (both the request and response) of every call sent from a user's application. These are currently limited to 1MB, but that is still a large amount of data to process. We send and process everything asynchronously and the goal is to make the information available to the end-user as fast as possible.
We hosted everything on AWS Fargate, a serverless management solution for Elastic Container Service (ECS), and set it to autoscale after 4000 req/min. Everything was great! Then, the invoice came 😱.
AWS invoices based on CloudWatch storage. The more you store, the more you pay.
Fortunately, we had a backup plan.
Kinesis to the rescue?
Instead of sending the logs to CloudWatch, we would use Kinesis Firehose. Kinesis Firehose is basically a Kafka equivalent provided by AWS. It allows us to deliver a data stream in a reliable way to several destinations. With very few updates to our log processing worker, we were able to ingest logs from both CloudWatch and Kinesis Firehose. With this change, daily costs would drop to about 0.6% of what they were before.
The updated service now passed the log data through Kinesis and into s3 which triggers the worker to take over with the processing task. We rolled the change out and everything was back to normal... or we thought. Soon after, we started to notice some anomalies on our monitoring dashboard.
We were Garbage Collecting, a lot. Garbage collection (GC) is a way for some languages to automatically free up memory that is no longer in use. When that happens, the program pauses. This is known as a GC pause. The more writes you make to memory, the more garbage collection needs to happen and as a result, the pause time increases. For our service, these pauses were growing high enough that they caused the servers to restart and put stress on the CPU. When this happens, it can look like the server is down—because it temporarily is—and our customers started to see 5xx errors for roughly 6% of the logs our agent was trying to ingest.
Below we can see the pause time and pause frequency of the garbage collection:
In some instances, the pause time breached 4 seconds (as shown on the left), with up to 400 pauses per minute (as shown on the right) across our instances.
With our backup plan no longer an option, we moved on to new solutions. First, we looked at those with the easiest transition path.
As mentioned earlier, we are checking the customer access rights using an Elixir service. This service is private and only accessible from within our Virtual Private Cloud (VPC). We have never experienced any scalability issues with this service and most of the logic was already there. We could simply send the logs to Kinesis from within this service and skip over the Node.js service layer. We decided it was worth a try.
We developed the missing parts and tested it. It was better, but still not great. Our benchmarks showed that there were still high levels of Garbage Collecting, and we were still returning 5xx to our users when consuming the logs. At this point, the heavy load triggered a (now resolved) issue with one of our elixir dependencies.
We considered Golang as well. It would have been a good candidate, but in the end, it is another Garbage Collected Language. While likely more efficient than our previous implementation, as we scale there is a high chance we'd run into similar problems. With these limitations in mind, we needed a better option.
Re-architecting with Rust at the core
In both our original implementation and our backup, the core issue remained the same: garbage collection. The solution was to move to a language with better memory management and no garbage collection. Enter Rust.
Rust isn't a garbage-collected language. Instead, it relies on a concept called ownership.
Ownership is Rust’s most unique feature, and it enables Rust to make memory safety guarantees without needing a garbage collector.
— The Rust Book
Ownership is the concept that often makes Rust difficult to learn and write, but also what makes it so well suited for situations like ours. Each value in Rust has a single owner variable and as a result a single point of allocation in memory. Once that variable goes out of scope the memory is immediately returned.
Since the code required to ingest the logs is quite small, we decided to give it a try. To test this we addressed the very thing that we had issues with—sending large amounts of data to Kinesis.
Our first benchmarks proved to be very successful.
From that point, we were pretty confident that Rust could be the answer and we decided to flesh out the prototype into a production-ready application.
Over the course of these experiments, rather than directly replacing the original Node.js service with Rust, we restructured much of the architecture surrounding log ingestion. The core of the new service is an Envoy proxy with the Rust application as a sidecar.
Now, when the Bearer Agent in a user's application sends log data to Bearer, it goes into the Envoy proxy. Envoy looks at the request and communicates with Redis to check things like rate limits, authorization details, and usage quotas. Next, the Rust application running alongside Envoy prepares the log data and passes it through Kinesis into an s3 bucket for storage. S3 then triggers our worker to fetch and process the data so Elastic Search can index it. At this point, our users can access the data in our dashboard.
What we found was that with fewer—and smaller—servers, we are able to process even more data without any of the earlier issues.
If we look at the latency numbers for the Node.js service, we can see peaks with an average response time nearing 1700ms.
With the Rust service implementation, the latency dropped to below 90ms, even at its highest peak, keeping the average response time below 40ms.
The original Node.js application used about 1.5GB of memory at any given time, while the CPUs ran at around 150% load. The new Rust service used about 100MB of memory and only 2.5% of CPU load.
As with most startups, we move fast. Sometimes the best solution at the time isn't the best solution forever. This was the case with Node.js. It allowed us to move forward, but as we grew we also outgrew it. As we started to handle more and more requests, we needed to make our infrastructure evolve to address the new requirements. While this process started with a fix that merely replaced Node.js with Rust, it led to a rethinking of our log ingestion service as a whole.
We still use a variety of languages throughout our stack, including Node.js, but will now consider Rust for new services where it makes sense.