I recently helped a client launch an AI code reviewer called Evolua [1]. Evolua is built entirely with serverless technologies and leverages Bedrock. Through Bedrock, we can access the Claude models and take advantage of its cross-region inference support, among other things. In this post, I want to share some lessons from building Evolua and offer a high-level overview of our system. But first, here's some context on what we've built. Here [2] is a quick demo of Evolua.

Architecture

This is a high-level view of our architecture. Here are some noteworthy points:
Why Bedrock?

We chose Bedrock over alternatives (like self-hosting models) primarily for security reasons. Bedrock gives a written guarantee that:
See the official FAQ page here [4]. These guarantees are critical because we process customer source code files, which may contain sensitive information and trade secrets. Other reasons we chose Bedrock include:
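For context, here's roughly what a Bedrock call looks like from a Lambda handler. This is a simplified sketch, not Evolua's actual code: the model ID is a cross-region inference profile, and the system prompt and parameters are placeholders.

```typescript
// Minimal sketch: calling Claude on Bedrock from a Node.js Lambda handler.
// The model ID is a cross-region inference profile (note the "us." prefix),
// which lets Bedrock route the request across regions for more throughput.
import {
  BedrockRuntimeClient,
  ConverseCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

export async function reviewFile(filePath: string, fileContent: string) {
  const command = new ConverseCommand({
    modelId: "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
    system: [
      { text: "You are a code reviewer. Report issues with line numbers." },
    ],
    messages: [
      {
        role: "user",
        content: [{ text: `Review this file (${filePath}):\n\n${fileContent}` }],
      },
    ],
    inferenceConfig: { maxTokens: 4096, temperature: 0 },
  });

  const response = await client.send(command);
  // the Converse API returns the assistant message as an array of content blocks
  return response.output?.message?.content?.[0]?.text ?? "";
}
```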
However, Bedrock is not without issues. For instance, we were affected by a widely-reported issue [5] where AWS reset many customers' Bedrock quotas to 0, rendering Bedrock unavailable. The incident delayed our launch and prompted us to implement a fallback mechanism.

Lessons from building an AI-powered code reviewer

We often have to pass large text files to the LLM for code reviews, sometimes processing many of them as part of a single pull request (PR). This differs from typical chatbot use cases: the size and quantity of the files impact both costs and limits.

LLMs are still quite expensive.

Reviewing PRs with thousands of lines can become expensive quickly, especially with models like Sonnet. While batching mode reduces costs, it adds complexity and can affect performance. This makes it difficult to build a sustainable and competitive business model. Without VC funding to absorb the cost of growth, it's a tough balancing act between:
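To put rough numbers on the cost side, here's a back-of-the-envelope sketch. The prices are illustrative (roughly what Claude 3.5 Sonnet cost per token at the time of writing) and the per-file token counts are made up; check the current Bedrock pricing page before relying on them.

```typescript
// Back-of-the-envelope cost estimate for reviewing a PR with Claude 3.5 Sonnet.
// Illustrative prices: ~$3 / $15 per million input / output tokens.
const INPUT_PRICE_PER_1K = 0.003; // USD per 1,000 input tokens
const OUTPUT_PRICE_PER_1K = 0.015; // USD per 1,000 output tokens

function estimateReviewCost(
  files: { inputTokens: number; outputTokens: number }[]
): number {
  return files.reduce(
    (total, f) =>
      total +
      (f.inputTokens / 1000) * INPUT_PRICE_PER_1K +
      (f.outputTokens / 1000) * OUTPUT_PRICE_PER_1K,
    0
  );
}

// e.g. a 20-file PR, ~4k input tokens and ~1k output tokens per file
const cost = estimateReviewCost(
  Array.from({ length: 20 }, () => ({ inputTokens: 4_000, outputTokens: 1_000 }))
);
console.log(`~$${cost.toFixed(2)} per review`); // ~$0.54
```

Half a dollar per review sounds small, but it scales linearly with PR size and review frequency, which is exactly the balancing act described above.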
Mind the model limits.

Claude 3.5 Sonnet, for instance, has a context window of 200k tokens and a maximum output of 8,192 tokens. These are sufficient in most cases but still present challenges. For example:
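To make these limits concrete, here's a rough sketch of checking whether a file fits in a single call and splitting it when it doesn't. The chars-per-token heuristic, the prompt overhead, and the chunk size are all illustrative assumptions, not what Evolua actually uses.

```typescript
// Rough guard against the model limits above. The chars/4 token estimate is a
// crude heuristic for illustration; a real implementation would use a tokenizer.
const CONTEXT_WINDOW = 200_000; // Claude 3.5 Sonnet input context
const MAX_OUTPUT = 8_192; // maximum output tokens
const PROMPT_OVERHEAD = 2_000; // allowance for system prompt + instructions

const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function fitsInOneCall(fileContent: string): boolean {
  return estimateTokens(fileContent) + PROMPT_OVERHEAD + MAX_OUTPUT <= CONTEXT_WINDOW;
}

// Very large files have to be split into chunks, reviewed separately,
// and the findings merged afterwards.
function splitIntoChunks(fileContent: string, maxTokensPerChunk = 50_000): string[] {
  const maxChars = maxTokensPerChunk * 4;
  const chunks: string[] = [];
  for (let i = 0; i < fileContent.length; i += maxChars) {
    chunks.push(fileContent.slice(i, i + maxChars));
  }
  return chunks;
}
```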
Furthermore, for Claude 3.5 Sonnet, both Bedrock and Anthropic have a default throughput limit of a measly 50 inferences per minute. You can improve this throughput limit with cross-region inference [6] and request an account-level limit raise. You can also fall back to a less powerful model, such as Claude 3 Sonnet or Claude 3.5 Haiku, which have higher throughput limits.

Latency challenges.

As you can see from the following trace in Lumigo [7], Bedrock takes 20 to 30 seconds to review one file. Given the limited throughput, and to avoid having one user exhaust all the available throughput, we limit the number of concurrent requests to Bedrock per user (see the sketch at the end of this section). As such, large PRs can take several minutes to process.

This presents a problem for user experience. It may take so long that the Lambda function times out (Lambda has a max timeout of 15 minutes) and the user doesn't receive any results. You may also need to add heartbeat comments to the PR to inform the user that you're still working on the review.

Hallucinations are still a problem.

Occasionally, the LLM identifies issues with non-existent line numbers or unrelated problems. These hallucinations are rare, but they do still happen. As a mitigation, you can use another LLM (perhaps a cheaper one) to verify the output, at the cost of extra latency and expense.

The LLM is the easy part.

One of my biggest takeaways is that the LLM is the easy part. While selecting the right model and crafting an effective prompt is important, it's a small piece of the overall system. Most of our effort went into building the developer experience around the LLM. This includes:
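Coming back to the throughput and latency points above, here's a simplified sketch of two of the mitigations: capping concurrent Bedrock calls per review, and falling back to a cheaper model when throttled. The concurrency limit, the model IDs, and the use of the p-limit package are illustrative choices, not necessarily what Evolua does in production.

```typescript
// Sketch: cap concurrent Bedrock calls per review and fall back to a cheaper
// model when the primary model is throttled.
import pLimit from "p-limit";
import {
  BedrockRuntimeClient,
  ConverseCommand,
  ThrottlingException,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({});
const limit = pLimit(3); // at most 3 files reviewed concurrently per user

const PRIMARY_MODEL = "us.anthropic.claude-3-5-sonnet-20240620-v1:0";
const FALLBACK_MODEL = "us.anthropic.claude-3-5-haiku-20241022-v1:0";

async function callModel(modelId: string, prompt: string) {
  const res = await client.send(
    new ConverseCommand({
      modelId,
      messages: [{ role: "user", content: [{ text: prompt }] }],
      inferenceConfig: { maxTokens: 4096 },
    })
  );
  return res.output?.message?.content?.[0]?.text ?? "";
}

async function reviewWithFallback(prompt: string) {
  try {
    return await callModel(PRIMARY_MODEL, prompt);
  } catch (err) {
    // when we hit the throughput limit, retry once with a cheaper model
    if (err instanceof ThrottlingException) {
      return callModel(FALLBACK_MODEL, prompt);
    }
    throw err;
  }
}

export const reviewPullRequest = (prompts: string[]) =>
  Promise.all(prompts.map((p) => limit(() => reviewWithFallback(p))));
```

In practice you would likely also want retries with backoff before falling back, and a way to surface partial results if some files fail.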
The overall developer experience is the true differentiator for the product, not the LLM or the prompts. With so much hype around AI, it's easy to overlook the importance of the "app" in an "AI-powered app".

Why AI code reviewers matter

Both Sam and I use AI tools heavily in our development workflow, including ChatGPT, Claude, Cursor, Copilot and Amazon Q. While these tools are incredibly useful, they often produce suboptimal or insecure code. Luckily, Evolua identifies these coding issues as soon as we open the PR! If you've been using AI to write code, you should try Evolua today to catch bad code before it reaches production. ;-)

Links

[2] Demo of Evolua
[3] EventBridge best practice: why you should wrap events in event envelopes
[4] Bedrock FAQs
[5] Reddit: Why did AWS reset everyone's Bedrock Quota to 0? All production apps are down
[6] Improve throughput with cross-region inference
[7] Lumigo, the best observability platform for serverless architectures