I recently helped a client launch an AI code reviewer called Evolua [1]. Evolua is built entirely with serverless technologies and leverages Bedrock. Through Bedrock, we can access the Claude models and take advantage of its cross-region inference support, among other things. In this post, I want to share some lessons from building Evolua and offer a high-level overview of our system. But first, here’s some context on what we’ve built. Here [2] is a quick demo of Evolua:

Architecture

This is a high-level view of our architecture. Here are some noteworthy points:
Why Bedrock?

We chose Bedrock over alternatives (like self-hosting models) primarily for security reasons. Bedrock gives a written guarantee that:
See the official FAQ page here [4]. These guarantees are critical because we process customer source code files, which may contain sensitive information and trade secrets. Other reasons we chose Bedrock include:
However, Bedrock is not without issues. For instance, we were affected by a widely reported issue [5] where AWS reset many customers' Bedrock quotas to 0, rendering Bedrock unavailable. The incident delayed our launch and prompted us to implement a fallback mechanism.

Lessons from building an AI-powered code reviewer

We often have to pass large text files to the LLM for code reviews, sometimes processing many of them as part of a single pull request (PR). This differs from typical chatbot use cases: the size and quantity of the files impact both costs and limits.

LLMs are still quite expensive.

Reviewing PRs with thousands of lines can become expensive quickly, especially with models like Sonnet. While batching mode reduces costs, it adds complexity and can affect performance. This makes it difficult to build a sustainable and competitive business model. Without VC funding to absorb the cost of growth, it’s a tough balancing act between:
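To make the cost pressure concrete, here is a minimal back-of-the-envelope sketch of what reviewing a PR file-by-file might cost. The per-token prices and the 4-characters-per-token heuristic are illustrative assumptions, not Evolua's actual numbers; always check the current Bedrock pricing page.

```python
# Rough cost estimate for reviewing a PR with a Sonnet-class model,
# assuming one inference call per file. Prices are assumptions for
# illustration only; check the Bedrock pricing page for current rates.

INPUT_PRICE_PER_1K = 0.003   # USD per 1k input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # USD per 1k output tokens (assumed)

def estimate_review_cost(files: list[str], avg_output_tokens: int = 2000) -> float:
    """Estimate the cost of reviewing each file in a separate call."""
    total = 0.0
    for content in files:
        # ~4 characters per token is a common rough heuristic for English/code
        input_tokens = len(content) / 4
        total += (input_tokens / 1000) * INPUT_PRICE_PER_1K
        total += (avg_output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return total

# e.g. a PR with 10 files of ~20k characters each
print(f"${estimate_review_cost(['x' * 20_000] * 10):.2f}")
```

At these assumed rates, even a modest PR costs tens of cents per review, which adds up fast across many users and many pushes to the same PR.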
Mind the model limits.

For example, Claude 3.5 Sonnet has a context window of 200k tokens and a maximum output of 8,192 tokens. These are sufficient in most cases but still present challenges. For example,
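One practical consequence of these limits is that you have to check whether a file even fits in the model's context before sending it, and split it if it doesn't. Here is a minimal sketch of that guard; the chunking strategy, the prompt-overhead figure, and the 4-characters-per-token heuristic are illustrative assumptions (a real system would use a proper tokenizer and split on syntactic boundaries).

```python
# Guard against Claude 3.5 Sonnet's limits before sending a review request.
# The 200k context window and 8,192 max output tokens are from the model's
# published limits; everything else here is an illustrative assumption.

CONTEXT_WINDOW = 200_000
MAX_OUTPUT_TOKENS = 8_192
PROMPT_OVERHEAD = 2_000  # assumed budget for system prompt + instructions

def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, not a real tokenizer

def fits_in_context(file_content: str) -> bool:
    budget = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS - PROMPT_OVERHEAD
    return rough_tokens(file_content) <= budget

def chunk_file(file_content: str, chunk_tokens: int = 50_000) -> list[str]:
    """Split an oversized file into chunks to review in separate calls."""
    chunk_chars = chunk_tokens * 4
    return [file_content[i:i + chunk_chars]
            for i in range(0, len(file_content), chunk_chars)]
```

Splitting a file across calls loses cross-chunk context, so it's a trade-off: the review of one chunk can't see issues that span chunk boundaries.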
Furthermore, for Claude 3.5 Sonnet, both Bedrock and Anthropic have a default throughput limit of a measly 50 inferences per minute. You can improve this throughput limit with cross-region inference [6] and by requesting an account-level limit raise. You can also fall back to a less powerful model, such as Claude 3 Sonnet or Claude 3.5 Haiku, which have higher throughput limits.

Latency challenges.

As you can see from the following trace in Lumigo [7], Bedrock takes 20 to 30 seconds to review one file. Given the limited throughput, and to avoid having one user exhaust all available throughput, we limit the number of concurrent requests to Bedrock per user. As such, large PRs can take several minutes to process. This presents a problem for the user experience. It may take so long that the Lambda function times out (Lambda has a maximum timeout of 15 minutes) and the user doesn’t receive any results. You may also need to add heartbeat comments to the PR to let the user know that you’re still working on the review.

Hallucination is still a problem.

Occasionally, the LLM identifies issues with non-existent line numbers or flags unrelated problems. These hallucinations are rare, but they do still happen. As a mitigation, you can use another LLM (perhaps a cheaper one) to verify the output, at the cost of extra latency and expense.

The LLM is the easy part.

One of my biggest takeaways is that the LLM is the easy part. While selecting the right model and crafting an effective prompt are important, they are a small piece of the overall system. Most of our effort went into building the developer experience around the LLM. This includes:
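The model-fallback idea above can be sketched as a simple ordered retry: try the preferred model first, and move down the list when Bedrock throttles the request. The model IDs below are the cross-region inference profile IDs as an assumption, and `invoke` stands in for a call to the boto3 bedrock-runtime `converse` API; this is a sketch of the pattern, not Evolua's actual implementation.

```python
# Fall back to a higher-throughput model when Bedrock throttles a request.
# Model IDs are illustrative cross-region inference profile IDs; verify the
# ones available in your account via the Bedrock console.

class ThrottlingError(Exception):
    """Stands in for a boto3 ClientError with code ThrottlingException."""

# Ordered by preference: Claude 3.5 Sonnet first, then a cheaper,
# higher-throughput fallback.
MODEL_IDS = [
    "us.anthropic.claude-3-5-sonnet-20240620-v1:0",
    "us.anthropic.claude-3-5-haiku-20241022-v1:0",
]

def review_with_fallback(invoke, prompt: str) -> str:
    """invoke(model_id, prompt) -> str; raises ThrottlingError when throttled."""
    last_error = None
    for model_id in MODEL_IDS:
        try:
            return invoke(model_id, prompt)
        except ThrottlingError as e:
            last_error = e  # throttled: try the next model in the list
    raise last_error  # every model was throttled
```

A nice property of this shape is that the fallback order doubles as a cost control: the cheaper model only runs when the preferred one is unavailable.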
The overall developer experience is the true differentiator for the product, not the LLM or the prompts. With so much hype around AI, it’s easy to overlook the importance of the “app” in an “AI-powered app.”

Why AI code reviewers matter

Both Sam and I use AI tools heavily in our development workflow, including ChatGPT, Claude, Cursor, Copilot and Amazon Q. While these tools are incredibly useful, they often produce suboptimal or insecure code. Luckily, Evolua identifies these coding issues as soon as we open the PR! If you’ve been using AI to write code, you should try Evolua today to catch bad code before it reaches production. ;-)

Links

[2] Demo of Evolua
[3] EventBridge best practice: why you should wrap events in event envelopes
[4] Bedrock FAQs
[5] Reddit: Why did AWS reset everyone’s Bedrock Quota to 0? All production apps are down
[6] Improve throughput with cross-region inference
[7] Lumigo, the best observability platform for serverless architectures