I recently helped a client launch an AI code reviewer called Evolua [1]. Evolua is built entirely with serverless technologies and leverages Bedrock. Through Bedrock, we can access the Claude models and take advantage of Bedrock's cross-region inference support, among other things. In this post, I want to share some lessons from building Evolua and offer a high-level overview of our system. But first, here's some context on what we've built. Here [2] is a quick demo of Evolua:

Architecture

This is a high-level view of our architecture. Here are some noteworthy points:
Why Bedrock?

We chose Bedrock over alternatives (like self-hosting models) primarily for security reasons. Bedrock gives a written guarantee that:
See the official FAQ page here [4]. These guarantees are critical because we process customer source code files, which may contain sensitive information and trade secrets. Other reasons we chose Bedrock include:
However, Bedrock is not without issues. For instance, we were affected by a widely-reported issue [5] where AWS reset many customers' Bedrock quotas to 0, rendering Bedrock unavailable. The incident delayed our launch and prompted us to implement a fallback mechanism.

Lessons from building an AI-powered code reviewer

We often have to pass large text files to the LLM for code reviews, sometimes processing many of them as part of a single pull request (PR). This differs from the typical chatbot use case, and the size and quantity of the files impact both costs and limits.

LLMs are still quite expensive.

Reviewing PRs with thousands of lines can become expensive quickly, especially with models like Sonnet. While batch mode reduces costs, it adds complexity and can affect performance. This makes it difficult to build a sustainable and competitive business model. Without VC funding to absorb the cost of growth, it's a tough balancing act between:
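To make the fallback mechanism mentioned above concrete, here is a minimal sketch of the pattern: try a prioritised list of model IDs and move to the next one when the call fails with a retryable error such as throttling or an exhausted quota. The model IDs, names, and error class are illustrative assumptions, not Evolua's actual code; with boto3, `invoke` would wrap the `bedrock-runtime` client's `invoke_model` call and translate throttling errors into `RetryableModelError`.

```python
class RetryableModelError(Exception):
    """Stand-in for throttling / quota errors from the model provider."""

# Hypothetical priority list: strongest model first, cheaper and
# higher-throughput models as fallbacks.
MODEL_PRIORITY = [
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "anthropic.claude-3-5-haiku-20241022-v1:0",
]

def invoke_with_fallback(invoke, model_ids=MODEL_PRIORITY):
    """invoke(model_id) performs the actual Bedrock call.

    Returns (model_id, response) from the first model that succeeds.
    """
    last_error = None
    for model_id in model_ids:
        try:
            return model_id, invoke(model_id)
        except RetryableModelError as err:
            last_error = err  # throttled or quota exhausted: try the next model
    raise last_error
```

Keeping the loop independent of boto3 also makes it trivial to unit-test with a fake `invoke`.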
Mind the model limits.

For example, Claude 3.5 Sonnet has a context window of 200k tokens and a maximum output of 8,192 tokens. These are sufficient in most cases but still present challenges. For example,
Furthermore, for Claude 3.5 Sonnet, both Bedrock and Anthropic have a default throughput limit of a measly 50 inferences per minute. You can improve this throughput limit with cross-region inference [6] and by requesting an account-level limit raise. You can also fall back to a less powerful model, such as Claude 3 Sonnet or Claude 3.5 Haiku, which have higher throughput limits.

Latency challenges.

As you can see from the following trace in Lumigo [7], Bedrock takes 20 to 30s to review one file. Given the limited throughput, and to avoid having one user exhaust all available throughput, we limit the number of concurrent requests to Bedrock per user. As such, large PRs can take several minutes to process.

This presents a problem for user experience. The review may take so long that the Lambda function times out (Lambda has a max timeout of 15 minutes) and the user doesn't receive any results. You may also need to add heartbeat comments to the PR to inform the user that you're still working on the review.

Hallucinations are still a problem.

Occasionally, the LLM identifies issues with non-existent line numbers or flags unrelated problems. These hallucinations are rare, but they do still happen. As a mitigation, you can use another LLM (perhaps a cheaper one) to verify the output, at the cost of extra latency and expense.

The LLM is the easy part.

One of my biggest takeaways is that the LLM is the easy part. While selecting the right model and crafting an effective prompt are important, they are a small piece of the overall system. Most of our effort went into building the developer experience around the LLM. This includes:
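Much of that surrounding plumbing is ordinary, deterministic code. For instance, before reaching for a second LLM to verify the output, you can cheaply filter out hallucinated line numbers by checking each reported finding against the file under review. This is an illustrative sketch (the names and finding shape are assumptions, not Evolua's actual code):

```python
def filter_findings(findings, file_text):
    """Drop findings whose line number doesn't exist in the reviewed file.

    findings: list of dicts like {"line": int, "message": str}.
    """
    max_line = len(file_text.splitlines())
    return [f for f in findings if 1 <= f["line"] <= max_line]

findings = [
    {"line": 2, "message": "possible SQL injection"},
    {"line": 999, "message": "unused variable"},  # hallucinated line number
]
valid = filter_findings(findings, "import os\nquery = input()\n")
# only the first finding survives: the file has just 2 lines
```

Cheap structural checks like this catch a class of hallucinations for free, so the more expensive LLM-based verification only has to judge findings that are at least plausible.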
The overall developer experience is the true differentiator for the product, not the LLM or the prompts. With so much hype around AI, it's easy to overlook the importance of the "app" in an "AI-powered app."

Why AI code reviewers matter

Both Sam and I use AI tools heavily in our development workflow, including ChatGPT, Claude, Cursor, GitHub Copilot and Amazon Q. While these tools are incredibly useful, they often produce suboptimal or insecure code. Luckily, Evolua identifies these coding issues as soon as we open the PR! If you've been using AI to write code, you should try Evolua today to catch bad code before it reaches production. ;-)

Links

[2] Demo of Evolua
[3] EventBridge best practice: why you should wrap events in event envelopes
[4] Bedrock FAQs
[5] Reddit: Why did AWS reset everyone's Bedrock Quota to 0? All production apps are down
[6] Improve throughput with cross-region inference
[7] Lumigo, the best observability platform for serverless architectures
Join 14K readers and level up your AWS game with just 5 mins a week. Every Monday, I share practical tips, tutorials and best practices for building serverless architectures on AWS.