AWS re:Invent 2025 - Best practices for serverless developers (CNS403)
This presentation by Julian Wood, a Principal Serverless Developer Advocate at AWS, provides actionable insights and best practices for building secure, high-scale, and high-performance serverless applications on AWS. He uses the story of "Emily's Pastries" to illustrate the evolution of a business and its architecture from a small startup to an international success, highlighting common challenges and their serverless solutions.
Here are the key takeaways and best practices:
1. Sizing and Organizing Serverless Applications (4:11)
Problem
Monolithic Lambda functions become difficult to manage and debug, and can trigger cascading failures (4:16-4:36). Organic growth can lead to too many functions, repos, and stacks, creating chaos (5:27-5:33).
Solution
- Single Responsibility Principle: Each Lambda function should have one clear responsibility, avoiding "do-it-all" functions (4:44).
- Right-Sizing: Lambda allocates CPU proportionally to memory, so size memory for optimal performance and cost (4:57-5:04). Higher memory can often reduce total cost if your code is CPU-bound (5:57-6:03).
- Pragmatic Approach: Start with cohesive functions and split only when real pain emerges (5:24-5:26).
- Domain-Driven Organization: Teams can choose their preferred runtime and tooling based on expertise and needs, aligning infrastructure choices with team strengths (5:47-6:11).
- Use a Framework: Utilize frameworks like SAM or CDK to simplify development (6:26-6:41).
- Repo Management: Avoid a repo for each function; a single repo can manage many services (6:43-6:59).
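As a sketch of the framework and repo guidance above, a single SAM template can define several single-purpose functions in one repo and stack. Function names, handlers, and sizes here are illustrative, not from the talk:

```yaml
# Hypothetical SAM template: multiple single-purpose functions, one stack.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  CreateOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/create_order.handler
      Runtime: python3.12
      MemorySize: 512        # right-size: more memory also means more CPU
      Timeout: 10
  ProcessPaymentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/process_payment.handler
      Runtime: python3.12
      MemorySize: 1024
      Timeout: 30
```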
Outcome for Emily: 40% lower compute costs through right-sizing, organized by business domain, and enabled team autonomy (7:08-7:21).
2. Embracing Asynchronous Architecture (7:24)
Problem
Synchronous architectures lead to cascading system failures, frustrated customers, and abandoned orders during peak hours (7:34-7:59). One slow service can affect the entire workflow (8:11-8:13).
Solution
- Event-Driven Design: Use an event bus (like EventBridge) for service-to-service calls to achieve loose coupling, independent scaling, and failure isolation (8:54-9:06).
- Immediate Confirmation & Real-time Updates: Provide immediate order confirmation and use services like AppSync events for real-time updates back to the client (9:27-9:54).
- Lambda Event Source Mapping (ESM): Utilize ESM for robust async processing from various event sources (10:09-10:23). Features include content-based filtering (10:27-10:30), batching (10:30-10:38), and flexible start positions for streams (11:22-11:25).
- SQS for Buffering: Use SQS for message buffering during traffic spikes, providing decoupling, automatic scaling, and message durability (11:41-12:02).
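As a sketch of ESM content-based filtering (assuming SQS messages whose body is JSON with a type field), filter criteria like the following invoke the function only for order_placed messages; other messages are discarded without a Lambda invocation:

```json
{
  "FilterCriteria": {
    "Filters": [
      { "Pattern": "{\"body\": {\"type\": [\"order_placed\"]}}" }
    ]
  }
}
```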
SQS Configuration Best Practices
- Set the visibility timeout to at least six times the Lambda function timeout (12:34-12:44).
- Configure redrive policy for Dead Letter Queues (DLQ) (12:46-13:03).
- Set a long message retention period (12:53-13:03).
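The SQS guidance above can be sketched as a small helper that derives queue attributes (in the shape accepted by the SQS SetQueueAttributes API) from a function's timeout. The helper and its defaults are illustrative, not from the talk:

```python
import json


def sqs_queue_attributes(function_timeout_s: int, dlq_arn: str,
                         max_receives: int = 5) -> dict:
    """Derive SQS queue attributes from a Lambda function's timeout
    (hypothetical helper following the best practices above)."""
    return {
        # Visibility timeout: at least 6x the function timeout, so a
        # message is not redelivered while a retry is still in flight.
        "VisibilityTimeout": str(6 * function_timeout_s),
        # Keep messages long enough to debug and replay (14 days is the max).
        "MessageRetentionPeriod": str(14 * 24 * 3600),
        # Redrive to a DLQ after max_receives failed processing attempts.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,
        }),
    }
```

For a function with a 30-second timeout, this yields a 180-second visibility timeout.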
Lambda ESM Configuration
- Filtering: Use positive or negative filtering to process only necessary messages, saving costs (13:48-14:18).
- Batch Sizes & Windows: Adjust batch sizes (e.g., start with 10) for efficient processing, and use batch windows to gather more messages per invocation during low traffic (14:20-14:50).
- Partial Batch Item Failures: Report failed records back to SQS to avoid retrying entire batches (14:53-15:04).
- Flow Control: Set maxConcurrency on the ESM to prevent overwhelming downstream services (15:09-15:27).
- Reserved Concurrency: Use Lambda reserved concurrency to guarantee a function can scale; if combined with ESM maxConcurrency, set reserved concurrency higher than maxConcurrency (15:33-15:56).
- On-Failure Destinations: Configure Lambda onFailure destinations for invocation issues, complementing SQS DLQs (16:00-16:26).
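Partial batch item failure reporting can be sketched as follows. The handler needs ReportBatchItemFailures enabled on the event source mapping; process is a stand-in for real per-message business logic:

```python
def process(body: str) -> None:
    # Placeholder business logic for this sketch; fails on bad input.
    if body == "bad":
        raise ValueError("cannot process message")


def handler(event, context=None):
    """SQS-triggered handler that reports partial batch item failures,
    so only failed messages return to the queue for retry."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # This message becomes visible again after the visibility
            # timeout; the rest of the batch counts as processed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list tells Lambda the whole batch succeeded.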
Outcome for Emily: Drastically fewer Lambda functions and cold starts, improved latency, better error handling, and a monthly saving of $1,500 from eliminating unnecessary Lambda functions (23:38-23:49).
3. Avoiding Unnecessary Work with Step Functions and Direct Integrations (16:57)
Problem
Many Lambda functions do nothing more than simple data transformation or routing, incurring compute costs and cold starts for code that contains no custom business logic (17:09-17:27).
Solution
- Configuration as Code: Replace many Lambda functions with native capabilities of other services (17:29-17:39).
- Direct Service Integrations: Use direct integrations between API Gateway and downstream services (e.g., DynamoDB, SQS, Step Functions) to avoid using Lambda as a proxy (17:59-18:16).
- Step Functions for Orchestration: Leverage Step Functions for complex workflows, direct service integrations, and built-in JSON processing (19:37-20:19).
- EventBridge Integration: Use Step Functions within a domain microservice and then emit events onto EventBridge for communication between different domains (20:22-20:48).
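A direct service integration in Amazon States Language might look like the following sketch, writing an order to DynamoDB without an intermediate Lambda function (table, state, and field names are hypothetical):

```json
{
  "StartAt": "SaveOrder",
  "States": {
    "SaveOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:putItem",
      "Parameters": {
        "TableName": "Orders",
        "Item": {
          "orderId": { "S.$": "$.orderId" },
          "status": { "S": "RECEIVED" }
        }
      },
      "End": true
    }
  }
}
```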
Standard vs. Express Workflows
- Standard Workflows: Long-running (up to a year), asynchronous, priced per state transition (20:53-21:10).
- Express Workflows: Fast and high-throughput, limited to 5 minutes, can be invoked synchronously, priced by number of executions, duration, and memory (20:57-21:19).
- Combine Workflows: Combine standard and express workflows for scenarios requiring both long-running processes and real-time responses (21:23-22:10).
Durable Functions (New)
Durable Functions let you build workflows in your favorite programming language inside Lambda, with checkpointing to suspend and resume long-running operations (22:16-22:56). Executions can run for up to a year, and you don't pay for wait time (22:54-22:56).
4. Optimizing Lambda Performance (24:03)
Problem
Poor Lambda performance leads to abandoned orders and revenue loss during peak times (24:14-24:22).
Solution
- Lambda Managed Instances (New): Offer the full range and specificity of EC2 instance types with Lambda's operational simplicity, targeting high-scale, steady-state workloads (24:32-25:25).
- Control Performance Factors: Focus on optimizing memory allocation, initialization code, function handler code, and package size (26:24-26:30).
- Memory Allocation: Adding more memory proportionally allocates more CPU, which can improve performance and reduce cost for CPU-bound code (26:31-27:17).
- Parallel Processing: Utilize multi-threading within Lambda function code for batch processing to take advantage of multiple cores (27:20-27:52).
- Lambda Power Tuning: Use this open-source tool to visualize and fine-tune the memory/power configuration of Lambda functions (28:02-28:24).
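To see why more memory can cost less overall for CPU-bound code, here is a rough cost model. The price constant is illustrative, and it assumes the idealized case where doubling memory (and thus CPU) halves duration:

```python
def invocation_cost(memory_mb: int, duration_ms: float,
                    price_per_gb_s: float = 0.0000166667) -> float:
    """Approximate per-invocation compute cost in USD
    (GB-seconds times an illustrative price; ignores the request charge)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_s


# For perfectly CPU-bound code, doubling memory roughly halves duration,
# so the compute cost stays about the same -- while latency is cut in half.
cost_small = invocation_cost(1024, 800)   # 1 GB for 800 ms
cost_large = invocation_cost(2048, 400)   # 2 GB for 400 ms
```

Lambda Power Tuning automates exactly this kind of comparison against your real function instead of an idealized model.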
Cold Start Optimization
Focus on latency-sensitive user-facing workloads, as async workloads can usually tolerate some cold start latency (28:40-29:06).
- Efficient Initialization: Reduce package size, import specific modules, minify production code, and use lazy initialization (29:16-29:46).
- Connection Management: Establish and reuse connections during the init phase (29:48-30:05).
- Runtime-Specific Optimizations: Apply specific optimization techniques for Java (SnapStart, SDK usage), JavaScript/TypeScript (modular SDKs, tree shaking), .NET (AOT compilation), and Python (import strategy, package size) (30:24-31:03).
- Native Compilation: Consider GraalVM for Java and .NET AOT for significant performance benefits in CPU-intensive, predictable workloads (31:05-31:49).
- Provisioned Concurrency: Pre-warm execution environments to eliminate cold starts for all languages. Best for predictable traffic patterns and mission-critical APIs (32:02-33:50).
- Lambda SnapStart: Runs the cold-start process when you publish a function version and resumes a snapshot on invocation, at no additional cost for Java. Ideal for cost-sensitive applications with unpredictable traffic (33:50-34:37).
- Snapshot Optimization: Use beforeCheckpoint hooks (Java) and RegisterBeforeSnapshot (.NET) to preload dependencies and aggressively perform tiered compilation during the init phase (34:40-35:50).
- Upgrade Runtime: Regularly upgrading your runtime can lead to significant performance improvements and cost reductions (36:37-36:46).
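The lazy-initialization and connection-reuse advice above can be sketched as follows. A real handler would cache something like a boto3 DynamoDB table resource; a plain dict stands in here so the sketch is self-contained, and all names are hypothetical:

```python
import json

# Module scope runs once per execution environment (during the cold
# start), so expensive setup done here is reused by warm invocations.
_client_cache = {}


def get_table(name: str):
    """Lazily create and cache a (stand-in) database client.

    The connection is only established on first use, and every later
    invocation in the same environment reuses the cached object.
    """
    if name not in _client_cache:
        _client_cache[name] = {"table": name, "connected": True}
    return _client_cache[name]


def handler(event, context=None):
    table = get_table("Orders")  # first call connects; later calls reuse
    return {"statusCode": 200, "body": json.dumps({"table": table["table"]})}
```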