AWS re:Invent 2025 - Best practices for serverless developers (CNS403)
This presentation by Julian Wood, a Principal Serverless Developer Advocate at AWS, provides actionable insights and best practices for building secure, high-scale, and high-performance serverless applications on AWS. He uses the story of "Emily's Pastries" to illustrate the evolution of a business and its architecture from a small startup to an international success, highlighting common challenges and their serverless solutions.
Here are the key takeaways and best practices:
1. Sizing and Organizing Serverless Applications (4:11)
Problem
Monolithic Lambda functions become difficult to manage and debug, and can trigger cascading failures (4:16-4:36). Organic growth can lead to too many functions, repos, and stacks, creating chaos (5:27-5:33).
Solution
- Single Responsibility Principle: Each Lambda function should have one clear responsibility, avoiding "do-it-all" functions (4:44).
- Right-Sizing: Lambda allocates CPU proportionally to memory, so size memory for optimal performance and cost (4:57-5:04). Higher memory can often reduce total cost if your code is CPU-bound (5:57-6:03).
- Pragmatic Approach: Start with cohesive functions and split only when real pain emerges (5:24-5:26).
- Domain-Driven Organization: Teams can choose their preferred runtime and tooling based on expertise and needs, aligning infrastructure choices with team strengths (5:47-6:11).
- Use a Framework: Utilize frameworks like SAM or CDK to simplify development (6:26-6:41).
- Repo Management: Avoid a repo for each function; a single repo can manage many services (6:43-6:59).
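As a sketch of the framework and repo guidance above, a single SAM template can define several single-purpose functions in one repo and stack. Function names, handlers, and sizes here are illustrative, not from the talk:

```yaml
# Hypothetical SAM template: multiple single-purpose functions, one stack.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  CreateOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/create_order.handler
      Runtime: python3.12
      MemorySize: 512        # right-size: more memory also means more CPU
      Timeout: 10
  ProcessPaymentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/process_payment.handler
      Runtime: python3.12
      MemorySize: 1024
      Timeout: 30
```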
Outcome for Emily: 40% lower compute costs through right-sizing, organized by business domain, and enabled team autonomy (7:08-7:21).
2. Embracing Asynchronous Architecture (7:24)
Problem
Synchronous architectures lead to cascading system failures, frustrated customers, and abandoned orders during peak hours (7:34-7:59). One slow service can affect the entire workflow (8:11-8:13).
Solution
- Event-Driven Design: Use an event bus (like EventBridge) for service-to-service calls to achieve loose coupling, independent scaling, and failure isolation (8:54-9:06).
- Immediate Confirmation & Real-time Updates: Provide immediate order confirmation and use services like AppSync events for real-time updates back to the client (9:27-9:54).
- Lambda Event Source Mapping (ESM): Utilize ESM for robust async processing from various event sources (10:09-10:23). Features include content-based filtering (10:27-10:30), batching (10:30-10:38), and flexible start positions for streams (11:22-11:25).
- SQS for Buffering: Use SQS for message buffering during traffic spikes, providing decoupling, automatic scaling, and message durability (11:41-12:02).
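As a sketch of ESM content-based filtering (assuming SQS messages whose body is JSON with a type field), filter criteria like the following invoke the function only for order_placed messages; other messages are discarded without a Lambda invocation:

```json
{
  "FilterCriteria": {
    "Filters": [
      { "Pattern": "{\"body\": {\"type\": [\"order_placed\"]}}" }
    ]
  }
}
```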
SQS Configuration Best Practices
- Set the visibility timeout to at least six times the Lambda function timeout (12:34-12:44).
- Configure redrive policy for Dead Letter Queues (DLQ) (12:46-13:03).
- Set a long message retention period (12:53-13:03).
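The SQS guidance above can be sketched as a small helper that derives queue attributes (in the shape accepted by the SQS SetQueueAttributes API) from a function's timeout. The helper and its defaults are illustrative, not from the talk:

```python
import json


def sqs_queue_attributes(function_timeout_s: int, dlq_arn: str,
                         max_receives: int = 5) -> dict:
    """Derive SQS queue attributes from a Lambda function's timeout
    (hypothetical helper following the best practices above)."""
    return {
        # Visibility timeout: at least 6x the function timeout, so a
        # message is not redelivered while a retry is still in flight.
        "VisibilityTimeout": str(6 * function_timeout_s),
        # Keep messages long enough to debug and replay (14 days is the max).
        "MessageRetentionPeriod": str(14 * 24 * 3600),
        # Redrive to a DLQ after max_receives failed processing attempts.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,
        }),
    }
```

For a function with a 30-second timeout, this yields a 180-second visibility timeout.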
Lambda ESM Configuration
- Filtering: Use positive or negative filtering to process only necessary messages, saving costs (13:48-14:18).
- Batch Sizes & Windows: Adjust batch sizes (e.g., start with 10) for efficient processing, and use batch windows to gather more messages per invocation during low traffic (14:20-14:50).
- Partial Batch Item Failures: Report failed records back to SQS to avoid retrying entire batches (14:53-15:04).
- Flow Control: Set maxConcurrency on the ESM to prevent overwhelming downstream services (15:09-15:27).
- Reserved Concurrency: Use Lambda reserved concurrency to guarantee a function can scale; if combined with ESM maxConcurrency, set reserved concurrency higher than maxConcurrency (15:33-15:56).
- On-Failure Destinations: Configure Lambda onFailure destinations for invocation issues, complementing SQS DLQs (16:00-16:26).
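Partial batch item failure reporting can be sketched as follows. The handler needs ReportBatchItemFailures enabled on the event source mapping; process is a stand-in for real per-message business logic:

```python
def process(body: str) -> None:
    # Placeholder business logic for this sketch; fails on bad input.
    if body == "bad":
        raise ValueError("cannot process message")


def handler(event, context=None):
    """SQS-triggered handler that reports partial batch item failures,
    so only failed messages return to the queue for retry."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # This message becomes visible again after the visibility
            # timeout; the rest of the batch counts as processed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list tells Lambda the whole batch succeeded.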
Outcome for Emily: Drastically fewer Lambda functions and cold starts, improved latency, better error handling, and a monthly saving of $1,500 from eliminating unnecessary Lambda functions (23:38-23:49).
3. Avoiding Unnecessary Work with Step Functions and Direct Integrations (16:57)
Problem
Many Lambda functions do nothing more than simple data transformation or routing, incurring compute costs and cold starts for code that contains no custom business logic (17:09-17:27).
Solution
- Configuration as Code: Replace many Lambda functions with native capabilities of other services (17:29-17:39).
- Direct Service Integrations: Use direct integrations between API Gateway and downstream services (e.g., DynamoDB, SQS, Step Functions) to avoid using Lambda as a proxy (17:59-18:16).
- Step Functions for Orchestration: Leverage Step Functions for complex workflows, direct service integrations, and built-in JSON processing (19:37-20:19).
- EventBridge Integration: Use Step Functions within a domain microservice and then emit events onto EventBridge for communication between different domains (20:22-20:48).
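A direct service integration in Amazon States Language might look like the following sketch, writing an order to DynamoDB without an intermediate Lambda function (table, state, and field names are hypothetical):

```json
{
  "StartAt": "SaveOrder",
  "States": {
    "SaveOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:putItem",
      "Parameters": {
        "TableName": "Orders",
        "Item": {
          "orderId": { "S.$": "$.orderId" },
          "status": { "S": "RECEIVED" }
        }
      },
      "End": true
    }
  }
}
```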
Standard vs. Express Workflows
- Standard Workflows: Long-running (up to a year), asynchronous, priced per state transition (20:53-21:10).
- Express Workflows: Fast and high-throughput, limited to 5 minutes, can be invoked synchronously, priced by number of executions, duration, and memory (20:57-21:19).
- Combine Workflows: Combine standard and express workflows for scenarios requiring both long-running processes and real-time responses (21:23-22:10).
Durable Functions (New)
Durable Functions let you build workflows in your favorite programming language inside Lambda, with checkpointing to suspend and resume long-running operations (22:16-22:56). Executions can run for up to a year, and you don't pay for wait time (22:54-22:56).
4. Optimizing Lambda Performance (24:03)
Problem
Poor Lambda performance leads to abandoned orders and revenue loss during peak times (24:14-24:22).
Solution
- Lambda Managed Instances (New): Offer the full range and specificity of EC2 instance types with Lambda's operational simplicity, targeting high-scale, steady-state workloads (24:32-25:25).
- Control Performance Factors: Focus on optimizing memory allocation, initialization code, function handler code, and package size (26:24-26:30).
- Memory Allocation: Adding more memory proportionally allocates more CPU, which can improve performance and reduce cost for CPU-bound code (26:31-27:17).
- Parallel Processing: Utilize multi-threading within Lambda function code for batch processing to take advantage of multiple cores (27:20-27:52).
- Lambda Power Tuning: Use this open-source tool to visualize and fine-tune the memory/power configuration of Lambda functions (28:02-28:24).
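To see why more memory can cost less overall for CPU-bound code, here is a rough cost model. The price constant is illustrative, and it assumes the idealized case where doubling memory (and thus CPU) halves duration:

```python
def invocation_cost(memory_mb: int, duration_ms: float,
                    price_per_gb_s: float = 0.0000166667) -> float:
    """Approximate per-invocation compute cost in USD
    (GB-seconds times an illustrative price; ignores the request charge)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_s


# For perfectly CPU-bound code, doubling memory roughly halves duration,
# so the compute cost stays about the same -- while latency is cut in half.
cost_small = invocation_cost(1024, 800)   # 1 GB for 800 ms
cost_large = invocation_cost(2048, 400)   # 2 GB for 400 ms
```

Lambda Power Tuning automates exactly this kind of comparison against your real function instead of an idealized model.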
Cold Start Optimization
Focus on latency-sensitive user-facing workloads, as async workloads can usually tolerate some cold start latency (28:40-29:06).
- Efficient Initialization: Reduce package size, import specific modules, minify production code, and use lazy initialization (29:16-29:46).
- Connection Management: Establish and reuse connections during the init phase (29:48-30:05).
- Runtime-Specific Optimizations: Apply specific optimization techniques for Java (SnapStart, SDK usage), JavaScript/TypeScript (modular SDKs, tree shaking), .NET (AOT compilation), and Python (import strategy, package size) (30:24-31:03).
- Native Compilation: Consider GraalVM for Java and .NET AOT for significant performance benefits in CPU-intensive, predictable workloads (31:05-31:49).
- Provisioned Concurrency: Pre-warm execution environments to eliminate cold starts for all languages. Best for predictable traffic patterns and mission-critical APIs (32:02-33:50).
- Lambda SnapStart: Runs the cold-start process when you publish a function version and resumes a snapshot on invocation, at no additional cost for Java. Ideal for cost-sensitive applications with unpredictable traffic (33:50-34:37).
- Snapshot Optimization: Use beforeCheckpoint hooks (Java) and RegisterBeforeSnapshot (.NET) to preload dependencies and aggressively perform tiered compilation during the init phase (34:40-35:50).
- Upgrade Runtime: Regularly upgrading your runtime can lead to significant performance improvements and cost reductions (36:37-36:46).
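The lazy-initialization and connection-reuse advice above can be sketched as follows. A real handler would cache something like a boto3 DynamoDB table resource; a plain dict stands in here so the sketch is self-contained, and all names are hypothetical:

```python
import json

# Module scope runs once per execution environment (during the cold
# start), so expensive setup done here is reused by warm invocations.
_client_cache = {}


def get_table(name: str):
    """Lazily create and cache a (stand-in) database client.

    The connection is only established on first use, and every later
    invocation in the same environment reuses the cached object.
    """
    if name not in _client_cache:
        _client_cache[name] = {"table": name, "connected": True}
    return _client_cache[name]


def handler(event, context=None):
    table = get_table("Orders")  # first call connects; later calls reuse
    return {"statusCode": 200, "body": json.dumps({"table": table["table"]})}
```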