Why Duplicating Environments for Microservices Backfires
Originally posted on The New Stack, by Arjun Iyer.
Typical ways of testing microservices are too slow and unsustainable as engineering teams grow and architectures grow more complex.
In microservices development, how long it takes to test your code changes in a production-like environment is critical. Long microservices testing cycles can significantly hamper developer productivity and slow down the entire release cadence.
With developers needing to run these tests multiple times a day, even small delays can compound into major bottlenecks. As engineering teams and architectures grow, finding an efficient, scalable solution for testing microservices becomes paramount.
Challenges of On-Demand Environments
Many teams turn to on-demand environments as a solution, spinning up separate instances for each developer or team. Various implementations of this method spin up an environment within virtual machines (VMs), Kubernetes namespaces or even separate Kubernetes clusters.
On-demand environments using namespace isolation in Kubernetes.
While this approach seems logical at first, it often leads to challenges, such as the following.
High Management Burden
As system complexity increases, each environment requires a multitude of components: stateless services, load balancers, API gateways, databases, message queues and various cloud resources. Managing and updating these components across multiple environments becomes increasingly difficult.
Divergence From Production
To manage costs and complexity, teams often resort to using mocks and emulators for certain components. This leads to a divergence from the production environment, potentially reducing the reliability of tests.
Data Management Complexities
Maintaining and synchronizing data across multiple databases in numerous ephemeral environments is a significant challenge. This is especially problematic when dealing with large data sets or complex data relationships.
Environment Staleness
As the main branch of each microservice is continuously updated, ephemeral environments can quickly become outdated. This leads to tests being run against old versions of dependent services, reducing their effectiveness.
Increased Startup Times
As the complexity of these environments grows, so does the time required to spin them up. This delay directly impacts developer productivity and can slow the entire software development process.
Cost Implications
The financial impact of running multiple full environments is significant. Consider this example:
For a system with 50 microservices, you might need an AWS EC2 m6a.8xlarge instance (32 vCPUs, 128 GiB memory) that costs approximately $1.30 per hour. Running this 24/7 for a 30-day month costs $936, or $11,232 per year for a single environment. To run 50 such environments, the annual cost skyrockets to $561,600, and that’s just for compute, not including storage, data transfer or managed services.
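The arithmetic behind these figures can be reproduced in a few lines. This is only a sketch of the calculation above; the 30-day month is an assumption implied by the $936 figure, and the function name is illustrative.

```python
# Reproduces the compute-only cost example from the text.
HOURLY_RATE_CENTS = 130   # approximately $1.30 per hour, as quoted above
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30       # assumption: a 30-day month
MONTHS_PER_YEAR = 12
ENVIRONMENTS = 50         # number of full environments kept running


def environment_costs(hourly_rate_cents: int, environments: int):
    """Return (monthly, yearly, yearly for all environments) in dollars."""
    # Work in integer cents to avoid floating-point rounding noise.
    monthly = hourly_rate_cents * HOURS_PER_DAY * DAYS_PER_MONTH
    yearly = monthly * MONTHS_PER_YEAR
    all_environments = yearly * environments
    return monthly / 100, yearly / 100, all_environments / 100


monthly, yearly, all_environments = environment_costs(HOURLY_RATE_CENTS, ENVIRONMENTS)
print(f"one environment, per month: ${monthly:,.0f}")
print(f"one environment, per year:  ${yearly:,.0f}")
print(f"{ENVIRONMENTS} environments, per year:  ${all_environments:,.0f}")
```

Changing ENVIRONMENTS or the hourly rate shows how quickly the total scales with team size, since every additional environment carries the full yearly cost.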
Shared Environments and Sandboxes Instead
Shared environments with application-layer isolation, called “sandboxes,” have emerged as a way to address these challenges. This concept, similar to what Uber has implemented for end-to-end testing, offers a more efficient and scalable solution.
In this model, instead of spinning up separate environments for each developer or team, you use a shared environment. Within this shared space, you provide “tunable isolation” for every test client by sandboxing services and resources as needed. The services within sandboxes are accessed by dynamically routing requests based on request headers.
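Header-based routing can be illustrated with a minimal sketch. Everything named here is hypothetical: the `x-sandbox-id` header, the service addresses and the lookup tables are stand-ins, and a real deployment would typically do this in a service mesh or sidecar proxy rather than in application code.

```python
from typing import Mapping

ROUTING_HEADER = "x-sandbox-id"  # hypothetical header name

# Baseline (shared) version of every service. Addresses are made up.
BASELINE = {
    "orders": "orders.baseline.svc:8080",
    "payments": "payments.baseline.svc:8080",
}

# Each sandbox overrides only the services under test.
SANDBOXES = {
    "sb-42": {"payments": "payments.sb-42.svc:8080"},
}


def resolve_upstream(service: str, headers: Mapping[str, str]) -> str:
    """Pick the sandboxed version of a service if the request asks for one."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    sandbox_id = normalized.get(ROUTING_HEADER)
    overrides = SANDBOXES.get(sandbox_id, {})
    # Fall back to the shared baseline for anything the sandbox does not override.
    return overrides.get(service, BASELINE[service])


print(resolve_upstream("payments", {"X-Sandbox-Id": "sb-42"}))
print(resolve_upstream("orders", {"X-Sandbox-Id": "sb-42"}))
print(resolve_upstream("payments", {}))
```

Only the first call reaches the sandboxed service. The other two fall through to the baseline, which is what lets many sandboxes share one environment.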
Sandboxes within a shared environment.
This approach offers several advantages:
- Resource efficiency: By sharing the underlying infrastructure, you significantly reduce resource usage and associated costs.
- Consistency: All tests run against the same baseline environment, eliminating “it works on my machine” issues and providing more reliable results.
- Reduced maintenance overhead: With a single shared environment to maintain, it’s more manageable to keep everything up to date.
- Faster startup time: Sandboxes can be created almost instantaneously, allowing developers to start testing without delay.
- Production-like testing: The shared environment can more closely mimic the production environment, improving the reliability and relevance of test cases.
Implementation Considerations
While the shared environment approach offers significant benefits, there are several key considerations for implementation.
Context Propagation
To ensure proper isolation within the shared environment, it’s crucial to propagate context through the services. This can be achieved efficiently using OpenTelemetry instrumentation. Its baggage and tracecontext standards are particularly useful for maintaining context across service boundaries.
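To make concrete what is being propagated, here is a hand-rolled, standard-library-only sketch of reading a W3C `baggage` header and forwarding the propagation headers to a downstream call. In practice the OpenTelemetry SDK's propagators would do this for you; the `sandbox-id` key and the helper names are assumptions for illustration, not part of any standard or library.

```python
from urllib.parse import quote, unquote

SANDBOX_KEY = "sandbox-id"  # hypothetical baggage key
PROPAGATION_HEADERS = ("traceparent", "tracestate", "baggage")


def parse_baggage(header_value: str) -> dict:
    """Parse 'k1=v1,k2=v2;prop' into a dict, ignoring optional properties."""
    entries = {}
    for member in header_value.split(","):
        key_value = member.split(";", 1)[0].strip()  # drop ';property' suffixes
        if "=" not in key_value:
            continue  # skip empty or malformed members
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def format_baggage(entries: dict) -> str:
    """Serialize a dict back into a baggage header value (percent-encoded)."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in entries.items())


def outgoing_headers(incoming: dict) -> dict:
    """Copy trace context and baggage from an inbound request to an outbound one."""
    normalized = {name.lower(): value for name, value in incoming.items()}
    return {name: normalized[name] for name in PROPAGATION_HEADERS if name in normalized}


inbound = {"Baggage": "sandbox-id=sb-42, user=alice;ttl=5", "Accept": "*/*"}
print(parse_baggage(inbound["Baggage"]).get(SANDBOX_KEY))
print(outgoing_headers(inbound))
```

The important property is that every service forwards these headers unchanged on every outbound call. If one hop drops them, requests further down the chain lose their sandbox identity and are served by the baseline.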
Data Isolation
Careful attention must be paid to data partitioning, especially for data being edited or deleted. A fundamental rule is that a test should not be able to mutate data it doesn’t create. This ensures that concurrent tests don’t interfere with each other’s data, maintaining the integrity of each sandbox.
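The rule that a test may only mutate data it created can be sketched with an in-memory store that records an owner for every row. The class and method names below are hypothetical; in a real system the same idea might be enforced with an owner or tenant column and a check in the data-access layer.

```python
class OwnershipError(PermissionError):
    """Raised when a test tries to change data it did not create."""


class SandboxedStore:
    def __init__(self):
        self._rows = {}  # key -> (owner, value); owner None means baseline data

    def seed(self, key, value):
        """Load shared baseline data that no test is allowed to mutate."""
        self._rows[key] = (None, value)

    def create(self, owner, key, value):
        if key in self._rows:
            raise KeyError(f"{key!r} already exists")
        self._rows[key] = (owner, value)

    def read(self, key):
        # Reads are unrestricted: tests may look at shared data freely.
        return self._rows[key][1]

    def update(self, owner, key, value):
        self._require_owner(owner, key)
        self._rows[key] = (owner, value)

    def delete(self, owner, key):
        self._require_owner(owner, key)
        del self._rows[key]

    def _require_owner(self, owner, key):
        row_owner, _ = self._rows[key]
        # Baseline rows (owner None) are never mutable, whoever asks.
        if row_owner is None or row_owner != owner:
            raise OwnershipError(f"{owner!r} did not create {key!r}")


store = SandboxedStore()
store.seed("customer:1", {"name": "baseline customer"})
store.create("test-a", "order:100", {"status": "new"})
store.update("test-a", "order:100", {"status": "paid"})  # allowed: test-a owns it
try:
    store.update("test-b", "order:100", {"status": "cancelled"})
except OwnershipError as error:
    print("rejected:", error)
```

Because reads stay open while writes are restricted to the creator, concurrent tests can share baseline data without being able to corrupt it for one another.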
Message Queue Handling
Special consideration is needed for message queues to ensure that sandboxes don’t compete for the same messages. This might involve implementing custom routing logic or using separate queues for each sandbox. Refer to Testing Kafka-based Asynchronous Workflows Using OpenTelemetry for details on how to implement isolation with asynchronous message queues.
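One possible form of the custom routing logic is a filter that every consumer applies before processing a message. The sketch below assumes each message carries the sandbox identifier in its headers and that every consumer sees every message (on Kafka, that would typically mean a separate consumer group per sandboxed consumer). The function and parameter names are hypothetical.

```python
from typing import Optional, Set


def should_process(
    message_sandbox: Optional[str],
    consumer_sandbox: Optional[str],
    active_sandboxes: Set[str],
) -> bool:
    """Decide whether this consumer should handle the message.

    message_sandbox:  sandbox id carried by the message, None for baseline traffic
    consumer_sandbox: sandbox this consumer belongs to, None for the baseline consumer
    active_sandboxes: sandboxes that run their own version of this consumer
    """
    if consumer_sandbox is not None:
        # A sandboxed consumer only takes messages addressed to its sandbox.
        return message_sandbox == consumer_sandbox
    # The baseline consumer takes everything that no sandboxed consumer claims.
    return message_sandbox is None or message_sandbox not in active_sandboxes


active = {"sb-42"}
consumers = [None, "sb-42"]  # the baseline consumer plus one sandboxed consumer
for message in (None, "sb-42", "sb-99"):
    handlers = [c for c in consumers if should_process(message, c, active)]
    print(f"message from {message!r} handled by {handlers}")
```

Each message ends up with exactly one handler, so sandboxes do not compete for messages and baseline traffic is never consumed by a sandbox. The alternative mentioned above, separate queues per sandbox, trades this filtering logic for more infrastructure to provision.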
Traditional Microservices Testing Approaches Are Unsustainable
As microservices architectures continue to grow in complexity, the traditional approach of duplicating entire environments for testing becomes increasingly unsustainable. The shared environment model, with its use of sandboxes for isolation, offers a more efficient, cost-effective and scalable solution.
This approach has already proven successful in several high-profile cases. Signadot has helped companies like Brex, Earnest and DoorDash streamline their microservices testing processes and improve developer productivity. Their experiences demonstrate the real-world applicability and benefits of this new approach to microservices testing.