There are multiple ways one can build a scalable, robust, and reliable cloud-hosted product, and whether each particular way is “the right one” depends on how well it meets the requirements and expectations of product and market fit, usability, and cost.
In this blog, we will highlight one aspect of CyberCube’s product architecture and how it evolved over time. Namely, we describe our journey from a predominantly AWS serverless architecture towards heavy use of containers via Kubernetes (k8s). Since every environment (a company’s product needs, market and customer expectations, the skill set of the engineering organization, time to market, etc.) differs, the reasoning and considerations in this article should not be taken as gospel. Instead, they should be viewed as our experience working through certain challenges, and the lessons we learned along the way.
Although the AWS serverless approach provides many benefits, sometimes it is more reasonable to move to a more classical approach – in our case, Kubernetes. We chose k8s mainly because of four factors: scalability, manageability, a large community, and relative familiarity to developers.
NB! In the scope of this article, we will be looking at three AWS serverless services in particular: AWS Lambda, AWS Batch, and AWS Step Functions.
So, why did we decide to move?
For us, the answer consists of two parts: tech debt accumulated in the interest of expedience and issues arising from the technology itself.
The first group: tech debt
Code management - Given our prior experience, we chose a pretty common approach: one repo – one service, where in our case a service equals a lambda. Although this initially seemed like a good idea, as time passed we started to face copy-pasting between repos, growth of each lambda’s codebase, and the resulting difficulty of changing anything in the system.
Quarkus - Being Java developers, we searched for a framework that would counterbalance AWS Lambda’s limitations. After some research, we chose Quarkus. In hindsight, choosing this technology back in 2019 was a huge mistake, because in addition to the AWS limitations we also inherited all of Quarkus’s. These included:
1. The inability to use third-party libraries without tweaks while maintaining functionality.
2. Manual registration for reflection.
3. Errors in native mode that never appeared in JVM mode.
4. 10+ minute compilation times for native images.
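To illustrate point 2 above: reflective code like the following runs fine on the JVM, but in a GraalVM native image it fails at runtime unless the class is explicitly registered for reflection (in Quarkus, via `@RegisterForReflection` or build-time config). A minimal sketch with an invented class name:

```java
// Minimal sketch: the kind of reflective lookup that works on the JVM but
// breaks in a native image unless the class is registered for reflection.
// The class and method names here are purely illustrative.
class ReflectionDemo {
    public static class Payload {
        @Override public String toString() { return "payload"; }
    }

    // Looks up a class by name at runtime; a native image only keeps
    // reflection metadata for classes explicitly registered at build time.
    static Object instantiate(String className) throws Exception {
        Class<?> cls = Class.forName(className);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(instantiate("ReflectionDemo$Payload"));
    }
}
```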
Complex flows handled by AWS - especially Step Functions. Although such services provide a wide range of functionality, they require you to split your code into two layers – AWS configuration and application code – that depend on each other. This usually leads to maintenance overhead and the inability to test anything locally.
The second group: issues with the technology itself
AWS limitations and/or restrictions - Although many of these limitations encourage good practices, sometimes they are too restrictive and leave no room to maneuver when flexibility is needed. For example, the 256 KB payload limit for asynchronous lambda invocations, or having to set up an AWS Batch execution script in CloudFormation with all of the execution arguments hard-coded there. Yes, there are ways around these, but that doesn’t make them convenient.
Restrictive runtime (AKA watch how you write the code) - For instance, in lambdas you cannot use threads most of the time, nor run async tasks after returning a response to the client, because the execution environment may be frozen as soon as the handler returns.
Inability to set up and test locally - Unfortunately, there are no great tools to run and test your code locally. The main option is LocalStack, but it has many limitations, such as the inability to debug state machines, and only remote debugging for deployed lambdas.
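As an example of the first limitation listed above (the 256 KB cap on async invocation payloads), you end up guarding against it in application code before publishing events. A minimal sketch – the constant and helper are ours, not an AWS SDK API:

```java
import java.nio.charset.StandardCharsets;

// Sketch: guard against Lambda's 256 KB payload limit for asynchronous
// invocations before sending an event. Names here are illustrative.
class PayloadGuard {
    // 256 KB limit for asynchronous Lambda invocations
    static final int ASYNC_PAYLOAD_LIMIT_BYTES = 256 * 1024;

    // Returns true if the serialized payload fits within the async limit.
    static boolean fitsAsyncLimit(String json) {
        return json.getBytes(StandardCharsets.UTF_8).length <= ASYNC_PAYLOAD_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        String small = "{\"id\":42}";
        System.out.println(fitsAsyncLimit(small)); // a tiny payload fits
    }
}
```

If the payload is too large, a common workaround is to upload it to S3 and pass a reference instead – workable, but exactly the kind of inconvenience the text describes.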
Considerations if you need to move to k8s
Based on my own experience, I would raise several questions to decide whether it is reasonable to move to something more classical.
- Are you satisfied with the current setup, or did you get it right?
Being comfortable - as an architect or lead developer - with the current choice is purely subjective, but if you have never even questioned it, chances are you are good with what you have.
- What is the computational load?
Lambda, for example, is roughly five times more expensive than EC2 when you compare the price of equivalent compute time and resources, so if you have a constant load it might be reasonable to move.
- What is the Usage Pattern?
For example, lambda is effective only if the system is event-based and the average execution time is short. If your app spends most of its time waiting for a response from a third-party service, lambda might be a bad fit.
For us, this was the critical one. Our analytics engines are relatively complex, and since most of them use each other’s output to perform calculations, the system involves a relatively high number of sync-like operations. This introduces a lot of boilerplate code in which we wait for async responses from different engines to complete the current task, and it also makes debugging much harder.
- Are you having issues with Code Management?
It is not the main reason for moving, but it is an additional point if you are already considering one. If you have a one repo – one service setup, ask how often you need to copy-paste something between the repos. If you have a one repo – multiple services setup, ask how painful it is to redeploy n services after a small change.
If the answers to two or more of these questions support the migration, chances are you are better off moving.
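The usage-pattern issue raised above – engines waiting on each other’s async output – can be sketched with `CompletableFuture`. Engine names and payloads are invented; the point is the fan-out/fan-in boilerplate that is awkward to express across lambdas:

```java
import java.util.concurrent.CompletableFuture;

// Sketch: engines B and C both need engine A's output, and the final result
// needs both of theirs. All engines and values are invented for illustration.
class EngineFlow {
    static CompletableFuture<Integer> engineA() { return CompletableFuture.supplyAsync(() -> 10); }
    static CompletableFuture<Integer> engineB(int a) { return CompletableFuture.supplyAsync(() -> a + 1); }
    static CompletableFuture<Integer> engineC(int a) { return CompletableFuture.supplyAsync(() -> a * 2); }

    // Each downstream engine must wait for its inputs before proceeding,
    // producing sync-like orchestration over async calls.
    static int run() {
        return engineA()
                .thenCompose(a -> engineB(a).thenCombine(engineC(a), Integer::sum))
                .join();
    }

    public static void main(String[] args) {
        System.out.println(run()); // (10 + 1) + (10 * 2) = 31
    }
}
```

In a long-running service this orchestration lives in one process and one debugger session; split across lambdas, every arrow becomes an event plus a wait.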
Considerations for the move
Let’s assess the topics we need to consider to confirm we want to move and to make the move as painless as possible.
Aspects that need to be understood before the move:
- Running costs - This is arguably the most important consideration, and the calculation is rather straightforward: estimate the resources required to run your system under its current load, then compare Lambda’s price for that usage to the cost of EC2 instances providing the equivalent resources.
- Scalability - Almost all of AWS's serverless services have good (and fast) scalability. Given your usage pattern, try to calculate if you have sudden spikes, and make sure that the waiting time for your new system scale-up is appropriate.
- Memory model - Serverless sometimes makes you less cautious about how you use memory, because most of the services are on-demand. When moving to a long-running service, you should reassess your memory model and run tests to make sure there are no memory leaks.
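The running-cost comparison above can be sketched back-of-the-envelope. Prices here are illustrative: Lambda’s published on-demand rate of about $0.0000166667 per GB-second, versus a hypothetical ~4 GB EC2 instance at $0.0416/hour; real numbers vary by region, instance type, and discounts.

```java
// Back-of-the-envelope cost comparison for a constantly loaded workload.
// All prices are illustrative assumptions, not a quote.
class CostCompare {
    static final double LAMBDA_PER_GB_SECOND = 0.0000166667;

    // Monthly Lambda compute charge for keeping `gb` of memory busy for
    // `secondsPerMonth` seconds in total (ignores per-request fees).
    static double lambdaMonthly(double gb, double secondsPerMonth) {
        return gb * secondsPerMonth * LAMBDA_PER_GB_SECOND;
    }

    public static void main(String[] args) {
        double secondsPerMonth = 30 * 24 * 3600;             // fully loaded month
        double lambda = lambdaMonthly(4.0, secondsPerMonth); // 4 GB, always busy
        double ec2 = 0.0416 * 730;                           // hypothetical ~4 GB instance
        System.out.printf("lambda=%.0f ec2=%.0f%n", lambda, ec2);
    }
}
```

Under these assumptions a fully loaded month costs roughly $173 on Lambda versus roughly $30 on EC2, which is where "about five times more expensive" figures come from; with spiky, mostly idle load the comparison flips.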
Aspects that will definitely be worse:
- Maintenance overhead - One of the main advantages of a serverless system is its small maintenance overhead, since most things are handled by AWS; with k8s, much of that work falls back on your team.
- Environments - With a serverless model, it is easy to spawn as many environments as you want without paying more. With k8s, you often need to keep at least one service up and running per environment, which unfortunately means more environments cost more money.
Our story about moving to k8s
The first stage for us was deciding which framework to use next, since we were not happy enough with Quarkus. Being tired of experiments, we decided to go with Spring. The entire team was familiar with it, and it has all the advantages we could possibly expect from a framework: a huge community, a large selection of additional modules for every possible tool, plenty of customization options, and high confidence in the framework itself. After the framework was chosen, we started to look at how to convert our Quarkus-based code to Spring. Fortunately, it was relatively easy, since Spring has all the features we used in Quarkus, and the process mostly came down to changing annotations.
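For reference, the annotation changes we mean look roughly like this; the exact set depends on which Quarkus/CDI features a given lambda used, so treat this as a sketch rather than an exhaustive list:

```java
// Rough Quarkus/CDI -> Spring annotation mapping of the kind applied during
// the conversion. Illustrative, not exhaustive.
class AnnotationMapping {
    static String[] mapping() {
        return new String[] {
            "@ApplicationScoped -> @Service / @Component",
            "@Inject            -> constructor injection / @Autowired",
            "@ConfigProperty    -> @Value(\"${...}\")",
            "@Path + @GET       -> @RestController + @GetMapping",
        };
    }

    public static void main(String[] args) {
        for (String m : mapping()) System.out.println(m);
    }
}
```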
The second stage was consolidating logically related lambdas into services, since there is no point in keeping nano-services with this approach.
After the lambdas were merged into services and the code was migrated to Spring, the third stage began: setting up k8s and the rest of the infrastructure. First, we decided how and where the k8s cluster would be deployed; our answer was EKS, as it is a managed service with a relatively low cost. The second step of this stage was deciding on a messaging service. After some discussion, the choice was RabbitMQ on Amazon MQ. RabbitMQ was chosen because the team is familiar with it, it fits our requirement of a transactional message broker, it supports sync messaging without any hassle, and it has great integration with Spring. Amazon MQ was chosen on the same grounds as EKS – it is managed, so we do not need to run the cluster ourselves.
The fourth stage was coming up with a solution for optimized scaling. Since some of the engines can be scaled more efficiently on metrics other than CPU load, we needed a tool that can boot up pods based on queue metrics. For this job we selected KEDA, because of its simple declarative configs and reasonable customization.
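A KEDA config of the kind we mean looks roughly like this; the deployment and queue names are invented, and the RabbitMQ connection string is assumed to come from an environment variable on the target workload:

```yaml
# Illustrative KEDA ScaledObject: scale a worker on RabbitMQ queue depth
# instead of CPU. All names here are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: engine-worker-scaler
spec:
  scaleTargetRef:
    name: engine-worker          # the Deployment to scale
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: engine-tasks
        mode: QueueLength        # scale on the number of ready messages
        value: "20"              # target messages per replica
        hostFromEnv: RABBITMQ_URL
```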
After everything was created and configured, we ran performance and regression tests, which showed that the new cluster performs as expected, without any significant decrease in performance.
The last stage was creating CI/CD pipelines for the new services, and that one is still in progress :)