In recent years, I have encountered many problems in IT companies caused by incorrect software architecture. What do I mean by incorrect? In most cases, it errs in one of two directions: it is either too trivial or far too complicated for the problem it is supposed to solve. Both cases lead to performance problems and stop the organization from being agile.
What do I mean by software architecture?
Before I start talking about the problems caused by incorrect architecture, I would like to look at the concept itself. For me, it is everything that goes into producing software – from the strategic analysis of a domain and its requirements to translating them into a technical solution:
- Understanding the business domain and its capabilities
- Splitting it into subdomains and defining their boundaries
- Decisions on the infrastructure level – how, where and what do I want to achieve here. Cloud or on-premise? Do I need containers? K8s? API Gateway? Load Balancer? Cache? What type of databases? How do I want to approach monitoring, logging, and telemetry? Do I need Dev and Stage environments or only Production?
- Selection of a proper deployment strategy – will it be a single deployment unit (modular monolith) or multiple deployment units (microservices)? If we decide on microservices, what are the reasons (disintegrators) that drive us to this decision?
- Definition of architecture for modules/microservices based on strategic analysis of a domain – do I need an active record? Transaction script? Domain model? Should I apply Event Sourcing there? CQRS?
- Definition of test strategy – which tests do I want to perform? When? How often?
- Definition of release strategy – do I want to continuously deploy my application?
If you go through this list point by point, you’ll notice that there are plenty of other topics in each of them, which I’ll return to in a paragraph describing potential solutions to help you build better applications.
Too trivial architecture is one of the two major problems. If it is caught early enough, the impact will not be as great as in the case of overengineered architecture. The trouble is that it is most often noticed only when the application has extreme performance issues and adding new functionality requires very large financial outlays. The cost of maintenance becomes high as well. In such cases, the most common solutions companies opt for are rewriting the application from scratch in another technology (a very bad idea) or bringing in a group of consultants, often experts in the field (a very expensive idea).
How can architecture be too trivial? Imagine that you joined a clothing startup as CTO. This is a great challenge in your career. You do not want to let anyone down, so you do your best to deliver the MVP on time. There is one problem – you have to deliver the first results in 3 months. You do not analyze the business domain (not a good decision), and you do not do any calculations (how many users you will have, how much data you will store, or whether the system should scale; at that point in time, an acceptable shortcut). You do not have time to learn new things, so you choose from what you know (a good decision) and build just for the few customers that will come in the beginning (also a good decision). Why focus on tests? We will add them later (when will that time come?).
At that point in time, all your design decisions are acceptable – after all, you need to test your product as soon as possible on the market. On one condition – in case of success and more interest, you start iteratively refactoring the application. And this condition is most often not fulfilled. Why?
Your first customers want more functionality. You have more and more work and fewer and fewer resources. It takes ages to hire and onboard more people, and you do not have time for it. You just add more features. And the hell associated with overly simple architecture begins.
In the MVP version, you had a table of Products. It had 5 properties (the boundaries are already mixed up here, but at that point only in theory):
- Price (oops)
- Amount (oops for the 2nd time)
Then new requirements arrive: you have to add type, material, and size, so the products table grows. Size can have different values, from S (small) to XXL (extra extra large). At some point, you will have a table with 80 different properties, some of which are used and some not, depending on the product type.
After several months, your company decides to sell shoes as well. However, shoe sizes are totally different from, e.g., T-shirt sizes. In the EU it is 36, 42, or 46, and in the US 9, 10.5, or 12. So you add special rules that apply only to shoes, only to coats, or only to something else. As you have just one model (Product), it gets harder and harder to maintain. Your codebase grows as your team does, and everything gets out of control. Instead of focusing on how to divide your code into modules (to make its parts independent of each other), you grow your application into a big ball of mud (non-modular monolith).
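One way to picture the alternative to a single, ever-growing Product model is to give each product category its own small model that carries its own sizing rules. The sketch below is only an illustration under assumptions: the class names (`TShirt`, `Shoe`, `ApparelSize`, `ShoeSizeSystem`), the fields, and the validation rule are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass
from enum import Enum


class ApparelSize(Enum):
    """Letter sizes used for T-shirts and similar clothing."""
    S = "S"
    M = "M"
    L = "L"
    XL = "XL"
    XXL = "XXL"


class ShoeSizeSystem(Enum):
    """Regional numbering systems for shoe sizes."""
    EU = "EU"
    US = "US"


@dataclass(frozen=True)
class TShirt:
    # Hypothetical fields: only what a T-shirt needs.
    name: str
    material: str
    size: ApparelSize


@dataclass(frozen=True)
class Shoe:
    # Hypothetical fields: only what a shoe needs.
    name: str
    size_system: ShoeSizeSystem
    size: float  # e.g. 42 in the EU system or 10.5 in the US system

    def __post_init__(self) -> None:
        # A shoe-specific rule lives with the shoe model,
        # not in a shared Product table with 80 columns.
        if self.size <= 0:
            raise ValueError("shoe size must be positive")


tee = TShirt(name="Basic tee", material="cotton", size=ApparelSize.XXL)
sneaker = Shoe(name="Runner", size_system=ShoeSizeSystem.US, size=10.5)
```

With this split, a new rule for shoes touches only `Shoe`, which is the kind of independence between modules the paragraph above is asking for.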
At some point your customers start to see 2 main issues:
- The application gets slower (up to the extreme)
- Changes, fixes, and new features are released too rarely
You want to react but it is already too late.
It is not possible to make the application faster because you have huge tables with a lot of complex data inside. Of course you fight: you add replica databases to optimize reads, you try to shard your data to optimize writes, maybe you add a cache. All in all, you complicate things more and more without thinking about proper modularization. You also need to scale the entire application as a single deployment unit even if the majority of it does not need to scale, and that takes time and resources.
Every change in one place affects another place, so whenever you add a new feature or try to fix existing functionality, a lot of bugs appear elsewhere. As you do not have tests (or have inadequate ones), you cannot release fast – someone (or a group of people) has to do the regression testing manually each time.
Your application ceases to be used, which leads to financial losses, which in turn leads to the collapse of the company.
Too complex architecture is the second major problem, and it is far more difficult to reverse than architecture that is too simple. By the time it is noticed, it is usually too late to make changes, because replacing even one component causes gigantic costs. On top of that, if we depend on other systems and our application is not deployed correctly, these problems will only pile up. In such cases, the only way out is to replace elements of the system step by step (help from external consultants will probably be required here as well).
How can architecture be too complex? Imagine that you joined a corporation as a software architect. You were hired because nobody there has experience in designing web systems. In the last few years, you have heard about k8s, containers, microservices, Redis, NoSQL, Kafka, Domain-Driven Design, and many more. And now you and your team of other architects and developers hold the decision-making power. What do you do?
If you focus on strategic domain analysis, then great. The first step towards a good system is taken. You meet with different people – future customers, stakeholders, operations, and so on. You run workshops where you collect and analyze business requirements. It has been a tough few weeks, but in the end you got there, and you want to start building the system as soon as possible. So, together with your team, you decide:
- Go with microservices
- Containerize the application
- Run it in a k8s cluster
- Add Redis
- Add Kafka
- Use NoSQL
- Apply DDD in each microservice
Unfortunately, you skipped one of the most important steps – you did not calculate how much traffic you will actually generate. Let’s stop here and focus on some numbers. Imagine that your application has a forecast of 10 million users and 1 million DAU (daily active users), and each user must upload a profile picture (max. 1 MB). Additionally, we assume that every user will make 20 requests daily. We cover only the DACH region (Germany, Austria, and Switzerland). A lot, hm? In fact, it is not a lot!
- 1 000 000 daily active users * 20 requests/day = 20 000 000 requests/day. Assume that a day is 16 hours in this time zone, so 57 600 seconds: 20 000 000 / 57 600 seconds ≈ 347 requests per second. At peak, we can multiply it by e.g. 3, so around 1000. Simple infrastructure is capable of handling such a number of requests
- 10 000 000 users * 1 MB = 10 TB for profile pictures (an upper bound, since 1 MB is the maximum size)
And all of the above is just an optimistic forecast.
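The same back-of-the-envelope estimate can be written as a few lines of Python, which makes it easy to rerun with different forecasts. This is a minimal sketch: the function names are made up for illustration, the inputs are the forecast numbers from the example (1 million DAU, 20 requests per user per day, a 16-hour day, a 3x peak factor, 1 MB per picture), and decimal units are assumed (1 TB = 1 000 000 MB).

```python
def requests_per_second(daily_active_users: int,
                        requests_per_user_per_day: int,
                        active_hours_per_day: int) -> float:
    """Average request rate, spread evenly over the active part of the day."""
    seconds = active_hours_per_day * 3600
    return daily_active_users * requests_per_user_per_day / seconds


def storage_terabytes(total_users: int, megabytes_per_user: float) -> float:
    """Upper bound on storage, assuming decimal units (1 TB = 1 000 000 MB)."""
    return total_users * megabytes_per_user / 1_000_000


PEAK_FACTOR = 3  # assumption from the example; some people use 5x or more

average_rps = requests_per_second(1_000_000, 20, 16)
peak_rps = average_rps * PEAK_FACTOR
picture_storage_tb = storage_terabytes(10_000_000, 1)

print(f"average: {average_rps:.0f} req/s")
print(f"peak:    {peak_rps:.0f} req/s")
print(f"storage: {picture_storage_tb:.0f} TB")
```

Changing one input (for example, 10 000 DAU instead of 1 000 000) shows immediately how far the real load can be from the forecast.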
You spent a lot of time trying to prepare the greatest application possible, one where every single microservice can scale and be deployed to a k8s pod, and the data is optimized for reads (Redis) and writes (NoSQL). There are microservices with almost no business logic, yet DDD is applied to them. You use Kafka, but you do not stream events, etc. In the end, instead of 1 000 000 DAU you have 10 000.
There are now a couple of problems:
- When something does not work as expected, you have a lot of places to check
- Everything was added from the start – not based on actual needs but on the wish to use these technologies and on premature optimization. In fact, you do not know whether you need it or not, so in many cases the system is too complex to maintain and the team's lack of knowledge is exposed
- Anyone who joins your team(s) has a very high entry threshold
- Infrastructure costs could be higher by e.g. 70% compared to another possible solution (but now it is too late to change it)
What you have achieved is certainly a nice modularization of the system – if the split into microservices was done correctly. Unfortunately, you introduced extreme complexity when it was not needed. Sooner or later, this again leads to the 2 problems described at the beginning:
- Performance – your application starts to randomly fail (due to errors in different components about which you do not have full knowledge), and your team performs slower because of the complexity
- Changes, fixes, and new features are released too rarely because you have to do parts of deployments and tests manually
Again, similar results to “too trivial” architecture occur – customers get angry and in the end, stop using your product.
What can we do?
There is no single, perfect recipe for making everything work as it should. However, there are some steps you can take to minimize the risk of the problems described:
- Perform strategic analysis of a business domain – organize workshops, collect and analyze business requirements (use one or a combination of the following methods – Event Storming, Domain Storytelling, Impact Mapping, or Story Storming)
- Based on the above, divide your business domain into multiple subdomains. Decide whether each one is core (complex business logic, something that makes your company unique in the market), supporting (less complex, in-house or outsourced, nothing really unique), or generic (you can buy it as it exists in the market and fits your needs). Then decide where to apply the transaction script, active record, or domain model
- Based on business analysis and questions raised, calculate the traffic that your application will have to handle – the number of read and write operations, database transactions, the amount of data to be stored, etc. Include peak calculations – I prefer 3x the standard estimate, some people use 5x or more
- Based on the above calculations, prepare the simplest infrastructure architecture that can handle it – no worries, if your application goes viral, then you will have a chance to use a lot of different frameworks, libraries, or components that are popular. Just start simple (KISS)
- Decide if you want to go with a single deployment unit (e.g. modular monolith) or multiple deployment units (e.g. microservices). If you decide on the second option, please find a reason – is there any area that is less fault tolerant than others? Is there an area that requires increased security? Do you think there is an area that will change more frequently? Is there any area that will be used heavily from day one? Do you have multiple teams in different time zones?
- Document your decisions in an architecture decision log – it should live as close to the code as possible, e.g. in the form of .adoc or .md files inside the repository
- Write tests for your architecture, e.g. using https://github.com/BenMorris/NetArchTest for .NET
- Divide your work into vertical slices – try to deliver a feature or usable part of it instead of first preparing the database, then the application layer, and then API
- Follow trunk-based development and pair/mob programming – you will avoid the need for PRs as the review will be handled during coding by the second person. If you don’t think that you are ready for this, use short-lived branches – e.g. each branch has to be closed within 24 hours from the start. This way you have small PRs, frequent changes and easy merge/rebase
- Give TDD (Test Driven Development) a try; it is a great way to deliver software
- Automate as much as you can – in the best-case scenario there would be no manual step (continuous deployment). This means you will need a good test strategy and tests that run automatically each time
- Thanks to the above, you will be able to release multiple times per day. If you cannot achieve this in production, don’t worry: always be ready to do so (continuous delivery), and in such a case do continuous deployment on the Stage environment. If this is not possible due to some circumstances, try to release every week or every sprint
- Release a new version only to part of your users, then incrementally raise the number until you reach 100%. You can look at my post related to canary releases
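The last step, releasing to a growing share of users, can be sketched with deterministic bucketing. This is one common approach and not necessarily the one from the canary releases post: each user ID is hashed into a bucket from 0 to 99, and a user sees the new version when the bucket is below the current rollout percentage. The function names and the `user-N` ID format are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def sees_new_version(user_id: str, rollout_percent: int) -> bool:
    """True if this user is included at the given rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(user_id) < rollout_percent


# Hypothetical user IDs, used only to demonstrate the behaviour.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if sees_new_version(u, 10)}
at_50 = {u for u in users if sees_new_version(u, 50)}
print(len(at_10), len(at_50))
```

Because the bucket depends only on the user ID, raising the percentage only adds users: everyone who saw the new version at 10% still sees it at 50%.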
Extendable and maintainable software architecture is extremely important because it has a significant impact on the performance and agility of a software system. Properly designed architecture can make a system more scalable, maintainable, and easier to modify, which can save time and resources in the long run.
And what about you? What are your thoughts about this topic?