How to think about Stateless vs Stateful Backends

Stateful services store the state of the system (user sessions, for example) on the server itself, whereas stateless backends do not. Stateless backends keep state in a common storage layer instead, which makes them horizontally scalable.

Why stateful backends are avoided

If you scale using the stateful approach, the first and most important roadblock you'll hit is making sure user data is accessible to all the servers.

Since stateful servers store data (like user sessions) within the instance, implementing a disaster recovery mechanism is hard. Once a server recovers from a crash, you need to give it back the data it lost in the crash.

If a user's request was processed by server #1 in a pool of 5 servers, you have to take extreme care that this user's requests go to server #1 every single time. Why? Because that user doesn't exist for the other servers!

But how bad can the user experience get? Let's take a scenario:

A user is trying to log in using an OTP (one-time password) sent to their email ID.
1. They enter their email ID.
2. They land on the OTP page, where they are supposed to enter the OTP they received by email.
3. They enter the OTP, but the login fails.

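What could've happened here? Step 2 and step 3 were not handled by the same server! The server that issued the OTP remembers it, but the server that received the OTP has never heard of this user.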
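The failed login above is easy to reproduce in a few lines. This is only a minimal sketch: the `StatefulServer` class, its method names, and the email address are made up for illustration, and the OTP is returned directly instead of being emailed.

```python
import secrets


class StatefulServer:
    """A server that keeps pending OTPs in its own process memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        # email -> OTP, visible only to this instance
        self.pending_otps: dict[str, str] = {}

    def request_otp(self, email: str) -> str:
        otp = f"{secrets.randbelow(10**6):06d}"
        self.pending_otps[email] = otp
        return otp  # a real system would email this instead of returning it

    def verify_otp(self, email: str, otp: str) -> bool:
        return self.pending_otps.get(email) == otp


server_1 = StatefulServer("server-1")
server_2 = StatefulServer("server-2")

# The OTP is issued by server-1.
otp = server_1.request_otp("user@example.com")

# Verified on the same server: the OTP is found.
same_server_ok = server_1.verify_otp("user@example.com", otp)

# Verified on another server: this user does not exist there.
other_server_ok = server_2.verify_otp("user@example.com", otp)

print(same_server_ok, other_server_ok)  # True False
```

The OTP is correct both times. The only thing that changed is which server the request landed on.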

Now imagine the server holding the user's data drops dead. In the worst case, that's a lost customer.

Workarounds?

Working with stateful backends invites a lot of complexity. How do you distribute users across your fleet of servers? You could partition them alphabetically, but that cannot guarantee a uniform load on all servers. Neither does partitioning by the user's location.
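To see why alphabetical partitioning gives uneven load, here is a small sketch. The letter ranges, server names, and usernames are all invented for the example; real user bases skew in their own ways, but the effect is the same.

```python
from collections import Counter


def server_for(username: str) -> str:
    """Route by first letter: a-i -> server-1, j-r -> server-2, s-z -> server-3."""
    first = username[0].lower()
    if first <= "i":
        return "server-1"
    if first <= "r":
        return "server-2"
    return "server-3"


# Hypothetical sample of users; names are not spread evenly across the alphabet.
usernames = [
    "alice", "aaron", "bob", "carol", "dave",
    "erin", "frank", "grace", "mallory", "trent",
]

load = Counter(server_for(name) for name in usernames)
print(load)  # Counter({'server-1': 8, 'server-2': 1, 'server-3': 1})
```

Each server covers a similar number of letters, yet one of them ends up with most of the users.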

Moreover, you need to make sure all the information related to a particular user stays on the same server, and that the user's requests keep going to that server. This is commonly known as session affinity, or sticky sessions.
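One way to picture session affinity is a router that derives the server from the user's ID. The sketch below assumes a hash-based scheme with a hypothetical `sticky_server` function and made-up server names; load balancers often achieve the same effect with cookies or the client IP instead.

```python
import hashlib

SERVERS = ["server-1", "server-2", "server-3", "server-4", "server-5"]


def sticky_server(user_id: str, servers: list[str]) -> str:
    """Map a user to one server, deterministically, by hashing the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# The same user lands on the same server on every request.
first_request = sticky_server("user-42", SERVERS)
second_request = sticky_server("user-42", SERVERS)
print(first_request == second_request)  # True
```

Note that the mapping depends on the size of the pool. If a server is added or removed, `len(servers)` changes and many users can be routed to a server that holds none of their data, which is exactly the fragility described above.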

This makes stateless backends much more favorable, and that's why they are so common. But is it all rainbows and sunshine? Absolutely not.

Stateless backends take the performance hit

Going stateless makes scaling a lot simpler. Servers do not hold any user data, so you can replace a crashed server or add new ones without worrying about replicating or backing up user data. It also makes containerization practical, since containers are meant to be created and destroyed freely.
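Here is the same kind of OTP login, sketched with stateless servers. `SharedStore` is a plain in-memory stand-in for a real storage layer such as a database or Redis, and every class and method name here is an assumption made for the example.

```python
import secrets
from typing import Optional


class SharedStore:
    """In-memory stand-in for an external store shared by all servers."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class StatelessServer:
    """A server that keeps no user data of its own."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store

    def request_otp(self, email: str) -> str:
        otp = f"{secrets.randbelow(10**6):06d}"
        self.store.set(f"otp:{email}", otp)
        return otp  # a real system would email this instead of returning it

    def verify_otp(self, email: str, otp: str) -> bool:
        return self.store.get(f"otp:{email}") == otp


store = SharedStore()
server_1 = StatelessServer("server-1", store)

otp = server_1.request_otp("user@example.com")

# server-1 "crashes" and a brand-new server joins the pool.
del server_1
server_6 = StatelessServer("server-6", store)

login_ok = server_6.verify_otp("user@example.com", otp)
print(login_ok)  # True
```

The new server needed no warm-up, no data migration, and no routing rule. All it needed was access to the shared store.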

However, this has its own downsides. Since we fetch user data from a common storage layer (a database or a Redis cache, for example), access is slower than reading from the server's own memory, because every call goes over the network. A TCP connection has its own overhead. Add the SSL/TLS handshake and DNS resolution on top of that, multiplied by all the queries to the storage layer, and you are now looking at noticeable latency.
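A rough back-of-the-envelope model of that overhead, assuming the worst case where every query pays for DNS resolution, a TCP connection, and a TLS handshake. The millisecond values below are placeholders chosen for easy arithmetic, not measurements.

```python
def storage_overhead_ms(
    queries: int, dns_ms: int, tcp_ms: int, tls_ms: int, query_ms: int
) -> int:
    """Total time spent talking to the storage layer for one user request."""
    per_query = dns_ms + tcp_ms + tls_ms + query_ms
    return queries * per_query


# Placeholder costs: 5 storage queries per user request.
total = storage_overhead_ms(queries=5, dns_ms=1, tcp_ms=1, tls_ms=2, query_ms=1)
print(total)  # 25
```

With these made-up numbers, only 5 of the 25 ms are spent on the queries themselves. The rest is the cost of reaching the storage layer, which a stateful server reading from local memory would not pay.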

How to choose

Stateless backends are preferred because:

They are easily scalable.
You don't need complex calculations to make sure servers are handling uniform load.
Containerization is possible.
Horizontal scaling is a breeze as the storage layer is common and accessible by all the servers.
You don't need sticky sessions.
They are fault-tolerant, as no single server is a point of failure for a user's data.

What next?

IBM's guide on managing states
For visual learners, here's a good YT video on the topic

Rishabh's Blog