In System Design interviews it's common for the candidate to add a cache to a system for performance reasons, these days usually in the form of a Redis. While the sentiment is understandable, with the goal of making the best system they can, candidates often fail to address the complexity that the cache has added to their system. And real-life often mirrors the interview. When there is a performance problem it often feels easier to add a caching layer. Below are some things one should consider when adding a cache.
In the interview situation, often the cache is not actually needed to solve the problem because the interview question does not have performance requirements, usually they are about getting the system right rather than fast. And in real-life, it can be no different.
Some things one can do before adding a cache:
The best kind of data to cache is immutable. The second best is data that times out. Much harder is data that must be invalidated when it is updated in the primary store. It's important to determine what kind of data is being cached before implementing it. If the data is immutable, or times out, cache invalidation is simple because the cache data itself contains the necessary knowledge for if it needs to be recalculated.
Having to invalidate a cache when data is updated can increase the complexity of a system, especially if it is distributed. In order to work properly, the cache needs to be invalidated and the primary data store needs to be updated. As distributed transactions don't work well in practice, these actions will not happen atomically, and since failure can happen at any time, eventually the system will fail between the two operations.
Similarly to stale data, in a highly concurrent system, the caching system can be a cause of performance degradation or outage. Consider a value that is very expensive to calculate and the value is currently not cached, under a heavy load it is possible multiple API users might request the same value at the same time. If the system calculates the value for every request, the entire system might be put under heavy load. This is even harder to do in a distributed system where one process calculating a value does not necessarily know if another one is. There are ways to address this, such as using a lock service, but they can be heavy and increase the operational complexity of a system. One needs to consider what happens when the cache is emptied.
Since the cache is a separate system than the primary store, and failure can happen at any point in a system, it's important that the order values are updated will ensure the correct value is cached no matter what. For example, updating the cache before the primary store will mean that eventually the cache will contain value that never make it to the primary store, and will eventually be lost forever. Or consider a cached value that is updated while it is being recomputed and is request again and the second request takes much less time to compute than the first request, in this case the final value to be written could be an earlier value, because it took longer to compute. The system needs some way to tell if the value being written uses input data older than the what is there.
Adding a cache to a can come with a significant increase in design and operational complexity. Before adding a cache, and the complexity, it is worth taking some time to be sure that the cache is needed and is the best solution to the problem. The best data to cache is data that never changes. On the other end of the spectrum, cached data that must reflect the current value in the primary store is the hardest to implement. A cache can also add new failure modes to a system that must be considered. In the worst case, they can cascading errors that make bringing the system back from failure very difficult.