Do You Really Need That Cache?

In System Design interviews it's common for the candidate to add a cache to a system for performance reasons, these days usually in the form of Redis. The sentiment is understandable, since the goal is to make the best system they can, but candidates often fail to address the complexity that the cache adds to their system. Real life often mirrors the interview: when there is a performance problem, it often feels easier to add a caching layer. Below are some things one should consider when adding a cache.

Is It Really Needed?

In the interview situation, the cache is often not needed to solve the problem because the interview question has no performance requirements; such questions are usually about getting the system right rather than fast. Real life can be no different.

Some things one can do before adding a cache:

Does the thing benefit from a cache? Being slow isn't necessarily a problem, and slowness doesn't necessarily mean that adding a cache will address it. Consider a value that is read very infrequently: caching it saves very little work.
Examine data access patterns and queries. Perhaps adding an index on a table is sufficient. Or if one is using a NoSQL database, perhaps the assumptions about how the data would be used are no longer true and how the data is written needs to change.
Where is the bottleneck? Are the servers overloaded? Is the database overloaded? Is there just not enough concurrency? Could adding more threads/processes per machine help?
Could the API be modified to work in batches of data? If a user wants to retrieve 10,000 values, do they have to perform 10,000 API calls or can they do it in one?
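
To make the batching question concrete, here is a minimal sketch that counts round trips to a store. CountingStore, get, and get_many are hypothetical names invented for this illustration, and the store is just an in-memory dictionary standing in for a real service.

```python
from typing import Dict, Iterable, Optional


class CountingStore:
    """Hypothetical in-memory stand-in for a remote store that counts round trips."""

    def __init__(self, data: Dict[str, int]) -> None:
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key: str) -> Optional[int]:
        # One round trip per key.
        self.round_trips += 1
        return self._data.get(key)

    def get_many(self, keys: Iterable[str]) -> Dict[str, int]:
        # One round trip for the whole batch; missing keys are simply omitted.
        self.round_trips += 1
        return {k: self._data[k] for k in keys if k in self._data}
```

With 10,000 keys, calling get in a loop costs 10,000 round trips while get_many costs one. If per-call overhead dominates the latency, that change alone may remove the need for a cache.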

How Stale Can The Data Be?

The best kind of data to cache is immutable. The second best is data that times out. Much harder is data that must be invalidated when it is updated in the primary store. It's important to determine what kind of data is being cached before implementing the cache. If the data is immutable, or times out, cache invalidation is simple because the cached entry itself carries everything needed to decide whether it must be recalculated.
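
As a rough sketch of data that times out, the cache below stores an expiry time next to each value, so an entry can decide on its own whether it is still usable. TTLCache and its parameters are assumptions made up for this example, and it is a single-process, non-thread-safe illustration rather than a production design.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Hypothetical in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            # The entry carries its own expiry time, so no outside
            # invalidation signal is needed.
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is explicit: readers may see a value that is up to ttl_seconds old, and in exchange nothing has to be coordinated with the primary store.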

Having to invalidate a cache when data is updated can increase the complexity of a system, especially if it is distributed. In order to work properly, the cache needs to be invalidated and the primary data store needs to be updated. As distributed transactions don't work well in practice, these actions will not happen atomically, and since failure can happen at any time, eventually the system will fail between the two operations.

How stale can the data be?
Can the data be made immutable with some extra indirection? For example, the primary store maintains a list of content-addressed data which are then looked up in a cache. The primary store's list of addresses would never be cached, as it would be cheap to read.
Can a timeout be used to invalidate the cache?
When the primary store is updated, is it sufficient to invalidate the cache's value or must its value be updated in the cache?
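
The indirection idea above might look roughly like the following. The names content_address and ContentAddressedCache, and the choice of SHA-256 as the address, are assumptions for illustration only. The point is that a given address always maps to the same bytes, so a cached entry can never become stale.

```python
import hashlib
from typing import Callable, Dict


def content_address(data: bytes) -> str:
    """Derive the address from the content itself (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class ContentAddressedCache:
    """Hypothetical cache keyed by content address; entries never need invalidation."""

    def __init__(self, load_blob: Callable[[str], bytes]) -> None:
        self._load_blob = load_blob  # reads the blob from the primary store
        self._blobs: Dict[str, bytes] = {}

    def get(self, address: str) -> bytes:
        if address not in self._blobs:
            self._blobs[address] = self._load_blob(address)
        return self._blobs[address]
```

A reader first fetches the current address from the primary store, which is the cheap and uncached step, and then looks the address up in the cache. Updating the data produces a new address, so old entries are never wrong, only unused, and can be evicted at leisure.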

Cache Stampede

As with stale data, in a highly concurrent system the cache itself can be a cause of performance degradation or outage. Consider a value that is very expensive to calculate and is not currently cached. Under heavy load, multiple API users might request the same value at the same time. If the system calculates the value for every request, the entire system might be put under heavy load. This is even harder to prevent in a distributed system, where one process calculating a value does not necessarily know whether another one is doing the same. There are ways to address this, such as using a lock service, but they can be heavy and increase the operational complexity of a system. One needs to consider what happens when the cache is emptied.

What happens when the cache value does not exist?
How likely is it that N readers will try to access the same value in the cache at the same time?
What happens if calculating cache values brings down the system? How can the system be recovered?
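
Within a single process, one way to limit a stampede is to let only one caller compute a missing value while the others wait for it. The sketch below assumes threads in one process; SingleFlightCache is a hypothetical name, and nothing here coordinates across machines, which is where a lock service or similar would come in.

```python
import threading
from typing import Any, Callable, Dict


class SingleFlightCache:
    """Hypothetical in-process cache: one computation per missing key at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # guards both dictionaries
        self._values: Dict[str, Any] = {}
        self._key_locks: Dict[str, threading.Lock] = {}

    def get(self, key: str, compute: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key gets past this point at a time.
        with key_lock:
            with self._lock:
                # Another thread may have filled the value while we waited.
                if key in self._values:
                    return self._values[key]
            value = compute()  # the expensive work, done outside the global lock
            with self._lock:
                self._values[key] = value
            return value
```

For simplicity the per-key locks are never removed and there is no expiry. If compute raises, nothing is stored and the next waiting caller tries again.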

Order Matters

Since the cache is a separate system from the primary store, and failure can happen at any point in a system, it's important that the order in which values are updated ensures the correct value is cached no matter what. For example, updating the cache before the primary store means that eventually the cache will contain a value that never makes it to the primary store and is lost forever. Or consider a cached value that is updated while it is being recomputed and is requested again, where the second request takes much less time to compute than the first. In this case the final value written could be the earlier one, because it took longer to compute. The system needs some way to tell whether the value being written uses input data older than what is already there.

What happens if the system crashes during an update?
What happens if the input to the cached values is updated multiple times while it's being computed?
When updating a value in a cache, how does the system know that the value in the cache is not the latest value?
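
One possible way to answer the last question is to tag every cached value with the version of the input it was computed from, and to refuse writes that are not newer than what is already cached. This sketch assumes the primary store can supply a monotonically increasing version per key; VersionedCache and put_if_newer are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Versioned:
    version: int  # version of the input data the value was computed from
    value: Any


class VersionedCache:
    """Hypothetical cache that never replaces an entry with an older result."""

    def __init__(self) -> None:
        self._entries: Dict[str, Versioned] = {}

    def put_if_newer(self, key: str, version: int, value: Any) -> bool:
        current = self._entries.get(key)
        if current is not None and current.version >= version:
            # A slow computation from older input finished late; drop it.
            return False
        self._entries[key] = Versioned(version, value)
        return True

    def get(self, key: str) -> Optional[Versioned]:
        return self._entries.get(key)
```

In the slow-first-request scenario, the fast second computation writes version 2, and the late write for version 1 is rejected. In a shared cache the comparison and the write would have to happen atomically on the cache side, which this in-process sketch does not attempt to show.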

Conclusion

Adding a cache to a system can come with a significant increase in design and operational complexity. Before adding a cache, and the complexity, it is worth taking some time to be sure that the cache is needed and is the best solution to the problem. The best data to cache is data that never changes. On the other end of the spectrum, cached data that must reflect the current value in the primary store is the hardest to implement. A cache can also add new failure modes to a system that must be considered. In the worst case, it can cause cascading errors that make bringing the system back from failure very difficult.