This is a very interesting and useful pattern in practice. I haven't seen any of the classic pattern books talk about it, but I think it deserves to be a classic caching pattern. To be clear, it is an old pattern; I did not invent it and I did not discover it. I designed a solution for our problem and then realized it should be a pattern. I want to write it down because I didn't know it was a pattern before designing the solution, and because it is useful in the real world. More importantly, the solution was not obvious to me before I did the work, so it is worth writing down.
The Problem
- You have a data source that is not scalable but holds relatively stable data (e.g. config or metadata). The data does change, though.
- You have many servers (front ends and back ends) that access that data. Since the data source is not scalable, you need caching.
- You want strong consistency between the servers, i.e. when one server sees a new version of the data, all other servers need to see that version or a newer one. Simply put, it is shared caching.
Old Design – one-tier caching
First, you may ask why I didn't use ZooKeeper or an equivalent service. Yes, that would be ideal if it were available. Sometimes it is simply not available in your project, but you still need to solve this problem.
Take a look at the old design below. Azure Table was used for the shared caching. It is very scalable and actually faster than many people think (tens of milliseconds per access).
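The old design boils down to: the shared Azure Table is the only cache tier, so every server pays a round trip and fetches the full data on each access. A minimal sketch, assuming a hypothetical `shared_store` client in place of the real Azure Table SDK:

```python
class OneTierCache:
    """Sketch of the old design: Azure Table is the single shared cache tier.

    `shared_store` is a hypothetical stand-in for the Azure Table client;
    it only needs a read_all() method returning the cached data.
    """

    def __init__(self, shared_store):
        self.store = shared_store

    def get(self):
        # No local copy: each call pays one round trip (tens of ms)
        # and always fetches the full data, which is why this stops
        # being optimal as the data grows.
        return self.store.read_all()
```

Consistency is trivial here because there is only one copy of the cached data, but every read costs a network hop and a full fetch.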
Why we need a new design
- The config/metadata grew, and one column/property no longer fits in Azure Table. The size limit for one row/entity is 1 MB, and the size limit for one column/property is 64 KB. The data is now over 64 KB.
- As the data grows, always fetching the full data is not optimal.
- This is not a main reason for the design change, and it is not a major issue either, but the servers have clock skew. Sometimes it is not ideal, and it is hard to reason about or explain to others. Take an extreme case as an example: with significant clock skew, a server might take much longer than intended to refresh the data.
New Design – two-tier caching
I call it two-tier caching since it now has two tiers – in-memory and Azure Blob. The changes are as follows:
- An in-memory cache is introduced.
- Azure Blob is used (in this case it performs as well as Azure Table).
- When a server reads the data from memory, it first compares the version in the cache to the version in the blob. If they are the same, it skips reading the data from the blob; otherwise it refreshes the cache.
- A local refresh time is introduced so clock skew no longer matters: every few minutes one server refreshes the blob, and all other servers get the data from the blob and reset their local refresh time.
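The read path above can be sketched as follows. This is a minimal illustration, not the original implementation: `blob` is a hypothetical client exposing `get_version()` (a cheap metadata probe) and `read_all()` (the full payload plus its version), and the refresh timer uses only the local monotonic clock, which is why clock skew between servers stops mattering:

```python
import time

class TwoTierCache:
    """Sketch of the two-tier design: tier 1 in memory, tier 2 a shared blob.

    All names here are hypothetical. `blob` stands in for the Azure Blob
    client; `refresh_interval_s` is the local refresh timer.
    """

    def __init__(self, blob, refresh_interval_s=300.0):
        self.blob = blob
        self.refresh_interval_s = refresh_interval_s
        self._data = None
        self._version = None
        self._next_refresh = 0.0  # local monotonic clock only

    def get(self):
        # Serve from memory until the local refresh time elapses.
        if time.monotonic() >= self._next_refresh:
            self._refresh()
        return self._data

    def _refresh(self):
        # Cheap version probe first; only pay for the full blob read
        # when the cached version is behind the blob's version.
        blob_version = self.blob.get_version()
        if blob_version != self._version:
            self._data, self._version = self.blob.read_all()
        # Reset the local timer whether or not the data changed.
        self._next_refresh = time.monotonic() + self.refresh_interval_s
```

A separate periodic job (run by one of the servers) would write fresh data from the slow data source into the blob, bumping its version; every other server then picks it up on its next local refresh via the version compare above.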
The new design solved our problem very well, and the code for this pattern is not long. I left out some technical details on purpose so readers can think about them. Please try to answer the questions below.
- What do you use for the version number? How do you update it? Should you keep the version in a NoSQL database or with the blob?
- What about concurrent reads and writes on that single blob? We have hundreds of servers. What is the limit?
- What is the right exception handling? The data source, the blob, or the servers might be down.
- How do you test your code?
- What tracing do you need for live-site troubleshooting?
- Are you confident doing this in a hotfix, or would you rather do it in a major release?