A design pattern for caching – two-tier caching with strong consistency between servers

This is a very interesting and useful pattern in practice. I haven’t seen it described in any of the classic pattern books, but I think it deserves to be a classic caching pattern. To be clear, I don’t claim to have invented or discovered it; it must be an old pattern. I designed this solution for our problem and only afterwards realized that it should be a pattern. I am writing it down because I didn’t know the pattern before designing the solution, because it is useful in the real world, and, more importantly, because the solution was not obvious to me before I did the work.

Problem

  • You have a data source that is not scalable. Its data is relatively stable (e.g. config or metadata), but it does change.
  • You have many servers (front ends and back ends) that access that data. Since the data source is not scalable, you need caching.
  • You want strong consistency between the servers, i.e. when one server sees a new version of the data, all other servers must see that version or a newer one. Simply put, it is shared caching.

Old Design – one-tier caching

First, you may ask why I didn’t use ZooKeeper or an equivalent service. Yes, that would be ideal if it were available. Sometimes it is simply not available in your project, yet you still need to solve this problem.

Take a look at the old design below. Azure Table storage was used for the shared caching. It is very scalable and actually faster than many people think (tens of milliseconds per access).

(Figure: caching_1 – the old one-tier design)
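To make the clock-skew issue below concrete, here is a minimal sketch of how I would summarize the old design. The table wrapper, the TTL, and fetch_from_source are hypothetical names for illustration, not our actual code: the data lives in a shared Azure Table row together with a wall-clock refresh timestamp, and whichever server notices the timestamp is stale refreshes the row from the data source.

```python
import time

REFRESH_TTL_SECONDS = 300  # hypothetical refresh interval


def read_config(table):
    """One-tier caching: every read goes to the shared Azure Table row.

    `table` is a hypothetical wrapper exposing get_row()/upsert_row();
    fetch_from_source() is the hypothetical slow data source.
    """
    row = table.get_row("config")
    # The staleness check compares the stored wall-clock timestamp to the
    # *local* clock, so clock skew between servers skews the refresh cadence.
    if time.time() - row["last_refreshed"] > REFRESH_TTL_SECONDS:
        data = fetch_from_source()
        table.upsert_row("config", {"data": data, "last_refreshed": time.time()})
        return data
    return row["data"]
```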

Why we need a new design

  • The config/metadata has grown, and it no longer fits in a single Azure Table column/property. The size limit for one row/entity is 1 MB, and the limit for one column/property is 64 KB; the data is now over 64 KB.
  • As the data grows, always fetching the full data is not optimal.
  • This is neither the main reason for the design change nor a major issue, but the servers have clock skew. It is not ideal, and it is hard to reason about or explain to others. As an extreme example, suppose a server’s clock runs behind: it might take much longer than intended for it to refresh the data.

New Design – two-tier caching

I call it two-tier caching since there are two tiers now – in memory and Azure Blob. The changes are as below:

  • An in-memory cache is introduced.
  • Azure Blob storage is used instead of Azure Table (in this case it performs just as well).
  • When a server reads the data, it first compares the version in its in-memory cache to the version in the blob. If they are the same, it skips downloading the blob; otherwise it refreshes the cache.
  • A local refresh time is introduced, so clock skew no longer matters: every few minutes one server refreshes the blob from the data source, and all other servers get the data from the blob and reset their local refresh time (see the sketch after the figure).

(Figure: caching_2 – the new two-tier design)
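Here is a minimal sketch of the read path, assuming the azure-storage-blob v12 Python SDK; the container and blob names, the refresh interval, and the class shape are all hypothetical. I use the blob’s ETag as the version here, which is an assumption on my part (and one possible answer to the first question below): get_blob_properties() is a cheap HEAD request that is enough to compare versions, and the full download happens only when the version has changed. The local refresh time uses a monotonic clock, so wall-clock skew between servers cannot affect the cadence.

```python
import threading
import time

from azure.storage.blob import BlobClient

REFRESH_INTERVAL_SECONDS = 300  # hypothetical: "every few minutes"


class TwoTierCache:
    """In-memory tier in front of a shared Azure Blob tier."""

    def __init__(self, conn_str: str):
        # Container and blob names are hypothetical.
        self._blob = BlobClient.from_connection_string(
            conn_str, container_name="shared-cache", blob_name="config")
        self._lock = threading.Lock()
        self._data = None   # in-memory tier
        self._etag = None   # version of the cached copy
        self._next_refresh = time.monotonic()  # immune to wall-clock skew

    def get(self) -> bytes:
        with self._lock:
            if time.monotonic() >= self._next_refresh:
                self._refresh_if_changed()
                self._next_refresh = (
                    time.monotonic() + REFRESH_INTERVAL_SECONDS)
            return self._data

    def _refresh_if_changed(self) -> None:
        # Cheap HEAD request: fetch only the properties, not the data.
        props = self._blob.get_blob_properties()
        if props.etag != self._etag:
            # Version changed (or first read): download the new data.
            self._data = self._blob.download_blob().readall()
            self._etag = props.etag
```

The writer side (the server that refreshes the blob from the data source) and all error handling are deliberately left out of the sketch; they are part of the questions below.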

The new design solved our problem very well, and as the sketch above suggests, the code for this pattern is not long. I left out some technical details on purpose so readers can think about them. Try to answer the questions below:

  • What do you use for the version number? How do you update it? Should you keep the version in a NoSQL database or with the blob itself?
  • What about concurrent reads and writes on that single blob? We have hundreds of servers. What is the limit?
  • What is the right exception handling? The data source, the blob, or the servers might be down.
  • How to test your code?
  • What tracing do you need for live site troubleshooting?
  • Would you be confident doing this in a hotfix, or would you rather do it in a major release?
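As one possible, hedged answer to the first two questions (again assuming the azure-storage-blob v12 SDK): the blob’s ETag can serve as the version number, which keeps the version with the blob itself rather than in a separate NoSQL database, and a conditional If-Match upload gives optimistic concurrency, so of many servers refreshing at once exactly one wins.

```python
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.storage.blob import BlobClient


def refresh_blob_from_source(blob: BlobClient, new_data: bytes) -> bool:
    """Optimistic-concurrency refresh; returns True if this server won."""
    etag = blob.get_blob_properties().etag
    try:
        blob.upload_blob(
            new_data,
            overwrite=True,
            etag=etag,                                  # If-Match: <etag>
            match_condition=MatchConditions.IfNotModified,
        )
        return True
    except ResourceModifiedError:
        # Another server refreshed first; its newer version stands.
        return False
```

A server that loses the race can simply back off: by definition a newer version is already in place, so strong consistency is preserved either way.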

 


6 thoughts on “A design pattern for caching – two-tier caching with strong consistency between servers”

  1. I agree that this is an undocumented design pattern — I have been using in-memory caches on our clients for our large amounts of static and semi-static data so that there isn’t constant network chatter to the server, which has its own in-memory cache to reduce hits against the database.
    We use server-time synchronisation timestamps to keep the two sets of caches in sync.

    Nice article; it’s about time this one was named and documented.

      • Yes, clock skew can be a problem (especially when you’re not in charge of the customers’ systems).
        We have a general rule that the back-end server is the canonical time source.

        When the back-end refreshes its cache it takes a timestamp. When the client retrieves its cache from the server, the client takes a timestamp. From this point onwards, whenever the client accesses its cache, it compares its last load timestamp to the server’s last load timestamp to make sure it’s not out of sync. If it is out of sync, the client reloads the cache from the server and takes a new timestamp.
        Because we use the server as the canonical time source, clock skew doesn’t interfere with the synchronisation.
