Tag Archives: distributed systems

A design pattern for caching – two tier caching with strong consistency between servers

This is a very interesting and useful pattern in practice. I haven’t seen any books for classic patterns to talk about this. But I think that it should be a classic one for caching. To make it clear I think that it is an old pattern. I do not invent it and I do not discover it. I designed the solution for our problem then I realized that it should be a pattern. I just want to write it down since I didn’t know it is a pattern before designing the solution and the pattern is useful in real world. More importantly the solution was not so obvious to me before I did the work so it is worth writing it down.


  • You have a not scalable data source which has relatively stable data (e.g. config or metadata) but the data does change.
  • You have many servers (front ends and backends) to access that data. Since the data source is not scalable you need caching.
  • You want strong consistency between the servers. i.e. when one server sees a new version of data all other servers need to see the new version or newer version(s). Simple put it is shared caching.

Old Design – one tier caching

First you may ask why I didn’t use Zookeeper or equivalent service. Yes that would be ideal if it is available. Sometimes it is just not available in your project but you need to solve this problem.

Take a look at the old design as below. Azure Table was used for the shared caching. It is very scalable and actually fast than many people thought (tens of milliseconds for each access).


Why we need a new design

  • The config/meta data grows and one column/property doesn’t fit anymore in Azure Table. The size limit for one row/entity is 1MB. The size limit for one column/property is 64kb. The data is over 64kb now
  • As the data grows always getting the full data is not optimal
  • This is not a main reason for the design change or it is also not a major issue but the servers have clock skew. Sometimes it is not ideal and it is hard to reason about it or explain to others. Take an extreme case as an example. Let’s say you have only one server and it might take much longer for the servers to refresh data in case of clock skew

New Design – two tier caching

I call it two tier caching since it has two tiers now – In Memory and Azure Blob. The changes are as below

  • In memory cache is introduced
  • Azure Blob is used (in this case it performs as good as Azure Table)
  • When a server tries to read the data from the memory it will try to compare the version in the cache to the version in the blob first. If they are the same it will skip reading the data from the blob. Otherwise it will refresh the cache
  • A local refresh time is introduced so clock skew doesn’t matter anymore. so every a few mins one server will refresh the blob and all other servers will get the data from the blob and reset the local refresh time.


The new design solved our problem very well. The code for this pattern is not much. I left out some technical details on purpose so readers can think about them. Please try to answer the below questions.

  • What do you use for the version number? How to update it? Should you keep the version in a NoSql database or with the blob?
  • How about concurrent reads and writes on that single blob? We have hundreds of servers. What is the limit?
  • What is the right exception handling? The data source, the blob or the servers might be down.
  • How to test your code?
  • What tracing do you need for live site troubleshooting?
  • Are you confident in doing it in a hotfix? or you would rather do it in a major release.


An Important Problem and an Important Algorithm – Consensus and Paxos

Consensus is such an important problem and Paxos is such an important algorithm that I have to talk about them. If you ever see a web site or blog talking about distributed fundamentals and systems but not mentioning Consensus and Paxos you had better skip that web site or blog.

Why is it important? First it is so important in real world. We live under so many laws which were passed by consensus. We have so many meetings and emails at work mainly to reach agreement on something. In computer world processes/machines need to agree on many things such as committing or aborting a transaction, choosing a leader, granting a lock, replicated state machine, atomic broadcast and so on. These things are often the foundation of distributed systems and critical to higher level applications and services.

Then you might wonder why we need another post about Consensus and Paxos since there are already so many of them. The funny thing is that the reason we need another post is exactly because there are so many. More specifically here are my reasons.

  1. It is easy to get confused since there are so many articles. Even worse some articles have wrong information.
  2. An interesting observation is that some articles which explain Paxos are more complicated than the paper itself
  3. I won’t be able to explain Paxos very well in one post but I will point out one way by which if one follows he/she should be able to understand Paxos algorithm well but the route might not an easy one to everybody.
  4. I will probably talk about the implementation of Paxos a little bit.

I also have the below suggestions.

  1. Set right expectation. To mortals including me Consensus is a hard problem and Paxos is a complicated algorithm. Some posts do give people wrong impression that it is easy then people might not take enough time and efforts to understand it or work on it.
  2. Don’t read too many articles/papers (unfortunately I already did). You just need to read the good and essential ones. I will let you know what are the good and essential ones.
  3. If you get questions you can’t figure out after trying hard, ask people around or ask on Quora (http://www.quora.com). There are quite a few folks there who understand distributed systems very well.

Consensus problem itself is easy to understand. Basically how do multiple people/parties/participants/processes/nodes/… reach agreement? Let’s look at real world examples first. You are Friend A. You and Friend B want to set a time to have dinner together. It is easy. You can just call B to confirm a time which works for both of you. However please try to answer the below questions.

  • What if B is not available when you call?
  • What if you can’t call and you can only send messages?
  • How do you know your message reaches B?
  • If you ask B to send acknowledge back how does B know you receive his/her acknowledgement?

Let’s add Friend C to the case. Then one person among A,B and C can be the coordinator to communicate with the other two and negotiate a time. This is more work but still easy in real life. But I can ask the same questions earlier and more questions as below

  • How do you decide who will be the coordinator?
  • How do you want to reach consensus? by everyone agreeing on the same time or the majority agreeing on the same time is enough or something else?

Finally let’s say that 100 friends want to set a time to have a dinner party. This may need a dedicated coordinator to do the work. The coordinator will probably need to keep a list of all people. This is even not easy in real life and the process may need a few days or even more. I will ask all the questions earlier and also more questions as below.

  • What if the dedicated coordinator becomes sick and can’t perform the duty?
  • What if there are multiple coordinators and they work at the same time?
  • Who keep the list of all people and what if the list is missing?

How do we make sure that consensus can be made in the end?

Answering all the questions will help you understand how to solve the consensus problem in computer world.A few observation about computer world as below

  1. You can set simple rules to force everyone/every process/… to agree on the same thing. Let’s say that Process A and B agree on something. We can set a requirement and force C to agree on that if A and B agrees. In real life it might not work well that way
  2. You cannot make phone calls. Most likely “people” communicate with messages.
  3. The algorithm has to be efficient. You can’t use a few days to reach consensus on the time on the dinner party.
  4. It has to fault tolerant. One or more processes crashing should not block the consensus process if possible
  5. One process might take arbitrary time to process information. A message might take arbitrary time to deliver.

Let’s formalize what properties/requirements a consensus algorithm should satisfy.

  1. Agreement: everyone agrees on the same time for the dinner party even if he/she might not be able to come (in computer world a process might agree on something and then crash)
  2. Validity: the time should not be a random one from nowhere but be some value actually proposed by someone (thus the time makes sense)
  3. Termination: it can’t take forever to decide what time to have dinner. Otherwise it is meaningless. Interestingly it looks like it is happening in US politics. It seems that it will take forever for the GOP and democratic party to reach consensus on many issues like the immigration reform.

The agreement property and the validity property are called safety properties and the termination property is called liveness property. If you don’t understand safety and liveness, don’t worry. I just want to mention them since they will be useful concepts in my future posts. But simply speaking safety is about what is allowed and what is not allowed or nothing ‘bad’ happens. Liveness is about something must happen or something ‘good’ eventually happen.

Now you might have a natural question. Can consensus problem be solved? In real life it is possible in most cases but it looks like that it is not possible in some cases (e.g. no peace in some places of the world due to endless war/conflicts) but you are not sure. Optimistic people may say that 20 years later the war will be over and the peace will come. The pessimistic people may say that it is doomed. But scientifically speaking is consensus possible in all cases or not?

To answer that question we need to look at two system models as below. Basically how do we model the real world in computer world? There are two kinds of system models: synchronous systems and asynchronous systems. Something to keep in mind is that both are trying to model the real world but both are not accurate representation of real world. There is always gap between research/theory and practice.

A synchronous system has the below properties [3].

  • Every message is delivered within a fixed and known amount of time.
  • Every process takes steps at a fixed and know rate.
  • Everyone process has a clock, and all the clocks are synchronized.

On the contrary an asynchronous system doesn’t have these properties.

Here comes another natural question. Are the real world systems synchronous or asynchronous? Just look at the properties and think about it.

Can consensus be solved in synchronous system? The good news is yes. It can be solved even if there are server crashes. I will leave how to solve it as an exercise.

Can consensus be solved in asynchronous system? Unfortunately it is proved [1] that it is not possible. The famous result and proof is called FLP impossibility result. FLP are the acronym of the three authors’ last names. This result is not trivial and it won Dijkstra Prize. The result can be expressed very simply as below. If you want to understand more please feel free to read the paper [3] which is very easy to read.

A consensus is impossible in an asynchronous system if there is even one crash failure.

If consensus can’t be solved in asynchronous system then why do we still strive to find good distributed algorithms for it? it turned out that the systems in reality are not completely asynchronous. The important algorithm Paxos can actually guarantee that consensus can be reached as long as the system is eventually asynchronous (a system model closer to real world). Simply put it works very well in reality. It looks that this post is getting long and explaining Paxos may take even more text so I will stop here and continue the talk in my next Post.

[1] Michael J. Fischer, Nancy Lynch, and Mike Paterson | Impossibility of Distributed Consensus with One Faulty Process | April 1985 | http://discolab.rutgers.edu/classes/cs519-old/papers/p374-fischer.pdf

[2] wiki page for Consensus problem http://en.wikipedia.org/wiki/Consensus_(computer_science)

[3] Seth Gilbert, Nancy Lynch | Perspectives on the CAP Theorem | 2012 | http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6122006


Someone you have to know – Leslie Lamport and his Bakery Algorithm

If you don’t know about Leslie Lamport you can’t really say that you know about distributed systems. A not accurate analogy is that someone who claims to know about American history doesn’t know about George Washington or Thomas Jefferson. In case you know his name but not his Bakery algorithm I will show you why this algorithm is so interesting.

To put it short Mr. Lamport did seminal research work on distributed systems which laid the foundations of the theory of distributed systems. . He won the 2013 Turing Ward due to his work on distributed systems (I think that the award is overdue). His is also famous for the document preparation system LaTex and his work on temporal logic.

He is currently working as a principal researcher in Microsoft Research Silicon Valley Lab. He is 73 at the time of writing. One reason some people who are working on distributed systems might not know about him is because people working in engineering field read papers or read the history. The more important reason is that his work is usually at the very core or very low level of distributed systems. If you are a developer you probably don’t need to know about him to understand a system or get your work done. But a lot of us are using systems which are based on his work directly or indirectly every day. e.g. part of the system behind iCloud is using the Paxos algorithm invented by him. Many Google services are using Google Big Table which relies on GFS which uses Chubby which in turn uses Paxos at its core. Microsoft Azure and Bing which power numerous services are also using Paxos.

Leslie Lamport got his BS, MS and Phd degrees on Mathematics which I think is an advantage for him when he works as a computer scientist because he can write structured proofs easily. He has a lot of humor and wisdom which you can strongly feel it from his casual comments[2] and the interview[3]. His most cited paper is “Time, Clocks, and the Ordering of Events in a Distributed System.” and it was published in 1978 and among the most cite articles in Computer Science. The famous algorithm Paxos has a colorful history since it came out of his failed attempt to prove that no such algorithm existed[3]. The paper “The Part-Time Parliament” which introduced Paxos was written by him in 1990 but only published 8 years later in 1998. This paper won an ACM SIGOPS Hall of Fame Award in 2012 [2]. You can read the details at [7]. I strongly encourage that everyone should read the two papers. One caveat. Please read “Paxos made simple” (also by Lamport) instead for the Paxos algorithm. But if you can understand “The Part-Time Parliament” well the first time you read it you are probably a genius and should stop reading my blog. Also look at the years I put in here and you will have a rough idea about the development of the theory of distributed systems.

I particularly want to talk about his bakery algorithm for the below reasons.

  1. This is the first mutual exclusion algorithm I know which doesn’t use the low level mutual exclusion (e.g. CAS – Compare And Swap). In case you don’t think this is interesting the built-in mutual exclusion algorithms in Java, C# and other languages assume atomic reads and writes. The lock keyword and Interlocked class in C# use the low level hardware support (most likely CAS instruction). The synchronized keyword and AtomicInteger class in Java are also using the low level hard support (most likely CAS instruction).
  2. It is fault tolerant. If one thread/one process/one machine which participates the algorithm crashes the algorithm can still work.
  3. Lamport’s Bakery algorithm marked the beginning of his study on distributed algorithms.
  4. It is what Ms. Lamport is proud of the most in his career according to his interview [3].

The Bakery algorithm was introduced in his paper [5] in 1974. The analogy is very simple. Let’s say there is a bakery with a numbering machine at its entrance. Each customer needs to take a number which is unique. At any time only one customer can be served and that customer has the smallest number. In real life this is easy to understand it works perfectly. In computer world please think about the below problems.

  1. What if the number is being read while it is being generated for a single party (like a process)?
  2. What if multiple customers try to get the numbers at the same time and what if they get the same number?
  3. How does it work without low level hardware support?

I encourage everyone to read this paper (just 3 pages. Actually 2 pages with proper formatting).There are three sections in this paper: the problem, the solution and the proof. It is fun to read. The code is easy to understand. The proof might not be easy or interesting for most readers but you can just assume that it is right.

If you want to read code in java you can read the Wikipedia page [6]. Unfortunately the Java code on the Wikipedia is wrong. I didn’t bother to fix the Wikipedia because I want my readers to know what is wrong. The highlighted part is wrong and it should be entering.get(i) != 0. Otherwise the algorithm would lose the remarkable property that read and write can overlap.

  1.  public void lock(int pid) //thread ID
  2.  {
  3.      entering.set(pid, 1);
  4.      ticket.set(pid, max(ticket) + 1); //find max in the array and return it
  5.      entering.set(pid, 0);
  6.      for(int i = 0; i < ticket.length(); ++i)
  7.      {
  8.          if(i != pid)
  9.          {
  10.              while(entering.get(i) == 1){} //wait while other thread picks a ticket
  11.              while(ticket.get(i) != 0 &&( ticket.get(pid) > ticket.get(i) || (ticket.get(pid) == ticket.get(i) && pid > i)))
  12.              {}
  13.          }
  14.      }
  15.  }

[1] wikipedia http://en.wikipedia.org/wiki/Leslie_Lamport

[2] My writings http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html

[3] A Discussion with Leslie Lamport by Dejan Milojicic http://research.microsoft.com/en-us/um/people/lamport/pubs/ds-interview.pdf

[4] Leslie Lamport. “Time, Locks, …” http://research.microsoft.com/users/lamport/pubs/pubs.html#time-clocks

[5] The bakery algorithm (http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#bakery)

[6] wikipedia http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[7] The Part-Time Parliament http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#lamport-paxos

Start blogging on distributed systems and cloud computing

I have been working on building  cloud computing infrastructure as a developer for a few years. Overtime I came to realize that distributed systems play a vital role in cloud computing. In some sense cloud computing is really about building distributed systems and making them available to the public in the model of utility computing. Once I realized that I started to spend more time on researching on large scale distributed systems.

It turned out with no surprise that distributed systems are pretty complicated and the learning curve is steep. It really takes time and effort to understand the fundamentals and the real production distributed systems. I always believe that sharing my learning experience will help people who want to learn and also help me learn more. This is why I want to write blogs regularly to share my experience.

Distributed systems and cloud computing are huge topics and there is a lot to talk about. Here is my plan for sharing my learning experience. First I would like to talk about the basics and fundamentals of distributed systems like consensus problem and Paxos algorithm, CAP theorem, DHT (distributed hash table), distributed file system and so on. The reason is that without a good understanding the fundamentals it would be very hard to understand distributed systems. If you don’t believe me try to read any of the three papers (azure storage, dynamo, spanner) to see how much you can grasp. If you can understand all of them you don’t need to waste your time on reading my blog. Otherwise I believe that you can learn something from my blog. One of  my goals is that you will be able to at least understand all the 3 papers I mentioned. After talking about the basics and fundamentals of distributed systems I will talk about real production distributed systems like Big Table, Chubby, Spanner, Dynamo, Azure Storage, Cassandra, Zookeeper, Hadoop, Storm and so on. Then you can understand how real systems are built on top of the fundamentals, what are the engineering challenges and how they get solved. Believe me that there is a big gap between theory and practice in distributed systems area. The real systems need to fill the gap so you will see many interesting things by looking into the production systems.

Once we know enough about distributed systems I will talk about cloud computing which is a pretty hot area. I will probably not follow the exact order of finishing talking about distributed systems fundamentals first, then real distributed systems and then cloud computing. Instead I will talk about these topics in an indeterminate order. I will try to blog on a regular basis.

In the end people who follow on my blog should have a very good understanding of distributed systems and cloud computing. It should be helpful for job interviews, your projects at work, your curiosity about technologies and many other more cases.