Monthly Archives: May 2014

An Important Problem and an Important Algorithm – Consensus and Paxos

Consensus is such an important problem and Paxos is such an important algorithm that I have to talk about them. If you ever see a web site or blog talking about distributed fundamentals and systems but not mentioning Consensus and Paxos you had better skip that web site or blog.

Why is it important? First it is so important in real world. We live under so many laws which were passed by consensus. We have so many meetings and emails at work mainly to reach agreement on something. In computer world processes/machines need to agree on many things such as committing or aborting a transaction, choosing a leader, granting a lock, replicated state machine, atomic broadcast and so on. These things are often the foundation of distributed systems and critical to higher level applications and services.

Then you might wonder why we need another post about Consensus and Paxos since there are already so many of them. The funny thing is that the reason we need another post is exactly because there are so many. More specifically here are my reasons.

  1. It is easy to get confused since there are so many articles. Even worse some articles have wrong information.
  2. An interesting observation is that some articles which explain Paxos are more complicated than the paper itself
  3. I won’t be able to explain Paxos very well in one post but I will point out one way by which if one follows he/she should be able to understand Paxos algorithm well but the route might not an easy one to everybody.
  4. I will probably talk about the implementation of Paxos a little bit.

I also have the below suggestions.

  1. Set right expectation. To mortals including me Consensus is a hard problem and Paxos is a complicated algorithm. Some posts do give people wrong impression that it is easy then people might not take enough time and efforts to understand it or work on it.
  2. Don’t read too many articles/papers (unfortunately I already did). You just need to read the good and essential ones. I will let you know what are the good and essential ones.
  3. If you get questions you can’t figure out after trying hard, ask people around or ask on Quora ( There are quite a few folks there who understand distributed systems very well.

Consensus problem itself is easy to understand. Basically how do multiple people/parties/participants/processes/nodes/… reach agreement? Let’s look at real world examples first. You are Friend A. You and Friend B want to set a time to have dinner together. It is easy. You can just call B to confirm a time which works for both of you. However please try to answer the below questions.

  • What if B is not available when you call?
  • What if you can’t call and you can only send messages?
  • How do you know your message reaches B?
  • If you ask B to send acknowledge back how does B know you receive his/her acknowledgement?

Let’s add Friend C to the case. Then one person among A,B and C can be the coordinator to communicate with the other two and negotiate a time. This is more work but still easy in real life. But I can ask the same questions earlier and more questions as below

  • How do you decide who will be the coordinator?
  • How do you want to reach consensus? by everyone agreeing on the same time or the majority agreeing on the same time is enough or something else?

Finally let’s say that 100 friends want to set a time to have a dinner party. This may need a dedicated coordinator to do the work. The coordinator will probably need to keep a list of all people. This is even not easy in real life and the process may need a few days or even more. I will ask all the questions earlier and also more questions as below.

  • What if the dedicated coordinator becomes sick and can’t perform the duty?
  • What if there are multiple coordinators and they work at the same time?
  • Who keep the list of all people and what if the list is missing?

How do we make sure that consensus can be made in the end?

Answering all the questions will help you understand how to solve the consensus problem in computer world.A few observation about computer world as below

  1. You can set simple rules to force everyone/every process/… to agree on the same thing. Let’s say that Process A and B agree on something. We can set a requirement and force C to agree on that if A and B agrees. In real life it might not work well that way
  2. You cannot make phone calls. Most likely “people” communicate with messages.
  3. The algorithm has to be efficient. You can’t use a few days to reach consensus on the time on the dinner party.
  4. It has to fault tolerant. One or more processes crashing should not block the consensus process if possible
  5. One process might take arbitrary time to process information. A message might take arbitrary time to deliver.

Let’s formalize what properties/requirements a consensus algorithm should satisfy.

  1. Agreement: everyone agrees on the same time for the dinner party even if he/she might not be able to come (in computer world a process might agree on something and then crash)
  2. Validity: the time should not be a random one from nowhere but be some value actually proposed by someone (thus the time makes sense)
  3. Termination: it can’t take forever to decide what time to have dinner. Otherwise it is meaningless. Interestingly it looks like it is happening in US politics. It seems that it will take forever for the GOP and democratic party to reach consensus on many issues like the immigration reform.

The agreement property and the validity property are called safety properties and the termination property is called liveness property. If you don’t understand safety and liveness, don’t worry. I just want to mention them since they will be useful concepts in my future posts. But simply speaking safety is about what is allowed and what is not allowed or nothing ‘bad’ happens. Liveness is about something must happen or something ‘good’ eventually happen.

Now you might have a natural question. Can consensus problem be solved? In real life it is possible in most cases but it looks like that it is not possible in some cases (e.g. no peace in some places of the world due to endless war/conflicts) but you are not sure. Optimistic people may say that 20 years later the war will be over and the peace will come. The pessimistic people may say that it is doomed. But scientifically speaking is consensus possible in all cases or not?

To answer that question we need to look at two system models as below. Basically how do we model the real world in computer world? There are two kinds of system models: synchronous systems and asynchronous systems. Something to keep in mind is that both are trying to model the real world but both are not accurate representation of real world. There is always gap between research/theory and practice.

A synchronous system has the below properties [3].

  • Every message is delivered within a fixed and known amount of time.
  • Every process takes steps at a fixed and know rate.
  • Everyone process has a clock, and all the clocks are synchronized.

On the contrary an asynchronous system doesn’t have these properties.

Here comes another natural question. Are the real world systems synchronous or asynchronous? Just look at the properties and think about it.

Can consensus be solved in synchronous system? The good news is yes. It can be solved even if there are server crashes. I will leave how to solve it as an exercise.

Can consensus be solved in asynchronous system? Unfortunately it is proved [1] that it is not possible. The famous result and proof is called FLP impossibility result. FLP are the acronym of the three authors’ last names. This result is not trivial and it won Dijkstra Prize. The result can be expressed very simply as below. If you want to understand more please feel free to read the paper [3] which is very easy to read.

A consensus is impossible in an asynchronous system if there is even one crash failure.

If consensus can’t be solved in asynchronous system then why do we still strive to find good distributed algorithms for it? it turned out that the systems in reality are not completely asynchronous. The important algorithm Paxos can actually guarantee that consensus can be reached as long as the system is eventually asynchronous (a system model closer to real world). Simply put it works very well in reality. It looks that this post is getting long and explaining Paxos may take even more text so I will stop here and continue the talk in my next Post.

[1] Michael J. Fischer, Nancy Lynch, and Mike Paterson | Impossibility of Distributed Consensus with One Faulty Process | April 1985 |

[2] wiki page for Consensus problem

[3] Seth Gilbert, Nancy Lynch | Perspectives on the CAP Theorem | 2012 |


Someone you have to know – Leslie Lamport and his Bakery Algorithm

If you don’t know about Leslie Lamport you can’t really say that you know about distributed systems. A not accurate analogy is that someone who claims to know about American history doesn’t know about George Washington or Thomas Jefferson. In case you know his name but not his Bakery algorithm I will show you why this algorithm is so interesting.

To put it short Mr. Lamport did seminal research work on distributed systems which laid the foundations of the theory of distributed systems. . He won the 2013 Turing Ward due to his work on distributed systems (I think that the award is overdue). His is also famous for the document preparation system LaTex and his work on temporal logic.

He is currently working as a principal researcher in Microsoft Research Silicon Valley Lab. He is 73 at the time of writing. One reason some people who are working on distributed systems might not know about him is because people working in engineering field read papers or read the history. The more important reason is that his work is usually at the very core or very low level of distributed systems. If you are a developer you probably don’t need to know about him to understand a system or get your work done. But a lot of us are using systems which are based on his work directly or indirectly every day. e.g. part of the system behind iCloud is using the Paxos algorithm invented by him. Many Google services are using Google Big Table which relies on GFS which uses Chubby which in turn uses Paxos at its core. Microsoft Azure and Bing which power numerous services are also using Paxos.

Leslie Lamport got his BS, MS and Phd degrees on Mathematics which I think is an advantage for him when he works as a computer scientist because he can write structured proofs easily. He has a lot of humor and wisdom which you can strongly feel it from his casual comments[2] and the interview[3]. His most cited paper is “Time, Clocks, and the Ordering of Events in a Distributed System.” and it was published in 1978 and among the most cite articles in Computer Science. The famous algorithm Paxos has a colorful history since it came out of his failed attempt to prove that no such algorithm existed[3]. The paper “The Part-Time Parliament” which introduced Paxos was written by him in 1990 but only published 8 years later in 1998. This paper won an ACM SIGOPS Hall of Fame Award in 2012 [2]. You can read the details at [7]. I strongly encourage that everyone should read the two papers. One caveat. Please read “Paxos made simple” (also by Lamport) instead for the Paxos algorithm. But if you can understand “The Part-Time Parliament” well the first time you read it you are probably a genius and should stop reading my blog. Also look at the years I put in here and you will have a rough idea about the development of the theory of distributed systems.

I particularly want to talk about his bakery algorithm for the below reasons.

  1. This is the first mutual exclusion algorithm I know which doesn’t use the low level mutual exclusion (e.g. CAS – Compare And Swap). In case you don’t think this is interesting the built-in mutual exclusion algorithms in Java, C# and other languages assume atomic reads and writes. The lock keyword and Interlocked class in C# use the low level hardware support (most likely CAS instruction). The synchronized keyword and AtomicInteger class in Java are also using the low level hard support (most likely CAS instruction).
  2. It is fault tolerant. If one thread/one process/one machine which participates the algorithm crashes the algorithm can still work.
  3. Lamport’s Bakery algorithm marked the beginning of his study on distributed algorithms.
  4. It is what Ms. Lamport is proud of the most in his career according to his interview [3].

The Bakery algorithm was introduced in his paper [5] in 1974. The analogy is very simple. Let’s say there is a bakery with a numbering machine at its entrance. Each customer needs to take a number which is unique. At any time only one customer can be served and that customer has the smallest number. In real life this is easy to understand it works perfectly. In computer world please think about the below problems.

  1. What if the number is being read while it is being generated for a single party (like a process)?
  2. What if multiple customers try to get the numbers at the same time and what if they get the same number?
  3. How does it work without low level hardware support?

I encourage everyone to read this paper (just 3 pages. Actually 2 pages with proper formatting).There are three sections in this paper: the problem, the solution and the proof. It is fun to read. The code is easy to understand. The proof might not be easy or interesting for most readers but you can just assume that it is right.

If you want to read code in java you can read the Wikipedia page [6]. Unfortunately the Java code on the Wikipedia is wrong. I didn’t bother to fix the Wikipedia because I want my readers to know what is wrong. The highlighted part is wrong and it should be entering.get(i) != 0. Otherwise the algorithm would lose the remarkable property that read and write can overlap.

  1.  public void lock(int pid) //thread ID
  2.  {
  3.      entering.set(pid, 1);
  4.      ticket.set(pid, max(ticket) + 1); //find max in the array and return it
  5.      entering.set(pid, 0);
  6.      for(int i = 0; i < ticket.length(); ++i)
  7.      {
  8.          if(i != pid)
  9.          {
  10.              while(entering.get(i) == 1){} //wait while other thread picks a ticket
  11.              while(ticket.get(i) != 0 &&( ticket.get(pid) > ticket.get(i) || (ticket.get(pid) == ticket.get(i) && pid > i)))
  12.              {}
  13.          }
  14.      }
  15.  }

[1] wikipedia

[2] My writings

[3] A Discussion with Leslie Lamport by Dejan Milojicic

[4] Leslie Lamport. “Time, Locks, …”

[5] The bakery algorithm (

[6] wikipedia

[7] The Part-Time Parliament