Category Archives: Uncategorized

TF401189: The source branch has been modified since the last merge attempt

If you are using Visual Studio Online and setup your project using TFS doing a code review is straightforward in visual studio.

However ff you are using Visual Studio Online and setup your project Git you might wonder how you can conduct a code review. You may do a quick search and then try ‘new pull request’ . You might get a weird error message you don’t understand.

TF401189: The source branch has been modified since the last merge attempt

Then you do another search (bing or google) you won’t find anything useful.

After trial and error I found that the root cause is that I didn’t use a topic branch. It is that simple.

Two useful links as below (unfortunately all the images are missing)

Debugging a weird process crash issue

I recently debugged a very interesting process crash issue. The code was running in the cloud so I could not just attach the debugger as easily as on my dev box. Besides using debugger should be the last option in my opinion. So I logged on to the machine. Well before that I had to go a through a process to get permission and get the environment ready for security/compliance reason. I observed that the process was recycling by using taskmgr. So I did the below things
1. Checked the logs. No exceptions. No errors. No logs related to the crash
2. Checked crash dump files. No files were there
3. Checked OS eventlog. No crash events. Normally for any process crash there is at least one event

I never saw a process crash like this. But I knew the related code which caused this. It was a background task kicked off by calling Task.Run(). I thought that there must be some unhandled exceptions but I checked the code. There was already a catch block which caught all exceptions. However it was still possible that the catch block might throw exception. So I put another catch all inside the catch block to make sure that no unhandled exception was thrown.

However it did not work. The process was still crashing. I asked around about how a process could crash since I was relatively new to the code base. I got some pointers but none of them could match what I observed.

Then I had to turn to the last option. I copied windbg to all the nodes (remember that this is distributed system and there are multiple replicas) and hooked it up with the process. Boom! I got an exception in the debugger and then the process crashed.

I looked at the exception which was a normal .net exception. I could not think of a reason how it could cause crash. Then I looked into the call stack and saw the below line.
000007f9`8538bb27 : 0000001b`9c29cf80 0000001c`2c50c880 0000001c`2c50c888 0000001c`2c50c890 : mscorlib_ni!System.Environment.Exit(Int32)+0x7b

That was really weird. My code was calling a library from another team. So I got the tool ILSpy (I don’t use .net reflector anymore btw) and looked into the code. Man the code was calling Environment.Exit() when a certain exception threw! That explained everything. As for why the coding was doing so this was a new library and I guess that the piece of code was copied from somewhere else.

From the above text it looked that it didn’t take much time. In reality two days passed.

StackOverflow Exception, Long time no see

I have not seen a stackoverflow exception for a long time. Today I happened to bump into one in random test code. I will explain how it happened. What the code is used for and why it was written this way are not interesting to me and are not in the scope of this post.

    public class ChildClass : BaseClass
        protected override void Dispose(bool disposing)
            if (disposing)
                //DO something

    public class BaseClass
        public void Dispose()

        protected virtual void Dispose(bool disposing)
            if (disposing)
                //DO something

As you can see the stackoverflow exception will happen as the below
1. Child.Dispose(bool)
2. Base.Dispose()
3. Child.Dispose(bool)
4. Base.Dispose()

Persistent MAC Address in Cloud (Microsoft Azure /Amazon AWS / Google GCP / OpenStack)

What is a NIC? It is the network adaptor on your machine. MAC address is the physical address of the adaptor. You can consider it as fixed on your machine for this discussion. However the NIC you get on a VM in cloud is a virtualized one. If you delete a VM and create a new one you will probably get a new MAC address. Why do we need a persistent MAC address? Many licensed software rely on MAC for the licenses. If you want to move licenses from one VM to another persistent MAC address becomes very useful.

It looks like a small feature. Let’s look at a few major cloud provides to see how they deal with this.

Microsoft Azure

It is not available yet. According to the post it is already planned.

Amazon AWS
It is already supported. The below statement is from the official AWS documentation.
The interface maintains its private IP addresses, Elastic IP addresses, and MAC address.

When did AWS start supporting this? I have no idea. From this post it was not supported yet in 3/2011 but it was supported in 2/2012.

Google Cloud Platform (GCP)
Google does not seem to support persistent MAC address from the public official doc.

It does not seem to support persistent MAC address from the public official doc.

C# import .pfx file to cert store and CryptographicException: Keyset does not exist

I had to add a .pfx file programmatically today but got an exception when accessing this file in code after importing . The simple demo code and the exception are as follows.

X509Store certificateStore = new X509Store(StoreName.My, StoreLocation.CurrentUser);

System.Security.Cryptography.CryptographicException: Keyset does not exist at System.Security.Cryptography.Utils.CreateProvHandle(CspParameters parameters, Boolean randomKeyContainer) at System.Security.Cryptography.Utils.GetKeyPairHelper(CspAlgorithmType keyType, CspParameters parameters, Boolean randomKeyContainer, Int32 dwKeySize, SafeP rovHandle& safeProvHandle, SafeKeyHandle& safeKeyHandle) at System.Security.Cryptography.RSACryptoServiceProvider.GetKeyPair() at System.Security.Cryptography.X509Certificates.X509Certificate2.get_PrivateKey()

I did a search and the top results took me to the direction of adding permission on the private key. There are two versions of code for this task.

private static void AddAccessToCertificate(X509Certificate2 cert)
  RSACryptoServiceProvider rsa = cert.PrivateKey as RSACryptoServiceProvider;
  if (rsa != null) {
    string keyfilepath = FindKeyLocation(rsa.CspKeyContainerInfo.UniqueKeyContainerName);
    FileInfo file = new FileInfo(keyfilepath + “\\” + rsa.CspKeyContainerInfo.UniqueKeyContainerName);
    FileSecurity fs = file.GetAccessControl();
    var sid = new SecurityIdentifier(WellKnownSidType.WorldSid, null);
    var everyone = (NTAccount)sid.Translate(typeof(NTAccount)); 
    fs.AddAccessRule(new FileSystemAccessRule(everyone, FileSystemRights.FullControl, AccessControlType.Allow));

private static void AddAccessToCertificate(X509Certificate2 cert)
  RSACryptoServiceProvider rsa = cert.PrivateKey as RSACryptoServiceProvider;
  if (rsa != null) {
    var cspParams = new CspParameters(rsa.CspKeyContainerInfo.ProviderType, rsa.CspKeyContainerInfo.ProviderName, rsa.CspKeyContainerInfo.KeyContainerName)
    Flags = CspProviderFlags.UseExistingKey | CspProviderFlags.UseMachineKeyStore, CryptoKeySecurity = rsa.CspKeyContainerInfo.CryptoKeySecurity};
   cspParams.CryptoKeySecurity.AddAccessRule(new CryptoKeyAccessRule(WindowsIdentity.GetCurrent().User,
   CryptoKeyRights.FullControl, AccessControlType.Allow));
   using (var rsa2 = new RSACryptoServiceProvider(cspParams))
   { // Only created to persist the rule change in the CryptoKeySecurity }

Unfortunately both solutions did not work for me. This code was just a part of my other code. Every time I had to change the code, build and re-run test case. It took me a couple of hours to prove that both solutions didn’t work. I think that it was time to try in differet direction. So I read more search liks. It turned out that the fix is simple. Just adding X509KeyStorageFlags.PersistKeySet in the constructor of X509Certificate2 made it work.

var cert = new X509Certificate2(certificatePath, password, X509KeyStorageFlags.PersistKeySet);
X509Store certificateStore = new X509Store(StoreName.My, StoreLocation.CurrentUser);

The source of this solution is

TurboTax 2014 installation Error: “.NET Framework Verification Tool can’t be found”

This was an very annoying TurboTax 2013 issue which took me a couple of hours. Look like that the solution I found applies to TurboTax 2014 as well. More importantly the solution is so simple and I think that in most cases people don’t need to look at other solutions at all.

Forgot password for your Azure VM?

Unfortunately Azure Portal doesn’t have this simple feature. The only solution I know is using Azure Powershell Cmdlets. There are different articles talking about this. But they are not concise at all. To me I just need simple steps to reset password for one single VM. So here you go. Just “8 simple steps” you can get your VM back.

  1. Install Azure PS if needed
  2. Set-ExecutionPolicy RemoteSigned
  3. Import-Module Azure and Add-AzureAccount
  4. Select-AzureSubscription -Default “Subscription-1” // use Get-AzureSubscription to find the name if needed
  5. $vm = Get-AzureVM -Name paulloutest // use Get-AzureVM to list all VMs if needed
  6. $adminCred = Get-Credential // DO NOT type machine name, just pure user name
  7. Set-AzureVMAccessExtension -VM $vm -UserName $adminCred.UserName -Password $adminCred.GetNetworkCredential().Password -ReferenceName “VMAccessAgent” | Update-AzureVM
  8. Restart-AzureVM -ServiceName $vm.ServiceName -Name $vm.Name

A design pattern for caching – two tier caching with strong consistency between servers

This is a very interesting and useful pattern in practice. I haven’t seen any books for classic patterns to talk about this. But I think that it should be a classic one for caching. To make it clear I think that it is an old pattern. I do not invent it and I do not discover it. I designed the solution for our problem then I realized that it should be a pattern. I just want to write it down since I didn’t know it is a pattern before designing the solution and the pattern is useful in real world. More importantly the solution was not so obvious to me before I did the work so it is worth writing it down.


  • You have a not scalable data source which has relatively stable data (e.g. config or metadata) but the data does change.
  • You have many servers (front ends and backends) to access that data. Since the data source is not scalable you need caching.
  • You want strong consistency between the servers. i.e. when one server sees a new version of data all other servers need to see the new version or newer version(s). Simple put it is shared caching.

Old Design – one tier caching

First you may ask why I didn’t use Zookeeper or equivalent service. Yes that would be ideal if it is available. Sometimes it is just not available in your project but you need to solve this problem.

Take a look at the old design as below. Azure Table was used for the shared caching. It is very scalable and actually fast than many people thought (tens of milliseconds for each access).


Why we need a new design

  • The config/meta data grows and one column/property doesn’t fit anymore in Azure Table. The size limit for one row/entity is 1MB. The size limit for one column/property is 64kb. The data is over 64kb now
  • As the data grows always getting the full data is not optimal
  • This is not a main reason for the design change or it is also not a major issue but the servers have clock skew. Sometimes it is not ideal and it is hard to reason about it or explain to others. Take an extreme case as an example. Let’s say you have only one server and it might take much longer for the servers to refresh data in case of clock skew

New Design – two tier caching

I call it two tier caching since it has two tiers now – In Memory and Azure Blob. The changes are as below

  • In memory cache is introduced
  • Azure Blob is used (in this case it performs as good as Azure Table)
  • When a server tries to read the data from the memory it will try to compare the version in the cache to the version in the blob first. If they are the same it will skip reading the data from the blob. Otherwise it will refresh the cache
  • A local refresh time is introduced so clock skew doesn’t matter anymore. so every a few mins one server will refresh the blob and all other servers will get the data from the blob and reset the local refresh time.


The new design solved our problem very well. The code for this pattern is not much. I left out some technical details on purpose so readers can think about them. Please try to answer the below questions.

  • What do you use for the version number? How to update it? Should you keep the version in a NoSql database or with the blob?
  • How about concurrent reads and writes on that single blob? We have hundreds of servers. What is the limit?
  • What is the right exception handling? The data source, the blob or the servers might be down.
  • How to test your code?
  • What tracing do you need for live site troubleshooting?
  • Are you confident in doing it in a hotfix? or you would rather do it in a major release.


An Important Problem and an Important Algorithm – Consensus and Paxos

Consensus is such an important problem and Paxos is such an important algorithm that I have to talk about them. If you ever see a web site or blog talking about distributed fundamentals and systems but not mentioning Consensus and Paxos you had better skip that web site or blog.

Why is it important? First it is so important in real world. We live under so many laws which were passed by consensus. We have so many meetings and emails at work mainly to reach agreement on something. In computer world processes/machines need to agree on many things such as committing or aborting a transaction, choosing a leader, granting a lock, replicated state machine, atomic broadcast and so on. These things are often the foundation of distributed systems and critical to higher level applications and services.

Then you might wonder why we need another post about Consensus and Paxos since there are already so many of them. The funny thing is that the reason we need another post is exactly because there are so many. More specifically here are my reasons.

  1. It is easy to get confused since there are so many articles. Even worse some articles have wrong information.
  2. An interesting observation is that some articles which explain Paxos are more complicated than the paper itself
  3. I won’t be able to explain Paxos very well in one post but I will point out one way by which if one follows he/she should be able to understand Paxos algorithm well but the route might not an easy one to everybody.
  4. I will probably talk about the implementation of Paxos a little bit.

I also have the below suggestions.

  1. Set right expectation. To mortals including me Consensus is a hard problem and Paxos is a complicated algorithm. Some posts do give people wrong impression that it is easy then people might not take enough time and efforts to understand it or work on it.
  2. Don’t read too many articles/papers (unfortunately I already did). You just need to read the good and essential ones. I will let you know what are the good and essential ones.
  3. If you get questions you can’t figure out after trying hard, ask people around or ask on Quora ( There are quite a few folks there who understand distributed systems very well.

Consensus problem itself is easy to understand. Basically how do multiple people/parties/participants/processes/nodes/… reach agreement? Let’s look at real world examples first. You are Friend A. You and Friend B want to set a time to have dinner together. It is easy. You can just call B to confirm a time which works for both of you. However please try to answer the below questions.

  • What if B is not available when you call?
  • What if you can’t call and you can only send messages?
  • How do you know your message reaches B?
  • If you ask B to send acknowledge back how does B know you receive his/her acknowledgement?

Let’s add Friend C to the case. Then one person among A,B and C can be the coordinator to communicate with the other two and negotiate a time. This is more work but still easy in real life. But I can ask the same questions earlier and more questions as below

  • How do you decide who will be the coordinator?
  • How do you want to reach consensus? by everyone agreeing on the same time or the majority agreeing on the same time is enough or something else?

Finally let’s say that 100 friends want to set a time to have a dinner party. This may need a dedicated coordinator to do the work. The coordinator will probably need to keep a list of all people. This is even not easy in real life and the process may need a few days or even more. I will ask all the questions earlier and also more questions as below.

  • What if the dedicated coordinator becomes sick and can’t perform the duty?
  • What if there are multiple coordinators and they work at the same time?
  • Who keep the list of all people and what if the list is missing?

How do we make sure that consensus can be made in the end?

Answering all the questions will help you understand how to solve the consensus problem in computer world.A few observation about computer world as below

  1. You can set simple rules to force everyone/every process/… to agree on the same thing. Let’s say that Process A and B agree on something. We can set a requirement and force C to agree on that if A and B agrees. In real life it might not work well that way
  2. You cannot make phone calls. Most likely “people” communicate with messages.
  3. The algorithm has to be efficient. You can’t use a few days to reach consensus on the time on the dinner party.
  4. It has to fault tolerant. One or more processes crashing should not block the consensus process if possible
  5. One process might take arbitrary time to process information. A message might take arbitrary time to deliver.

Let’s formalize what properties/requirements a consensus algorithm should satisfy.

  1. Agreement: everyone agrees on the same time for the dinner party even if he/she might not be able to come (in computer world a process might agree on something and then crash)
  2. Validity: the time should not be a random one from nowhere but be some value actually proposed by someone (thus the time makes sense)
  3. Termination: it can’t take forever to decide what time to have dinner. Otherwise it is meaningless. Interestingly it looks like it is happening in US politics. It seems that it will take forever for the GOP and democratic party to reach consensus on many issues like the immigration reform.

The agreement property and the validity property are called safety properties and the termination property is called liveness property. If you don’t understand safety and liveness, don’t worry. I just want to mention them since they will be useful concepts in my future posts. But simply speaking safety is about what is allowed and what is not allowed or nothing ‘bad’ happens. Liveness is about something must happen or something ‘good’ eventually happen.

Now you might have a natural question. Can consensus problem be solved? In real life it is possible in most cases but it looks like that it is not possible in some cases (e.g. no peace in some places of the world due to endless war/conflicts) but you are not sure. Optimistic people may say that 20 years later the war will be over and the peace will come. The pessimistic people may say that it is doomed. But scientifically speaking is consensus possible in all cases or not?

To answer that question we need to look at two system models as below. Basically how do we model the real world in computer world? There are two kinds of system models: synchronous systems and asynchronous systems. Something to keep in mind is that both are trying to model the real world but both are not accurate representation of real world. There is always gap between research/theory and practice.

A synchronous system has the below properties [3].

  • Every message is delivered within a fixed and known amount of time.
  • Every process takes steps at a fixed and know rate.
  • Everyone process has a clock, and all the clocks are synchronized.

On the contrary an asynchronous system doesn’t have these properties.

Here comes another natural question. Are the real world systems synchronous or asynchronous? Just look at the properties and think about it.

Can consensus be solved in synchronous system? The good news is yes. It can be solved even if there are server crashes. I will leave how to solve it as an exercise.

Can consensus be solved in asynchronous system? Unfortunately it is proved [1] that it is not possible. The famous result and proof is called FLP impossibility result. FLP are the acronym of the three authors’ last names. This result is not trivial and it won Dijkstra Prize. The result can be expressed very simply as below. If you want to understand more please feel free to read the paper [3] which is very easy to read.

A consensus is impossible in an asynchronous system if there is even one crash failure.

If consensus can’t be solved in asynchronous system then why do we still strive to find good distributed algorithms for it? it turned out that the systems in reality are not completely asynchronous. The important algorithm Paxos can actually guarantee that consensus can be reached as long as the system is eventually asynchronous (a system model closer to real world). Simply put it works very well in reality. It looks that this post is getting long and explaining Paxos may take even more text so I will stop here and continue the talk in my next Post.

[1] Michael J. Fischer, Nancy Lynch, and Mike Paterson | Impossibility of Distributed Consensus with One Faulty Process | April 1985 |

[2] wiki page for Consensus problem

[3] Seth Gilbert, Nancy Lynch | Perspectives on the CAP Theorem | 2012 |


Someone you have to know – Leslie Lamport and his Bakery Algorithm

If you don’t know about Leslie Lamport you can’t really say that you know about distributed systems. A not accurate analogy is that someone who claims to know about American history doesn’t know about George Washington or Thomas Jefferson. In case you know his name but not his Bakery algorithm I will show you why this algorithm is so interesting.

To put it short Mr. Lamport did seminal research work on distributed systems which laid the foundations of the theory of distributed systems. . He won the 2013 Turing Ward due to his work on distributed systems (I think that the award is overdue). His is also famous for the document preparation system LaTex and his work on temporal logic.

He is currently working as a principal researcher in Microsoft Research Silicon Valley Lab. He is 73 at the time of writing. One reason some people who are working on distributed systems might not know about him is because people working in engineering field read papers or read the history. The more important reason is that his work is usually at the very core or very low level of distributed systems. If you are a developer you probably don’t need to know about him to understand a system or get your work done. But a lot of us are using systems which are based on his work directly or indirectly every day. e.g. part of the system behind iCloud is using the Paxos algorithm invented by him. Many Google services are using Google Big Table which relies on GFS which uses Chubby which in turn uses Paxos at its core. Microsoft Azure and Bing which power numerous services are also using Paxos.

Leslie Lamport got his BS, MS and Phd degrees on Mathematics which I think is an advantage for him when he works as a computer scientist because he can write structured proofs easily. He has a lot of humor and wisdom which you can strongly feel it from his casual comments[2] and the interview[3]. His most cited paper is “Time, Clocks, and the Ordering of Events in a Distributed System.” and it was published in 1978 and among the most cite articles in Computer Science. The famous algorithm Paxos has a colorful history since it came out of his failed attempt to prove that no such algorithm existed[3]. The paper “The Part-Time Parliament” which introduced Paxos was written by him in 1990 but only published 8 years later in 1998. This paper won an ACM SIGOPS Hall of Fame Award in 2012 [2]. You can read the details at [7]. I strongly encourage that everyone should read the two papers. One caveat. Please read “Paxos made simple” (also by Lamport) instead for the Paxos algorithm. But if you can understand “The Part-Time Parliament” well the first time you read it you are probably a genius and should stop reading my blog. Also look at the years I put in here and you will have a rough idea about the development of the theory of distributed systems.

I particularly want to talk about his bakery algorithm for the below reasons.

  1. This is the first mutual exclusion algorithm I know which doesn’t use the low level mutual exclusion (e.g. CAS – Compare And Swap). In case you don’t think this is interesting the built-in mutual exclusion algorithms in Java, C# and other languages assume atomic reads and writes. The lock keyword and Interlocked class in C# use the low level hardware support (most likely CAS instruction). The synchronized keyword and AtomicInteger class in Java are also using the low level hard support (most likely CAS instruction).
  2. It is fault tolerant. If one thread/one process/one machine which participates the algorithm crashes the algorithm can still work.
  3. Lamport’s Bakery algorithm marked the beginning of his study on distributed algorithms.
  4. It is what Ms. Lamport is proud of the most in his career according to his interview [3].

The Bakery algorithm was introduced in his paper [5] in 1974. The analogy is very simple. Let’s say there is a bakery with a numbering machine at its entrance. Each customer needs to take a number which is unique. At any time only one customer can be served and that customer has the smallest number. In real life this is easy to understand it works perfectly. In computer world please think about the below problems.

  1. What if the number is being read while it is being generated for a single party (like a process)?
  2. What if multiple customers try to get the numbers at the same time and what if they get the same number?
  3. How does it work without low level hardware support?

I encourage everyone to read this paper (just 3 pages. Actually 2 pages with proper formatting).There are three sections in this paper: the problem, the solution and the proof. It is fun to read. The code is easy to understand. The proof might not be easy or interesting for most readers but you can just assume that it is right.

If you want to read code in java you can read the Wikipedia page [6]. Unfortunately the Java code on the Wikipedia is wrong. I didn’t bother to fix the Wikipedia because I want my readers to know what is wrong. The highlighted part is wrong and it should be entering.get(i) != 0. Otherwise the algorithm would lose the remarkable property that read and write can overlap.

  1.  public void lock(int pid) //thread ID
  2.  {
  3.      entering.set(pid, 1);
  4.      ticket.set(pid, max(ticket) + 1); //find max in the array and return it
  5.      entering.set(pid, 0);
  6.      for(int i = 0; i < ticket.length(); ++i)
  7.      {
  8.          if(i != pid)
  9.          {
  10.              while(entering.get(i) == 1){} //wait while other thread picks a ticket
  11.              while(ticket.get(i) != 0 &&( ticket.get(pid) > ticket.get(i) || (ticket.get(pid) == ticket.get(i) && pid > i)))
  12.              {}
  13.          }
  14.      }
  15.  }

[1] wikipedia

[2] My writings

[3] A Discussion with Leslie Lamport by Dejan Milojicic

[4] Leslie Lamport. “Time, Locks, …”

[5] The bakery algorithm (

[6] wikipedia

[7] The Part-Time Parliament