While running through the “New Migration Batch” wizard, the remote endpoint (the on-premises Exchange server) is tested for availability. After running this step, the following error is displayed:

[Screenshot: the error displayed by the wizard]

By itself, this error message does not reveal much information as to what might be causing the connection issues. In the background, the wizard actually leverages the “Test-MigrationServerAvailability” cmdlet. If you run this cmdlet yourself, you will get a lot more information:

[Screenshot: output of the Test-MigrationServerAvailability cmdlet]

In this particular case, you’ll see that the issue is caused by a 501 response from the on-premises server. The question is of course: why? We had recently moved a number of mailboxes without encountering the issue. The only thing that had changed between then and now is that we reconfigured the load balancers in front of Exchange to use Layer 7 instead of Layer 4. That is why I shifted my attention to the load balancers.

While reproducing the error, I took a look at the “System Message File” log in the KEMP load balancer. This log can be found under Logging Options, System Log Files. Although I didn’t expect to see much here, I saw the following message which drew my attention:

kernel: L7: badrequest-client_read [157.56.251.92:61541->192.168.2.130:443] (-501): <s:Envelope ? , 0 [hlen 1270, nhdrs 8]
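If you need to pull the client address and error code out of many such entries, a small parser helps. The following is only a sketch: the pattern is inferred from the single log line quoted above, the names `L7_LOG` and `parse_l7_line` are made up for illustration, and other Load Master messages may well use a different layout.

```python
import re

# Pattern inferred from the one log line quoted above (an assumption, not a
# documented KEMP log format).
L7_LOG = re.compile(
    r"L7: (?P<event>\S+) "
    r"\[(?P<src_ip>[\d.]+):(?P<src_port>\d+)->(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)\] "
    r"\((?P<code>-?\d+)\)"
)


def parse_l7_line(line):
    """Return the event, endpoints and code of an L7 log line, or None if it does not match."""
    match = L7_LOG.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or filtered on.
    for key in ("src_port", "dst_port", "code"):
        fields[key] = int(fields[key])
    return fields


SAMPLE = (
    "kernel: L7: badrequest-client_read [157.56.251.92:61541->192.168.2.130:443] "
    "(-501): <s:Envelope ? , 0 [hlen 1270, nhdrs 8]"
)

if __name__ == "__main__":
    print(parse_l7_line(SAMPLE))
```

The `src_ip` field is the address to look up when you want to confirm where the rejected request came from.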

A quick lookup showed that the 157.56.251.92 address did indeed belong to Microsoft. So now I knew for sure that something was wrong here. A quick search on the internet brought me to the following article, which suggested changing the 100-Continue Handling in the Layer 7 configuration of the Load Master: http://blog.masteringmsuc.com/2013/10/kemp-load-balancer-and-lync-unified.html

After changing the value from its default (RFC Conformant), I could now successfully complete the wizard and start a hybrid mailbox move. So the “workaround” was found. But I was wondering, why does the Load Master *think* that the request coming from Microsoft is non-RFC compliant?

The first thing I did was ask Microsoft if they could clarify what was happening. I soon got a reply that, from Microsoft’s point of view, they were respecting the RFC documentation regarding the 100 (Continue) status. No surprise here.

After reading the RFC specifications, I decided to take some network traces to find out what was happening and maybe understand how the 501 response was triggered. The first trace I took was one from the Load Master itself. In that trace, I could see the following:

[Screenshot: network trace taken on the Load Master]

Office 365 was making a call to Exchange Web Services and sending an Expect: 100-continue header with the request. According to the RFC, the on-premises Exchange server should now respond to that expectation. Instead, the trace shows that exactly 5 seconds go by in the SSL conversation, after which Office 365 makes another call to the EWS virtual directory without having received a response to its 100-continue request. At that point, the KEMP Load Master generated the “501 Invalid Request”.

I turned back to the (by the way, excellent) support guys from KEMP and explained my findings to them. Furthermore, when I tested without Layer 7 or even without a Load Master in between, there wasn’t a delay and everything worked as expected. So I knew for sure that the on-premises Exchange 2013 server was actually replying correctly to the 100-continue request. As a matter of fact, without the KEMP LM in between, the entire ‘conversation’ between Office 365 and Exchange 2013 on-premises followed the RFC rules perfectly.

So, changing the 100-continue settings from “RFC Conformant” to “Ignore Continue-100” made sense as now KEMP would just ignore the 100-continue “rules”. But I was still interested in finding out why the LM thought the conversation was not RFC conformant in the first place. And this is where it gets interesting. There is this particular statement in the RFC documentation:

“Because of the presence of older implementations, the protocol allows ambiguous situations in which a client may send “Expect: 100-continue” without receiving either a 417 (Expectation Failed) status or a 100 (Continue) status. Therefore, when a client sends this header field to an origin server (possibly via a proxy) from which it has never seen a 100 (Continue) status, the client SHOULD NOT wait for an indefinite period before sending the request body.”

That is exactly what is happening here. Office 365 (the client) sent an initial request with an Expect: 100-continue header and waited for a response. It waits for exactly 5 seconds and then sends the payload, regardless of whether it has received a response. In my opinion, this falls within the boundaries of the scenario described above. However, after talking to the KEMP guys, it seems they interpret the RFC slightly differently, which caused this mismatch and therefore the Load Master issuing the 501.
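To make the client side of this exchange concrete, here is a minimal sketch of the "wait a bounded time, then send the body anyway" behaviour. It is not how Office 365 is actually implemented. The function name `send_with_expect`, the host and path in the example request, and the return strings are all hypothetical; only the 5-second wait comes from the trace.

```python
import socket


def send_with_expect(sock, head, body, wait_seconds=5.0):
    """Send the request head, wait a bounded time for 100 (Continue), then send the body.

    `head` must already contain an "Expect: 100-continue" header.
    """
    sock.sendall(head)
    sock.settimeout(wait_seconds)
    try:
        interim = sock.recv(4096)
    except socket.timeout:
        interim = b""  # nothing arrived within the waiting period

    status_line = interim.split(b"\r\n", 1)[0]
    if interim and not status_line.startswith(b"HTTP/1.1 100"):
        # A final status (for example 417) arrived instead: do not send the body.
        return status_line.decode("latin-1")

    # Either 100 (Continue) was received or the wait expired; in both cases
    # the client goes ahead with the body instead of waiting indefinitely.
    sock.sendall(body)
    return "body sent after 100 Continue" if interim else "body sent after timeout"


if __name__ == "__main__":
    # Simulate a silent peer with a local socket pair (no real server involved).
    client, server = socket.socketpair()
    head = (
        b"POST /EWS/example HTTP/1.1\r\n"   # hypothetical path
        b"Host: mail.example.com\r\n"       # hypothetical host
        b"Expect: 100-continue\r\n"
        b"Content-Length: 5\r\n\r\n"
    )
    print(send_with_expect(client, head, b"hello", wait_seconds=0.5))
    client.close()
    server.close()
```

With a peer that never answers, the function falls through to the timeout branch and sends the body anyway, which matches what the trace shows Office 365 doing after 5 seconds.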

In the end, there is still something we haven’t worked out entirely: why the LM doesn’t send the 100 (Continue) status back to Office 365 even though it receives it almost instantaneously from the Exchange 2013 server.

All in all, the issue was resolved rather quickly and we know that changing the L7 configuration settings in the Load Master solves the issue (and this workaround was also confirmed as being the final solution by KEMP support, btw). Again, changing the 100-continue handling setting to “Ignore” doesn’t render the configuration (or the communication between Office 365 and Exchange on-premises) non-RFC compliant. So there’s no harm in changing it.

I hope you found this useful!

-Michael