Deep in the woods

Some time ago a colleague of mine was working on a problem with a license server we run and had asked me for help. He was trying to determine why some license checkouts to the server would be connect and checkout without incident and others would timeout. Some background; we dealt with a good amount of third party, close source code in order to run our customer workflows and licensed software was the norm for most of those workflows. To smooth everything out we stood up and maintained our own internal set of license servers which were basically invisible to our customers and while this made for a nice product to sell it also left us with the support burden of maintaining these code blobs.

Now my colleague had a habit not seeing the forest for getting lost in the trees and in this instance he had pinged me with a question about tcp header options. He had constructed a test scenario where he would emulate a busy server by using nc to connect to it while also using a third party tool to check out a license. In dumping packets he noticed a differences in TS val output between two captured tcp streams and had become convinced that the answer to the different behavior in the license server must be down to a difference in the traffic. It was not and if you’re curious the TS val is short for timestamp value.

Side effects in distributed systems

As it turns out the tcp was not to blame. I asked for a copy of the packet captures that my colleague was working with, dropped one into wireshark and immediately saw a tls connection wireshark Seeing this it was clear that there was no sense in chasing tcp packets, we needed to reason about the issue at the application layer. I asked him to replace his nc call with openssl s_client -connect ip:port which had an almost miraculous effect; licenses were now being checked out without issue.

There were differences though, nc would open and block in the terminal while openssl would not. The openssl terminated with a very helpful of verify error:num=19:self signed certificate in certificate chain. Fantastic confirmation that the license server is speaking tls and standards compliant tls to boot! Running the openssl connection again while disabling certificate checking reviled one more element; the server was expecting a valid client certificate. To recap; we observed that opening a plain tcp connection to the server causes license checkouts to fail, opening a tls connection fails and license checkouts continue as they should, and the license server is expecting clients to provide certificates.

On a hunch I ran the server binary through the strings tool and discovered that the openssl had been linked in. At this point I surmised that openssl was responsible for all connection handling and that there was some sort of blocking in there. After digging through the openssl docs I found the SSL_accept function with the following note.

NOTES
The behaviour of SSL_accept() depends on the underlying BIO.
If the underlying BIO is blocking, SSL_accept() will only return once the handshake has been finished or an error occurred.

In the case of our nc call the handshake was never failing because it was never starting. A shared network socket was blocking on a connection which would never succeed.

The actual problem

So, why was nc even used in testing at all? Well, given the importance of these license servers someone had the idea that we should have a health check for them and that health check looked like this

try:
  s = socket.create_connection(address, timeout=2)
  s.close()
  return true
except socket.error:
  return false

Initiate a connection and then immediately close it. We were investigating a denial of service directly caused by our own health check. The fix for this issue ended up giving code like

try:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        sock.connect((address[0], address[1]))
        try:
            context = ssl.create_default_context()
            context.check_hostname = False
            context.verify_mode = ssl.CERT_NONE
            with context.wrap_socket(sock):
                return True
        except (ssl.SSLError, OSError):
            return True
except OSError:
    return False

Admittedly this is pretty convoluted logic. A tls connection is attempted with no certificate checking and regardless of success or failure we return true. Only in the event that the underlying socket fails do we return false. In particular why do we return true if a TLS connection succeeds? Wouldn’t that be a violation of the client certificate model discussed above? Yes it would, however this code is also used on license servers which don’t have the client cert restriction.

Conclusion

It seems that whoever wrote the original health check code didn’t consider that their code may have had side effects on the servers that it was checking. I think the real problem though is that we had no metrics watching checkouts and we couldn’t see the effect of the health check. Once this health check was added it became trusted and lead support engineers down false paths. The impact of this bug resulted in over a year of degraded service. So, validate your assumptions. Stop DOSing yourself.

Thanks for reading