Lately, there has been a lot of discussion about SLAs, SLOs, and SLIs. As this article states, it is hard to define the correct SLOs and SLIs. That discussion is about which parts of your services you want to monitor. But it is just as difficult to measure these correctly. In this blog post I take a look at two examples of what can (and, for us, did) go wrong in monitoring: not what you monitor, but how you monitor your services.
Example: TCP connections for monitored services
The first example is about TCP connections, a proxy, and handshakes.
Expectation vs. reality
For one of our projects we use an authentication proxy that talks to an LDAP server as its backend. We noticed connections piling up on the server hosting this proxy. At first, it was not clear what was causing them.
After the proxy was installed, I integrated it into our Zabbix monitoring. To verify that the proxy answers requests, I used Zabbix's built-in check
net.tcp.connect. At first all seemed fine: the check was doing exactly what I expected.
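Conceptually, a connect-only check like this does nothing more than complete the TCP handshake and close the socket again, without ever sending a payload. A minimal sketch of the idea in Python (host and port are placeholders, not our real setup):

```python
import socket

def tcp_connect_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Like a connect-only service check, this sends no payload: the socket
    is closed again right after the handshake succeeds.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # handshake done; close immediately without sending data
        return True
    except OSError:
        return False
```

From the monitored service's point of view, such a check is a connection that opens and closes in the same instant, which is exactly what matters in the rest of this story.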
But after a while we saw connections to the backend piling up on the server running the proxy. Since no one was using the proxy for authentication at that point, I suspected Zabbix was causing the vast number of connections. The monitoring of the service wasn't working as expected. But what exactly was happening?
Each time Zabbix initiated the check, it performed a regular three-way TCP handshake and afterwards tore down the connection, which is exactly what a tcpdump trace confirmed. That was expected, so why were there so many connections still left on the system?
The proxy responded correctly, so the Zabbix check reported that everything was fine. But what happened on the connection from the proxy to the backend system?
It turned out that the proxy was starting a TLS connection to the backend for every incoming TCP connection. It did not matter to the proxy that no data was sent. Even so, the TLS connection to the backend should not have been a problem: it should have been torn down when the TCP connection from Zabbix to the proxy ended. That is the theory. In reality, the TLS connection persisted even after a correct TCP teardown.
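The behaviour can be reproduced with a toy proxy that eagerly dials its backend for every accepted client, whether or not the client ever sends data. A simplified sketch; plain TCP stands in for the real TLS connection, and all hosts, ports, and names here are made up:

```python
import socket
import threading

def relay(client: socket.socket, backend: socket.socket) -> None:
    """Copy bytes from client to backend, closing both at EOF."""
    with client, backend:
        while data := client.recv(4096):
            backend.sendall(data)

def toy_proxy(listen_port: int, backend_host: str, backend_port: int) -> None:
    """Accept clients and eagerly open one backend connection per client.

    This mirrors the observed proxy behaviour: the backend connection is
    created as soon as the client connects, before any data arrives.
    """
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", listen_port))
    srv.listen(5)
    while True:
        client, _ = srv.accept()
        # The backend is dialled unconditionally, even for a client
        # that connects and immediately disappears again.
        backend = socket.create_connection((backend_host, backend_port))
        threading.Thread(target=relay, args=(client, backend), daemon=True).start()
```

A connect-only check against such a proxy triggers a backend connection every single time, even though not a single byte of payload is ever exchanged.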
The Swiss army knife of networking
So, I had found the connections piling up on the proxy, but I still did not know what the real problem was. I tried to get a more precise view by connecting to the proxy manually with netcat:
nc -v backend.example.com 8636
But the problem did not appear. Each time I opened a connection to the proxy with netcat, the proxy started a TLS connection to the backend. When I closed netcat, the proxy tore down the TLS connection to the backend. No connections piled up on the proxy. What was different? After some more testing and man-page reading, I managed to reproduce the Zabbix behaviour with netcat:
nc -z -v backend.example.com 8636
The parameter that did the trick was
-z. It instructs netcat to close the connection right after a successful connect:
-z Specifies that nc should just scan for listening daemons,
without sending any data to them. It is an error to use
this option in conjunction with the -l option.
So, it was not a problem specific to Zabbix; it seemed to be the proxy. During the tests with netcat, I observed that the problem did not appear when I used netcat in interactive mode.
Perhaps everything was just a timing problem?
Netcat offers another handy parameter for these tests:
-w timeout
Connections which cannot be established or are idle
timeout after timeout seconds. The -w flag has no
effect on the -l option, i.e. nc will listen forever
for a connection, with or without the -w flag.
The default is no timeout.
So, I tried it again with
nc -w 1 -v control01.baremetal 8636 and it turned out that it works.
I did some more tests with this parameter and it worked without leftover connections. Taking a closer look at the tcpdump traces: the TLS connection is not torn down when the initiating TCP connection to the proxy ends before the TLS handshake has finished. As soon as the TCP teardown starts after the TLS handshake is done, the TLS connection also ends as expected.
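What the -w 1 workaround effectively changes is the moment the teardown starts: the connection stays idle for a moment instead of being closed in the same breath as the connect. The difference can be sketched like this (hypothetical hosts, and a grace period in the spirit of netcat's -w 1):

```python
import socket
import time

def connect_and_linger(host: str, port: int, grace: float = 1.0) -> None:
    """Connect, keep the connection idle for `grace` seconds, then close.

    The pause gives the proxy time to finish its TLS handshake with the
    backend before our teardown starts, so the backend connection is
    cleaned up together with ours.
    """
    with socket.create_connection((host, port), timeout=3.0):
        time.sleep(grace)  # analogous to netcat's -w 1 idle timeout
    # leaving the with-block closes the socket: a normal TCP teardown
```

One second is generous here; the point is simply that the client-side close must not race the proxy's backend handshake.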
Monitoring the service
So, I used this netcat command to create a new Zabbix check with the slowed-down TCP disconnect. It is not a perfect solution, but it works fine for my situation.
To be complete: implementing the check in Zabbix did not work without problems. In short, it turned out that Zabbix also needs the parameter
-d. Otherwise, netcat does something odd with stdin and the parameter
-w 1 has no effect.
Example: State of a monitored service
This is another example from an application we monitor. There, we monitored the availability and response time of an HTTP endpoint. The first approach was a simple HTTP GET, which showed these response times:
As you can see in the graph above, the response time grew the more we queried the endpoint.
As it turned out, the application kept state associated with the endpoint. Surprise: not everything is stateless. This state grew bigger each time we queried the endpoint, so it took the application longer and longer to process our requests. And the session timeout was too long for the session to be discarded between monitoring queries.
The application had to be modified so that it does not create a session for the endpoints used for monitoring. This is just one example; the same can happen with disk space, memory, or CPU consumption.
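The effect is easy to reproduce: if every monitoring request creates a session that outlives the polling interval, each request has more state to process than the one before. A contrived sketch; the session handling here is invented purely for illustration:

```python
import time

class StatefulEndpoint:
    """Toy endpoint that creates a session per request and never expires
    them between polls, so every request has more state to walk."""

    def __init__(self):
        self.sessions = []

    def handle_health_check(self):
        """Create a new session, touch every existing one, and return the
        amount of work done, a stand-in for response time."""
        self.sessions.append({"created": time.time()})
        work = 0
        for _session in self.sessions:  # cost grows with the session count
            work += 1
        return work

endpoint = StatefulEndpoint()
costs = [endpoint.handle_health_check() for _ in range(5)]
# every poll does more work than the one before it
```

The monitoring itself is the load: each check leaves a little more state behind, and the response-time graph climbs accordingly.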
Not only is it hard to define SLOs/SLIs and the right measurements from a user's perspective. As these examples show, it is also hard to monitor services correctly without your measurements impacting the chosen SLOs. It is crucial to know not only what service to monitor, but also how to monitor it. The observer effect does not only apply to quantum physics.