The how of monitoring your services

17.11.2020 | 5 minutes of reading time

Lately, there has been a lot of discussion about SLAs, SLOs and SLIs. As this article states, it is hard to define the correct SLOs and SLIs. That discussion is about which parts of your services you want to monitor. But it is also difficult to measure them correctly. In this blog post I look at two examples of what can (and for us, did) go wrong in monitoring. This post is about how you monitor your services.

Example: TCP connections for monitored services

The first example will be about TCP connections, a proxy, and handshakes.

Expectation vs. reality

For one of our projects we use an authentication proxy which talks to an LDAP server as its backend. We came across connections piling up on the server hosting this proxy. At first, it was not clear what was causing these connections.

After the proxy was installed, I integrated it into our Zabbix monitoring. To verify the proxy is answering requests, I used Zabbix’ built-in check net.tcp.connect. At first all seemed fine. The check was doing exactly what I expected.

But after a while we saw connections to the backend piling up on the server running the proxy. As no one was using the proxy for authentication at that point, I suspected that Zabbix was causing the vast number of connections. The monitoring of the service wasn’t working as expected. But what exactly was happening?

Each time Zabbix initiated the check, it was doing a three-way TCP handshake …

[Figure: TCP three-way handshake. Source: https://www.cs.purdue.edu/homes/park/cs536-e2e-3.pdf]

… and after that tore down the connection:

[Figure: TCP connection teardown. Source: https://www.cs.purdue.edu/homes/park/cs536-e2e-3.pdf]

In tcpdump, it looks like this:

[Figure: tcpdump trace of the handshake and teardown]

That was expected, so why were there so many connections still left on the system?

The proxy responded correctly, so the Zabbix check reported that everything was fine. But what happened on the connection from the proxy to the backend system?

It turned out that the proxy started a TLS connection to the backend for every incoming TCP connection. It did not matter to the proxy that no data was sent. That TLS connection to the backend should not have been a problem either: it should have been torn down when the TCP connection from Zabbix to the proxy ended. But that is theory. In reality, the TLS connection persisted even after the correct TCP teardown:

[Figure: tcpdump trace showing the TLS connection persisting after the TCP teardown]

The Swiss army knife of networking

So, I found the connections piling up on the proxy. But I still did not know what the real problem was. I tried to get a more precise view by connecting to the proxy manually with netcat: nc -v backend.example.com 8636

But nothing unusual happened. Each time I opened a connection to the proxy with netcat, it started a TLS connection to the backend. When I closed netcat, the proxy tore down the TLS connection to the backend. No connections piled up on the proxy. What was different? After some more testing and man page reading I managed to reproduce the Zabbix behaviour with netcat: nc -z -v backend.example.com 8636

The parameter that did the trick was -z. It instructs netcat to close the connection after a successful connect:


-z      Specifies that nc should just scan for listening daemons, 
        without sending any data to them. It is an error to use 
        this option in conjunction with the -l option.
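What nc -z does here (connect, send nothing, close at once) can be approximated in a few lines of Python. This is only a minimal sketch using the standard library; the function name tcp_connect_check is made up for illustration, and it does not claim to mirror how Zabbix or netcat implement their checks internally.

```python
import socket


def tcp_connect_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection can be established, then close it at once."""
    try:
        # create_connection performs the three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            # No data is sent; leaving the block closes the socket immediately.
            return True
    except OSError:
        # Refused, unreachable, or timed out: the service counts as down.
        return False
```

Calling tcp_connect_check("backend.example.com", 8636) would run the same kind of probe against the host used in the netcat commands above. From the point of view of the service being probed, the connection disappears as soon as it has been established.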

So the problem is not specific to Zabbix; it seems to lie in the proxy. During the tests with netcat I observed that the problem didn’t appear when I used netcat in interactive mode.

Perhaps everything is a timing problem?

Netcat offers another handy parameter for these tests:


-w timeout
             Connections which cannot be established or are idle 
             timeout after timeout seconds. The -w flag has no 
             effect on the -l option, i.e. nc will listen forever
             for a connection, with or without the -w flag. 
             The default is no timeout.

So, I tried it again with nc -w 1 -v control01.baremetal 8636 and it turned out that it works.

I did some more tests with this parameter and it worked without leftovers. Taking a closer look at the tcpdump traces, I saw that the TLS connection is not torn down when the initiating TCP connection to the proxy ends before the TLS handshake has finished. When the TCP teardown sequence starts after the TLS handshake is done, the TLS connection also ends as expected (tcpdump view):

[Figure: tcpdump trace showing the TLS connection being torn down after the TCP teardown]
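The timing dependency is easier to see in a toy model. The following Python sketch only replays the behaviour I observed from the outside; the class ToyProxy and its method names are invented for illustration and say nothing about how the real proxy is implemented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ToyProxy:
    """Toy model of the observed behaviour; not the real proxy's code."""
    handshake_done: bool = False
    backend_open: bool = False

    def client_connects(self) -> None:
        # Every incoming TCP connection triggers a TLS connection to the backend.
        self.backend_open = True

    def handshake_finishes(self) -> None:
        self.handshake_done = True

    def client_disconnects(self) -> None:
        if self.handshake_done:
            # Teardown after the handshake: the backend connection is closed too.
            self.backend_open = False
        # Teardown during the handshake: nothing is closed, so the backend
        # connection stays behind once the handshake completes.


def leftover_after(order: Iterable[str]) -> bool:
    """Replay the events in the given order and report whether a backend connection is left over."""
    proxy = ToyProxy()
    proxy.client_connects()
    for step in order:
        getattr(proxy, step)()
    return proxy.backend_open
```

In this model, leftover_after(["client_disconnects", "handshake_finishes"]) returns True (the connect-and-close case), while leftover_after(["handshake_finishes", "client_disconnects"]) returns False (the case with a delayed disconnect).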

Monitoring the service

So, I used the netcat command to create a new Zabbix check with the slowed-down TCP disconnect. It is not the perfect solution, but it works fine for my situation.
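The same idea, a probe that keeps the idle connection open for a moment before closing it, could also be written without netcat. The Python sketch below is an assumption about how such a check might look, not the check I actually deployed; the function name tcp_check_with_delay and the parameter hold_seconds are hypothetical, with hold_seconds roughly playing the role of -w 1.

```python
import socket
import time


def tcp_check_with_delay(host: str, port: int, hold_seconds: float = 1.0,
                         timeout: float = 3.0) -> bool:
    """Connect, keep the idle connection open for hold_seconds, then close it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # Give the peer time to finish its own work (here: the proxy's
            # TLS handshake to the backend) before the TCP teardown starts.
            time.sleep(hold_seconds)
            return True
    except OSError:
        # Refused, unreachable, or timed out: the service counts as down.
        return False
```

The price of this approach is that every check takes at least hold_seconds, and the right delay depends on how long the handshake to the backend takes in a given setup.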

For completeness: implementing the check in Zabbix did not go without problems. In short, it turned out that netcat also needs the parameter -d (do not read from stdin) when it is run by Zabbix. Otherwise, it does something weird with stdin and the parameter -w 1 has no effect.

Example: State of a monitored service

This is another example from an application we monitor. We monitored the availability and response time of an HTTP endpoint. The first approach was a simple HTTP GET, which showed these response times:

[Figure: response times of the HTTP GET check over time]

As you can see in the graph above, the response time grew the more we queried the endpoint.

As it turned out, the application held state associated with the endpoint. Surprise: not everything is stateless. This state grew bigger and bigger each time we queried the endpoint. Therefore, it took the application longer and longer to process our requests. The session timeout was too long for the session to be discarded between the monitoring queries.

The application had to be modified so that it no longer creates a session for the endpoints used for monitoring. This is just one example; the same can happen with disk space, memory, or CPU consumption.
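The effect can be reproduced with a toy model. The Python sketch below is purely illustrative: the class ToyApp and the flag create_session are made up, and the "cost" is just a count of state entries, not a measurement from our application.

```python
class ToyApp:
    """Toy application whose session state grows with every request."""

    def __init__(self) -> None:
        self.session = []

    def handle(self, create_session: bool = True) -> int:
        """Handle one request and return the number of state entries processed."""
        if create_session:
            # Each request adds to the session state, which is never discarded
            # because the next monitoring request arrives before the timeout.
            self.session.append(len(self.session))
        # Model the work per request as one pass over the accumulated state.
        return sum(1 for _ in self.session)


# Monitoring requests that create session state: the cost grows with each request.
stateful = ToyApp()
stateful_costs = [stateful.handle() for _ in range(5)]

# Monitoring requests that skip the session: the cost stays flat.
stateless = ToyApp()
stateless_costs = [stateless.handle(create_session=False) for _ in range(5)]
```

Here stateful_costs ends up as [1, 2, 3, 4, 5] while stateless_costs stays at [0, 0, 0, 0, 0], which is the shape of the problem: the monitoring itself makes each following measurement slower.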

Conclusion

It is not only hard to define SLOs/SLIs and the correct measures from a user’s perspective. As these examples show, it is also hard to monitor services correctly without your measurement affecting the selected SLOs. It’s crucial to know not only which service to monitor, but also how to monitor it. The observer effect does not apply only to quantum physics.

Blog author

Christian Zunker
