An OpenStack Crime Story solved by tcpdump, sysdig, and iostat – Episode 3

Previously on OpenStack Crime Investigation … Two load balancers running as virtual machines in our OpenStack-based cloud and sharing a keepalived-based highly available IP address started to flap, switching the IP address back and forth. After ruling out a misconfiguration of keepalived and issues in the virtual network, I finally got the hint that the problem might originate not in the virtual, but in the bare metal world of our cloud. Maybe high IO was causing the gap between the VRRP keepalive packets.

When I arrived at bare metal host node01, hosting virtual machine loadbalancer01, I was anxious to see the IO statistics. The machine had to be under heavy IO load if the virtual machine's messages were waiting for up to five seconds.

I switched on my iostat flashlight and saw this:

$ iostat
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              12.00         0.00       144.00          0        144
sdc               0.00         0.00         0.00          0          0
sdb               6.00         0.00        24.00          0         24
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
sdf              20.00         0.00       118.00          0        118
sdg               0.00         0.00         0.00          0          0
sdi              22.00         0.00       112.00          0        112
sdh               0.00         0.00         0.00          0          0
sdj               0.00         0.00         0.00          0          0
sdk              21.00         0.00        96.50          0         96
sdl               0.00         0.00         0.00          0          0
sdm               9.00         0.00        64.00          0         64

Nothing? Nothing at all? No IO on the disks? Maybe my bigger flashlight iotop could help:

$ iotop

Unfortunately, what I saw was too ghastly to show here, and therefore I decided to omit the iotop screenshots [1]. It was pure horror. Six qemu processes eating the physical CPUs alive in IO.
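
In case you want to shine that flashlight yourself, here is a sketch of the invocation I would use (batch mode, only processes that are actually doing IO, a handful of samples written to a file; check the flags against your iotop version):

# hedged sketch: record five samples of IO-active processes non-interactively
$ sudo iotop -b -o -t -q -n 5 > iotop-evidence.log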

So, no disk IO, but super high IO caused by qemu. It had to be network IO then. But all performance counters showed almost no network activity. What if this IO wasn't real, but virtual? It could be the virtual network driver! It had to be the virtual network driver.

I checked the OpenStack configuration. It was set to use the para-virtualized network driver vhost_net.

I checked the running qemu processes. They were also configured to use the para-virtualized network driver.

ps aux | grep qemu
libvirt+  6875 66.4  8.3 63752992 11063572 ?   Sl   Sep05 4781:47 /usr/bin/qemu-system-x86_64
 -name instance-000000dd -S ... -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 ...

I was getting closer! I checked the kernel modules. Kernel module vhost_net was loaded and active.

$ lsmod | grep net
vhost_net              18104  2
vhost                  29009  1 vhost_net
macvtap                18255  1 vhost_net

I checked the qemu-kvm configuration and froze.

$ cat /etc/default/qemu-kvm
# To disable qemu-kvm's page merging feature, set KSM_ENABLED=0 and
# sudo restart qemu-kvm
KSM_ENABLED=1
SLEEP_MILLISECS=200
# To load the vhost_net module, which in some cases can speed up
# network performance, set VHOST_NET_ENABLED to 1.
VHOST_NET_ENABLED=0
 
# Set this to 1 if you want hugepages to be available to kvm under
# /run/hugepages/kvm
KVM_HUGEPAGES=0

vhost_net was disabled by default for qemu-kvm. We were running all packets through userspace and qemu instead of passing them to the kernel directly as vhost_net does! That’s where the lag was coming from!
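
A quick counter-check, sketched from memory rather than from my notes: when vhost is really in effect, the kernel runs vhost worker threads named after the owning qemu process; when qemu copies the packets in userspace, those threads are simply absent.

# kernel threads show up bracketed in ps; expect names like vhost-<qemu-pid>
$ ps -ef | grep '\[vhost-'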

I acted immediately to rescue the victims. I made the huge, extremely complicated, full one-byte change on all our compute nodes by turning VHOST_NET_ENABLED=0 into VHOST_NET_ENABLED=1, restarted all virtual machines, and finally, after days of constant screaming in pain, the flapping between the two load balancers stopped.
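
For the record, on a single compute node the cure boils down to something like this (a sketch assuming the stock Ubuntu 14.04 layout; rolling it out to all compute nodes is left to your configuration management of choice):

# flip the one decisive byte
$ sudo sed -i 's/^VHOST_NET_ENABLED=0/VHOST_NET_ENABLED=1/' /etc/default/qemu-kvm
# restart the qemu-kvm job so it picks up the change, then restart the guests
$ sudo restart qemu-kvm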

I did it! I saved them!

But I couldn’t stop here. I wanted to find out who had done this to the poor little load balancers. Who was behind this conspiracy of crippled network latency?

I knew there was only one way to finally catch the guy. I set a trap. I installed a fresh, clean, virgin Ubuntu 14.04 in a virtual machine and then, well, then I waited — for apt-get install qemu-kvm to finish:

sudo apt-get install qemu-kvm
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  acl cpu-checker ipxe-qemu libaio1 libasound2 libasound2-data libasyncns0
  libbluetooth3 libboost-system1.54.0 libboost-thread1.54.0 libbrlapi0.6
  libcaca0 libfdt1 libflac8 libjpeg-turbo8 libjpeg8 libnspr4 libnss3
  libnss3-nssdb libogg0 libpulse0 librados2 librbd1 libsdl1.2debian
  libseccomp2 libsndfile1 libspice-server1 libusbredirparser1 libvorbis0a
  libvorbisenc2 libxen-4.4 libxenstore3.0 libyajl2 msr-tools qemu-keymaps
  qemu-system-common qemu-system-x86 qemu-utils seabios sharutils
Suggested packages:
  libasound2-plugins alsa-utils pulseaudio samba vde2 sgabios debootstrap
  bsd-mailx mailx
The following NEW packages will be installed:
  acl cpu-checker ipxe-qemu libaio1 libasound2 libasound2-data libasyncns0
  libbluetooth3 libboost-system1.54.0 libboost-thread1.54.0 libbrlapi0.6
  libcaca0 libfdt1 libflac8 libjpeg-turbo8 libjpeg8 libnspr4 libnss3
  libnss3-nssdb libogg0 libpulse0 librados2 librbd1 libsdl1.2debian
  libseccomp2 libsndfile1 libspice-server1 libusbredirparser1 libvorbis0a
  libvorbisenc2 libxen-4.4 libxenstore3.0 libyajl2 msr-tools qemu-keymaps
  qemu-kvm qemu-system-common qemu-system-x86 qemu-utils seabios sharutils
0 upgraded, 41 newly installed, 0 to remove and 2 not upgraded.
Need to get 3631 kB/8671 kB of archives.
After this operation, 42.0 MB of additional disk space will be used.
Do you want to continue? [Y/n]
...
Setting up qemu-system-x86 (2.0.0+dfsg-2ubuntu1.3) ...
qemu-kvm start/running
Setting up qemu-utils (2.0.0+dfsg-2ubuntu1.3) ...
Processing triggers for ureadahead (0.100.0-16) ...
Setting up qemu-kvm (2.0.0+dfsg-2ubuntu1.3) ...
Processing triggers for libc-bin (2.19-0ubuntu6.3) ...

And then, I let the trap snap:

$ cat /etc/default/qemu-kvm
# To disable qemu-kvm's page merging feature, set KSM_ENABLED=0 and
# sudo restart qemu-kvm
KSM_ENABLED=1
SLEEP_MILLISECS=200
# To load the vhost_net module, which in some cases can speed up
# network performance, set VHOST_NET_ENABLED to 1.
VHOST_NET_ENABLED=0
 
# Set this to 1 if you want hugepages to be available to kvm under
# /run/hugepages/kvm
KVM_HUGEPAGES=0

I could not believe it! It was Ubuntu’s own default setting. Ubuntu, the very foundation of our cloud, had decided to turn vhost_net off by default, despite all modern hardware supporting it. Ubuntu was convicted, and I could finally rest.


This is the end of my detective story. I found and arrested the criminal Ubuntu default setting and was able to prevent it from further crippling our virtual network latencies.

Please feel free to leave comments and ask questions about details of my journey. I’m already negotiating to sell the movie rights. But maybe there will be another season of OpenStack Crime Investigation in the future. So stay tuned to the codecentric Blog.

Footnotes

1. Eh, and because I lost them.

Comments

  • A Dude

    Thank you for an entertaining series of articles. Perhaps you could find someone to do an “artist’s impression” of the missing iotop screenshots?

    • Lukas Pustina

      Hi Dude,
      I’m glad you enjoyed it. As you correctly assumed, I’m not an artist and can’t get hold of one right now 🙂

      Anyway, if you start iotop yourself, just imagine that there are qemu-kvm processes using more than 100% of the CPUs. There’s nothing more to it.

  • Felix

    How did changing VHOST_NET_ENABLED affect the system given that vhost_net was already loaded and active before?
    All VHOST_NET_ENABLED does is to tell the upstart job whether to load the kernel module or not.

  • Simon

    Hello Lukas! Thanks for the detailed story, that was fun to read. I’ve got a few questions for you:
    – Have you submitted a bug to Launchpad to switch VHOST_NET_ENABLED from 0 to 1 by default? If confirmed, it looks like a problem that affects many people…
    – I’m a bit confused by the fact that the vhost_net module was already loaded and that your qemu command line contains “vhost=on,vhostfd=27”. IIUC this indicates that it should be using vhost_net [1]. What am I missing here?

    [1] https://www.redhat.com/archives/libvir-list/2010-May/msg00668.html

  • Lukas Pustina

    @Simon and @Felix

    You guys are right about the effects of changing the default setting. Unfortunately, my notes do not show any hints explaining your observation, so I cannot give you a definitive answer. I cannot rule out that, in the process of analyzing the problem, I loaded the modules manually. In the end, changing the setting, rebooting the host, and restarting all virtual machines did fix my problem.

  • Konstantin

    Hi Lukas,

    very interesting topic, thanks for sharing it!
    I have a couple of questions and I hope you can help me here:

    1. How did you configure your network setup (Neutron, I believe) for the load balancers? The default Neutron behavior is to deny all kinds of IP spoofing, which means it doesn’t allow any packets to leave an instance if the source IP and/or MAC address differs from the address that was assigned to the instance by OpenStack. Or do you simply use source NAT and use the load balancer’s IP instead?

    2. What is the maximum network throughput you can reach with the vhost setup now? I mean the throughput you see when you originate the traffic from external networks. In my case I see the bottleneck at the vhost process: when I push traffic (1-2 Gbps) to my instance, I see in the top output that the vhost processes are taking 50-60% of the CPU time (and actually from the CPU which my load balancer should use).

    • Lukas Pustina

      Hello Konstantin, you’re welcome, thanks for reading 🙂

      1. You’re right, Neutron does not permit IP spoofing. What you can do, though, is use “Allowed Address Pairs”, which effectively allows you to use VRRP (a CLI sketch follows after point 2). Please see Implementing High Availability Instances with Neutron using VRRP for a comprehensive introduction.

      2. We did make network throughput measurements using iperf between VMs on the same as well as on different compute nodes. In the first case we achieved 10 Gbit/s and in the second case we saturated our 2 Gbit/s bonded port channels. Unfortunately, we did not check the CPU utilization.
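
      For readers who want to reproduce both points, a rough sketch (port ID, VIP, and peer address are placeholders, and the allowed-address-pairs syntax is that of the neutron CLI of that era, so please double-check it against your client version):

      # point 1: allow the shared VRRP address on a load balancer's Neutron port
      $ neutron port-update <PORT_ID> --allowed-address-pairs type=dict list=true ip_address=<SHARED_VIP>
      # point 2: a simple throughput measurement between two VMs
      $ iperf -s                      # on the first VM
      $ iperf -c <FIRST_VM_IP> -t 30  # on the second VM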

      I hope this helps. In case you have any further questions, do not hesitate to contact me.
