True KVM Live Migration with OpenStack Icehouse and Ceph based VM storage

Intro

As mentioned before (for example in Fabian’s post The CenterDevice Cloud Architecture Revisited from December 2014), our document management product CenterDevice runs on top of infrastructure virtualized by OpenStack.
Where that older post was more application focused, this one covers a particularly nasty problem that plagued us for some time: being unable to migrate virtual machines from one bare metal hypervisor host to another without interruption. By the end of this article you will see how we overcame a series of obstacles on the way to successful live migrations in OpenStack Icehouse for KVM virtual machines using Ceph/Rados Block Device based volumes for data storage.

System setup

At the time of writing our cluster sports 12 bare metal servers. 6 of these are dedicated OpenStack compute nodes, with 4 more serving as Ceph storage cluster nodes. The remaining 2 are OpenStack controllers.
All storage is provided to virtual machines as OpenStack Cinder volumes backed by Ceph virtual block devices. One of the main reasons for this setup is that it allows for easy migration of virtual machines from one physical host to the next without also having to bring along large amounts of storage across the network.

Migrating Virtual Machines

OpenStack by default enables “regular” migrations, i.e. migrations where a virtual machine needs to be shut down and then booted again on another host. This incurs a service interruption inside the virtual machine. Ideally you would want to be able to seamlessly move the VM across physical servers without the OS and software inside it even noticing. Depending on the hypervisor type and the surrounding setup this is generally feasible.

Such a live migration works as follows: With the instance to be migrated (the source VM) still running, its memory content is sent to the destination host. The source hypervisor keeps track of which memory pages are modified on the source while the transfer is in progress. Once the initial bulk transfer is complete, pages changed in the meantime are transferred again. This is done repeatedly with (ideally) ever smaller increments.

As long as the differences can be transferred faster than the source VM dirties memory pages, the remaining delta eventually becomes small enough for the source VM to be suspended. The final differences are then sent to the target host and an identical machine is started there. At the same time the virtual network infrastructure takes care of directing all traffic to the new virtual machine. Once the replacement machine is running, the suspended source instance is deleted. Usually the actual handover takes place so quickly and seamlessly that only very time-sensitive applications ever notice anything.
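
The convergence behaviour described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions (constant dirty rate, constant bandwidth, a hypothetical tolerated pause of 0.3 seconds); it is not how QEMU/KVM actually decides when to suspend the guest, and all names and numbers are made up for illustration.

```python
def simulate_precopy(ram_mb, dirty_rate_mb_s, bandwidth_mb_s,
                     max_downtime_s=0.3, max_rounds=30):
    """Toy model of iterative pre-copy.

    Returns (rounds, remaining_mb, converged). All parameters are
    illustrative assumptions, not values taken from QEMU/KVM.
    """
    # Data that fits into the tolerated pause can be sent while the VM
    # is suspended, so the loop stops once the delta is that small.
    threshold_mb = bandwidth_mb_s * max_downtime_s
    remaining_mb = float(ram_mb)  # round 1 sends the whole memory
    rounds = 0
    while remaining_mb > threshold_mb and rounds < max_rounds:
        transfer_time_s = remaining_mb / bandwidth_mb_s
        # Pages dirtied during this round have to be sent again,
        # but a guest cannot dirty more memory than it has.
        remaining_mb = min(dirty_rate_mb_s * transfer_time_s, float(ram_mb))
        rounds += 1
    return rounds, remaining_mb, remaining_mb <= threshold_mb


if __name__ == "__main__":
    # Slowly changing guest: the delta shrinks and the migration converges.
    print(simulate_precopy(ram_mb=4096, dirty_rate_mb_s=50, bandwidth_mb_s=1000))
    # Guest dirties memory faster than the network can carry it: no convergence.
    print(simulate_precopy(ram_mb=4096, dirty_rate_mb_s=1200, bandwidth_mb_s=1000))
```

The second call shows the failure mode: when the guest dirties memory faster than the link can transfer it, the delta never shrinks and the model gives up after `max_rounds`.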

Since only memory is transferred, a prerequisite for this kind of live migration is shared storage, for which we use Ceph. OpenStack supports this, but you need to enable “true live migration” as described in the OpenStack Admin Guide. It boils down to adding the following to the /etc/nova/nova.conf file:

live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED

Sounds easy enough, so where’s the catch?

Problem #1

With our cluster set up as described above, this is what happened when I tried to live-migrate a VM from one host to the next:

[daniel.schneller@control01]➜ nova list  
+--------------+--------+--------+------------+-------------+---------------------+  
| ID           | Name   | Status | Task State | Power State | Networks            |  
+--------------+--------+--------+------------+-------------+---------------------+  
| a1564ec8-... | dstest | ACTIVE | -          | Running     | testnet=192.168.1.2 |  
+--------------+--------+--------+------------+-------------+---------------------+ 
 
[daniel.schneller@control01]➜ nova live-migration dstest node10
 
[daniel.schneller@control01]➜ tail -n20 /var/log/nova/nova-compute.log  
Live migration of instance a1564ec8-... to host node10 failed  
Traceback (most recent call last):  
  File "/usr/lib/python2.7/dist-packages/nova/api/openstack/compute/contrib/admin_actions.py", line 282, in _migrate_live  
    disk_over_commit, host)  
  File "/usr/lib/python2.7/dist-packages/nova/compute/api.py", line 94, in inner  
    return f(self, context, instance, *args, **kw)  
  File "/usr/lib/python2.7/dist-packages/nova/compute/api.py", line 1960, in live_migrate  
    disk_over_commit, instance, host)  
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 96, in live_migration  
    dest=dest))  
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 80, in call  
    return rpc.call(context, self._get_topic(topic), msg, timeout)  
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/__init__.py", line 102, in call  
    return _get_impl().call(cfg.CONF, context, topic, msg, timeout)  
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 712, in call  
    rpc_amqp.get_connection_pool(conf, Connection))  
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 368, in call  
    rv = list(rv)  
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 336, in __iter__  
    raise result  
RemoteError: Remote error: InvalidCPUInfo_Remote Unacceptable CPU info: CPU doesn't have compatibility.

Notice the last line. Apparently there is some difference between CPUs. So let us see what kinds of CPUs the hypervisors have (some lines removed for brevity). First the source host the virtual machine lives on at the moment:

[daniel.schneller@node05]➜ cat /proc/cpuinfo  
vendor_id       : GenuineIntel  
cpu family      : 6  
model           : 45  
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz  
stepping        : 7  
cpuid level     : 13  
flags           : fpu … (many more)

Then its designated new home:

[daniel.schneller@node10]➜ cat /proc/cpuinfo  
vendor_id       : GenuineIntel  
cpu family      : 6  
model           : 44  
model name      : Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz  
stepping        : 2  
cpuid level     : 11  
flags           : fpu … (not as many as above)

The source host CPU is of more recent vintage. Unless configured otherwise, KVM will map the underlying host CPU’s features into any virtual machine that gets started on it. This is good for performance, because the guest OS can better leverage the hardware’s power. But as a downside, for a live migration you can only use hosts that have identical or even more capable CPUs as a migration target; otherwise the guest operating system — not knowing the hardware was “hot swapped” underneath — might try to access features not present on the new host, leading to crashes. For regular migrations this is not a problem, because it involves rebooting the guest.
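
The "identical or more capable" rule can be pictured as a subset test on the CPU flags. The following sketch parses the `flags` line of `/proc/cpuinfo` text and lists what the target lacks. The helper names are hypothetical, and the two sample flag lists are heavily abbreviated stand-ins, not the real output of the hosts shown above.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_flags, target_flags):
    """Flags a guest may rely on that the target CPU does not offer."""
    return sorted(source_flags - target_flags)


# Abbreviated, illustrative samples (real flag lists are much longer).
SOURCE_CPUINFO = "model name : newer CPU\nflags : fpu sse2 aes avx\n"
TARGET_CPUINFO = "model name : older CPU\nflags : fpu sse2 aes\n"

if __name__ == "__main__":
    missing = missing_on_target(parse_flags(SOURCE_CPUINFO),
                                parse_flags(TARGET_CPUINFO))
    # A non-empty list means a live migration in this direction is unsafe.
    print(missing)
```

Migrating in the opposite direction would yield an empty list, which matches the observation that moving to a more capable CPU is unproblematic.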

Fix #1

In our case we gladly accept a slightly smaller CPU feature set over the potentially slightly better performance, because with it comes full migration flexibility. To ensure CPU compatibility across all VMs and hypervisors we can instruct Nova/KVM to report a specific CPU model and set of features to guests. We can figure out which model that would ideally be with the following set of commands.

[daniel.schneller@control01] ~/tmp ➜ pdsh 'node0[1-9],node10' 'sudo virsh capabilities | xmllint --xpath "/capabilities/host/cpu" - > ~daniel.schneller/tmp/$(hostname).xml'  
[daniel.schneller@control01] ~/tmp ➜ cat node*.xml >> all-cpus.xml  
[daniel.schneller@control01] ~/tmp ➜ sudo virsh cpu-baseline all-cpus.xml  
<cpu mode='custom' match='exact'>  
  <model fallback='allow'>Westmere</model>  
  <vendor>Intel</vendor>  
  ...  
</cpu>

The first command assumes a few things:

  1. I can connect to all relevant hypervisor hosts via SSH
  2. I can do password-less sudo there
  3. xmllint is installed on them
  4. My home directory resides on shared storage
  5. ~/tmp exists.

On each node it queries the hypervisor with virsh capabilities, then extracts only the relevant CPU element from the XML. The result is written into a file per host. The second command then combines all the separate XML files into a single one. The third command then uses virsh’s built-in mechanism to resolve multiple sets of CPU capabilities into a baseline that they all support.
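
Conceptually, the baseline is the intersection of the feature sets of all hosts. The sketch below shows just that idea on plain sets; it is a simplification, since `virsh cpu-baseline` additionally maps the result to a named CPU model. The function name and the sample data are assumptions for illustration.

```python
def baseline_flags(flags_by_host):
    """Return the features that every host in the mapping supports."""
    feature_sets = [set(flags) for flags in flags_by_host.values()]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


if __name__ == "__main__":
    # Abbreviated, made-up feature lists for two hypervisors.
    hosts = {
        "node05": ["fpu", "sse2", "aes", "avx"],
        "node10": ["fpu", "sse2", "aes"],
    }
    print(sorted(baseline_flags(hosts)))
```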

In our case we learned that Westmere describes the intersection of all host CPU features. So using Ansible I made sure that all hypervisors had the following entries in /etc/nova/nova-compute.conf:

[DEFAULT]  
compute_driver=libvirt.LibvirtDriver  
[libvirt]  
virt_type=kvm  
# Define custom CPU restriction to the lowest  
# common subset of features across all hypervisors.  
# Otherwise live migrations will fail when trying  
# to move from a more modern CPU to an older one.  
cpu_mode=custom  
cpu_model=Westmere

After that the nova-compute service needs to be restarted on all hypervisors. This can be done without affecting running virtual machines, because it only restarts the service that is (among other things) responsible for spawning new virtual machines, but is not required for active VMs to function.
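
Before restarting the service it can be worth double-checking that the file really carries the intended settings on every host. A minimal sketch using Python's standard `configparser` follows. It assumes the file is plain INI as shown above (Nova itself reads it through its own configuration library), and `cpu_settings` is a hypothetical helper name.

```python
import configparser

SAMPLE_CONF = """\
[DEFAULT]
compute_driver=libvirt.LibvirtDriver
[libvirt]
virt_type=kvm
# Define custom CPU restriction to the lowest
# common subset of features across all hypervisors.
cpu_mode=custom
cpu_model=Westmere
"""


def cpu_settings(conf_text):
    """Return (cpu_mode, cpu_model) from the [libvirt] section, or None values."""
    # Interpolation is disabled so that '%' in values is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return (parser.get("libvirt", "cpu_mode", fallback=None),
            parser.get("libvirt", "cpu_model", fallback=None))


if __name__ == "__main__":
    print(cpu_settings(SAMPLE_CONF))
```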

Problem #2

Unfortunately, even with this obstacle out of the way the migration would still fail with the same error message as before. It turns out there is a problem with the CPU comparison code in Nova. See Nova Bug Ticket #1082414 for details. It boils down to the wrong set of CPUs being compared: instead of checking whether the source’s virtual CPU can be supported by the target’s real CPU, the code compares the two physical CPUs for compatibility, bringing us back to square one.

Fix #2

While the bug is going to be fixed in a newer (at the time of writing yet to be released) OpenStack version, the patch is too big to be back-ported to OpenStack Icehouse. So as an interim solution [1] I simply disabled the broken check as discussed in comments #26 and following of the above-mentioned bug.

Important: Once this patch is in place, nothing prevents you from migrating instances to incompatible hosts! Even though we specified a custom CPU model earlier (fix #1), virtual machines that were launched prior to that change cannot know about the new limitations! Before live-migrating any virtual machines you must make sure to reboot them once to make them pick up the new CPU type!
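
One way to find instances that still need that reboot is to look at the CPU model in their libvirt domain XML, for example as printed by `virsh dumpxml`. The sketch below is a rough heuristic only: the helper names are hypothetical, the sample XML is made up, and a matching model name does not prove that every feature flag matches.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal domain XML for illustration.
SAMPLE_DOMAIN_XML = """\
<domain type='kvm'>
  <name>instance-example</name>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>Westmere</model>
  </cpu>
</domain>
"""


def guest_cpu_model(domain_xml):
    """Return the CPU model named in the domain XML, or None if absent."""
    root = ET.fromstring(domain_xml)
    model = root.find("./cpu/model")
    if model is None or not model.text:
        return None
    return model.text.strip()


def needs_reboot(domain_xml, expected_model="Westmere"):
    """True if the guest was not started with the expected baseline model."""
    return guest_cpu_model(domain_xml) != expected_model


if __name__ == "__main__":
    print(guest_cpu_model(SAMPLE_DOMAIN_XML), needs_reboot(SAMPLE_DOMAIN_XML))
```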

Problem #3

So. Now we should be able to live-migrate, right? Well… Wrong…!

The next problem came down to an unfortunate oversight on our part. Even though it is listed as a requirement in the Configure migrations chapter, we did not ensure that the Nova instances directory (typically /var/lib/nova/instances) was mounted on shared storage. This led to the following error in the source host’s /var/log/nova/nova-compute.log:

RemoteError: Remote error: RemoteError Remote error:  
InvalidSharedStorage_Remote node10 is not on shared storage: Live  
migration can not be used without shared storage.

To determine the presence of shared storage Nova performs a (too) simple check: It tries to create a temporary file in the instance directory of the virtual machine to be migrated and checks if that file can be seen at the same path on the destination host. In our case that naturally failed, because that path resides on a local drive on each hypervisor, even though the VM volumes reside on shared storage. Same as before, apparently this whole part of the code is going through major refactoring for future OpenStack releases, but that did not exactly help me.
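
The logic of that check can be reproduced in a few lines. In this sketch two temporary directories stand in for the local instance directories of two hypervisors; the function names are hypothetical and only mimic the idea, they are not Nova's actual implementation.

```python
import os
import tempfile


def create_test_file(instances_path):
    """Create an empty marker file and return its bare file name."""
    fd, path = tempfile.mkstemp(dir=instances_path)
    os.close(fd)
    return os.path.basename(path)


def check_test_file(instances_path, filename):
    """Report whether the marker file is visible under this path."""
    return os.path.exists(os.path.join(instances_path, filename))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as host_a, \
            tempfile.TemporaryDirectory() as host_b:
        marker = create_test_file(host_a)
        # Separate local directories: the other host cannot see the file.
        print("local disks:   ", check_test_file(host_b, marker))
        # Same directory on both sides, as with a shared mount.
        print("shared storage:", check_test_file(host_a, marker))
```

With separate directories the check fails regardless of where the VM's volumes live, which is exactly the situation described above.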

Fix #3

I was already looking for the right spot to remove that check, too, when I came across this old mailing list thread “Live migration of VM using librbd and OpenStack”, discussing this exact issue. The final message in that thread conveniently has the right place identified already and a valuable hint thrown in for free:

Just for posterity, my ultimate solution was to patch nova on each compute host to always return True in _check_shared_storage_test_file (nova/virt/libvirt/driver.py)

This did make migration work with “nova live-migration”, with one caveat. Since Nova is assuming that /var/lib/nova/instances is on shared storage (and since I hard coded the check to say “yes, it really is”), it thinks the /var/lib/nova/instances/<domain> folder will exist at both source and destination, and makes no attempt to create it on the destination.

This is the complete patch we apply on new compute nodes (including both the CPU check mentioned above and the shared storage workaround):

--- libvirt/driver.py.orig  2014-08-21 19:20:10.000000000 +0200  
+++ libvirt/driver.py   2015-02-27 10:09:17.830455657 +0100  
@@ -4234,9 +4234,10 @@  
             disk_available_mb = \  
                     (disk_available_gb * units.Ki) - CONF.reserved_host_disk_mb
 
-        # Compare CPU  
-        source_cpu_info = src_compute_info['cpu_info']  
-        self._compare_cpu(source_cpu_info)  
+        # Compare CPU -- Daniel Schneller: Disabled due to  
+        # https://bugs.launchpad.net/nova/+bug/1082414  
+        # source_cpu_info = src_compute_info['cpu_info']  
+        # self._compare_cpu(source_cpu_info)
 
         # Create file on storage, to be checked on source host  
         filename = self._create_shared_storage_test_file()  
@@ -4399,11 +4400,22 @@
 
         Cannot confirm tmpfile return False.  
         """  
-        tmp_file = os.path.join(CONF.instances_path, filename)  
-        if not os.path.exists(tmp_file):  
-            return False  
-        else:  
-            return True  
+        # Daniel Schneller: Nova assumes live migration also  
+        # implies shared storage for instance metadata (libvirt.xml)  
+        # and checks this by creating a tempfile in that directory,  
+        # verifying it can be seen from source and destination of  
+        # the migration. This would prevent live migration for us  
+        # unnecessarily. We return True here, no matter what, faking  
+        # shared storage. Cleverly Nova itself even seems to copy  
+        # the instance metadata over again in a later step.  
+        # This will have to be reviewed in later OpenStack versions,  
+        # where improved handling has already been announced.  
+        return True  
+        #tmp_file = os.path.join(CONF.instances_path, filename)  
+        #if not os.path.exists(tmp_file):  
+        #    return False  
+        #else:  
+        #    return True
 
     def _cleanup_shared_storage_test_file(self, filename):  
         """Removes existence of the tmpfile under CONF.instances_path."""

As noted in the patch, in Icehouse Nova creates the console.log and libvirt.xml files on the destination hypervisor, provided the instance directory already exists. Also, since it assumes shared storage, it does not clean up the source directory once the migration is complete.

Finally!

With the above patches and modifications in place, live migration now works as follows:

  1. Determine the VM’s UUID, e.g. with nova show or nova list.
  2. Pick the new destination host and create /var/lib/nova/instances/<VM-UUID>.
  3. Ensure the directory has the correct ownership: chown nova:nova /var/lib/nova/instances/<VM-UUID>
  4. Perform the actual migration: nova live-migration <VM-UUID> <new-host>
  5. Remove the old /var/lib/nova/instances/<VM-UUID> from the old host.
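
The steps above can be sketched as a small helper that only assembles the commands, so they can be reviewed before anything is executed. Running the directory steps through `ssh` and `sudo` is an assumption based on the setup described earlier, and `migration_commands` is a hypothetical name.

```python
import shlex


def migration_commands(vm_uuid, source_host, dest_host,
                       instances_path="/var/lib/nova/instances"):
    """Return the commands for steps 2 to 5 as argument lists (dry run)."""
    instance_dir = shlex.quote(f"{instances_path}/{vm_uuid}")
    return [
        # Step 2: create the instance directory on the destination.
        ["ssh", dest_host, f"sudo mkdir -p {instance_dir}"],
        # Step 3: hand it over to the nova user.
        ["ssh", dest_host, f"sudo chown nova:nova {instance_dir}"],
        # Step 4: trigger the live migration.
        ["nova", "live-migration", vm_uuid, dest_host],
        # Step 5: clean up on the old host, only after the VM has been
        # verified as running on the destination.
        ["ssh", source_host, f"sudo rm -rf {instance_dir}"],
    ]


if __name__ == "__main__":
    # "VM-UUID" is a placeholder, not a real instance ID.
    for command in migration_commands("VM-UUID", "node05", "node10"):
        print(" ".join(shlex.quote(part) for part in command))
```

The helper deliberately does not execute anything: the cleanup in the last step must not run until the instance has been confirmed on the new host, for example via nova show.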

The time needed for the migration is usually in the range of several seconds, sometimes up to a few minutes. This primarily depends on the RAM size, its rate of change inside the virtual machine, and the speed of the network connecting source and destination hypervisors.

Limitations / Caveats

While the above procedure generally works flawlessly, the necessity for the manual creation and deletion of directories is unfortunate and a potential source of errors.

The CPU compatibility issue is less likely to cause trouble in the future. As we have full control over the VMs running in our cluster, we can make sure each VM gets rebooted at least once before it is migrated. And because we will most certainly not add new compute nodes with CPUs inferior to the Westmere models we presently have in our servers, the baseline feature set now configured will work fine for the foreseeable future, too.

To get rid of the manual directory handling, we will probably move /var/lib/nova/instances to CephFS in the coming months; at the moment we only use CephFS for roaming home directories. Once we do that, the second part of the above patch can be reverted again.

Conclusion

In this post I compiled a comprehensive summary of how to enable true live migration with OpenStack Icehouse for KVM based virtual machines built on Ceph/RBD volumes. While the information presented is mostly available from other places on the Internet, having it all combined in one place will hopefully save someone else the tedious work of compiling it again.

Footnotes

  1. interim, adj. originally “provisional”, “limited”; in IT contexts often referring to the most permanent of all solutions. See also: Prototype 😉.

Daniel Schneller has been designing and implementing complex software and database systems for more than 15 years and is the author of the MySQL Admin Cookbook. His current job title is Principal Cloud Engineer at CenterDevice GmbH, where he focuses on OpenStack and Ceph based cloud technologies. He has given talks at FrOSCon, Data2Day and DWX Developer Week among others.

Comments

  • Anyone know if this has improved at all in Juno?

    • 22 March 2015 by Daniel Schneller

      Juno will probably not help you here.

      • The issue about CPU features having to be compatible across machines is not really a bug, but a configuration choice. Knowing about it, specifying a common baseline CPU model is enough.
      • The CPU comparison bug is marked as fixed in Kilo, not Juno.
      • The shared storage thing for the instance directory is not really a bug, but more of a design decision. Not sure anyone is working on that.
