A few weeks ago we were surprised by seemingly random I/O hangs on several virtual machines. Any attempt to write to their data volumes blocked, sending the load average into the stratosphere and, slightly more consequentially, making Elasticsearch and MongoDB freak out.
Looking at the hypervisor’s logs I noticed lots of these messages in dmesg and syslog:
...
Mar 5 22:42:57 node06 kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
Mar 5 22:42:59 node06 kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
...
This post describes the analysis, cause and remedy of the problem.
First Things First: A Quick Fix
The highest priority was to get the databases in the VMs running again. Google turned up a recommendation to drop the page cache on the hypervisor, because the problem seemed to be related to memory fragmentation somehow and could be resolved this way, at least temporarily:
% echo 1 > /proc/sys/vm/drop_caches
Indeed, doing that immediately made the VMs responsive again. In one case, Elasticsearch needed to be restarted inside the VM, but the cluster quickly recovered. However in the following days, the problem kept recurring, increasing in frequency and happening on several hypervisors.
Kernel Update to the Rescue (?)
I noticed we were running a kernel with a known problem in XFS that made it more wasteful with memory than necessary for certain operations (see this 2013 XFS Mailing List Post). According to this Ubuntu Launchpad Issue, that particular problem was fixed in a later Ubuntu kernel, so we installed the most recent one on the hypervisors. For a few days it seemed to fix the issue, because we did not see the hangs again. Unfortunately, the problem still came back, just a little later.
Intermediate measure: the xfs-guard
Doing more research I consistently came across the topic of memory fragmentation, without understanding why it could suddenly become a problem: the machines had run with unchanged configurations, both inside the VMs and on the hypervisor, for several months. (It actually made sense after all, but at this point I had not yet understood the underlying mechanisms.) It also turned out that dropping the page cache was not really necessary; dropping the slab cache was enough to get the VMs responsive again:

% echo 2 > /proc/sys/vm/drop_caches

I suspect dropping the page cache also worked only via a side effect: it freed enough (unrelated) memory for more slab allocations to succeed.
Knowing that this worked, to buy me some time, I quickly threw together a small “xfs-guard” daemon and installed it on the hypervisors. Basically, all it does is constantly tail the syslog, look for the error message and drop the slab cache repeatedly until the message stops appearing. You can find it on GitHub and on Ansible Galaxy.
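Conceptually, the daemon boils down to something like the following sketch. This is not the actual implementation; the log path, pattern and function names are illustrative:

```shell
#!/bin/sh
# Conceptual sketch of the xfs-guard idea (assumptions: syslog lives at
# /var/log/syslog and we run as root; the real tool is on GitHub).
PATTERN='XFS: possible memory allocation deadlock'

# Return success if a log line indicates the XFS allocation deadlock.
is_deadlock_msg() {
    printf '%s\n' "$1" | grep -q "$PATTERN"
}

watch_syslog() {
    tail -F /var/log/syslog | while read -r line; do
        if is_deadlock_msg "$line"; then
            # Drop reclaimable slab objects so XFS allocations can succeed.
            echo 2 > /proc/sys/vm/drop_caches
        fi
    done
}
```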
Once that was deployed (and not a moment too soon: it kicked in for the first time just a few hours after rollout), I had a little more time to find the root cause.
Reading further, everything kept pointing towards file system fragmentation, and in turn (slab) memory fragmentation inside the kernel. It felt a bit like being back in the 90s, when running something like Norton Speed Disk to defragment FAT file systems was a pretty common task. I was reluctant to believe that this could still be a problem almost 30 years later, especially because XFS has a reputation for being pretty good at keeping fragmentation under control.
Norton Disk Doctor and Norton Speed Disk aren't the same thing pic.twitter.com/TSNgCpxobN
— Anatoly Shashkin 💾 (@dosnostalgic) August 8, 2016
XFS will, for example, speculatively pre-allocate more space on disk than is actually requested, assuming that more related data will arrive soon afterwards. This way it can place that data right next to the first chunk, instead of potentially having to put it elsewhere on disk. This, and the fact that the file system in question only held very few, very large image files, provided to the VMs as block devices, made the fragmentation theory seem even more unlikely. I needed more data.
File systems must keep track of the allocated and free parts of a disk. High disk fragmentation results in lots of small pieces of occupied and free space that need to be tracked. Since file systems are part of the kernel, the necessary metadata structures are kept in kernel memory, which is managed by the slab allocator. Even with the most efficient data structures, at some point it becomes impossible to accommodate the metadata for an ever-increasing number of disk space fragments. When that happens, slab memory allocation fails, and the file system blocks writes altogether.
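To get a feeling for how much kernel memory those structures occupy, one can peek at the slab caches directly. A generic sketch (cache names vary between kernel versions, and reading /proc/slabinfo typically requires root):

```shell
#!/bin/sh
# Summarize XFS-related slab caches: name, active objects, object size.
# /proc/slabinfo has two header lines, then one line per cache:
#   name <active_objs> <num_objs> <objsize> ...
summarize_xfs_slabs() {
    awk 'NR > 2 && $1 ~ /^xfs/ {
        printf "%-24s %12d objs %8d bytes each\n", $1, $2, $4
    }' "$1"
}

# Usage (as root): summarize_xfs_slabs /proc/slabinfo
```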
Fortunately, XFS provides some powerful tools to analyse and diagnose what it is doing under the hood. My first step was to run the XFS debugger “xfs_db” and tell it to report some high-level information on overall fragmentation (frag -f). Note that all commands mentioned in this post can be run with the file system mounted and operating normally, which is pretty neat. They will, however, create additional I/O and at least temporarily consume some memory.
% xfs_db -r -c "frag -f" /dev/md0p1
actual 42561387, ideal 1089, fragmentation factor 100.00%
After about 30 seconds, it came back with the above output. 100% looks pretty scary, but according to the XFS documentation, the “fragmentation factor” can be misleading, as it tends to approach 100% quickly without necessarily indicating a serious problem. There is a nice graph in this XFS FAQ entry explaining the details. The output shows the number of “extents” (individual contiguous ranges of data belonging to files), both the actual number and what XFS considers ideal. Even if the fragmentation factor alone was not conclusive, having more than 40 million pieces of data seemed pretty hefty, especially compared to the much smaller ideal value.
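As far as I understand it, the factor is simply (actual - ideal) / actual, which explains why it saturates near 100% as soon as the actual extent count dwarfs the ideal one. Plugging in the numbers from above:

```shell
# Fragmentation factor for the values reported above. Rounded to two
# decimals, as xfs_db prints it, this becomes the scary-looking 100.00%.
awk 'BEGIN {
    actual = 42561387; ideal = 1089
    printf "%.4f%%\n", (actual - ideal) / actual * 100
}'
# prints 99.9974%
```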
So apparently something was wrong, but still more information was needed. Next up, I wanted to figure out how fragmented the free space on the volume was, assuming this might cause writing new data to stall if somehow it could not be made to fit in the free areas.
Allocation Groups and Free Space
I learned that XFS splits up a big volume into “Allocation Groups” which can be considered their own (mostly) separate “sub file systems”. The number of allocation groups depends on how large the total XFS volume is. Their purpose is to allow parallelising file operations. This Novell Knowledge Base page has a nice summary and some helpful scripts to gather information about the free space and its fragmentation per allocation group. The XFS documentation goes into even more technical detail.
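If you want to skip the scripts, xfs_db itself can produce a free-space histogram per allocation group. Something along these lines should work (the device path is the one from this post, and parsing the AG count assumes xfs_db prints it as "agcount = N"):

```shell
#!/bin/sh
# Sketch: free-space histogram per allocation group, read-only.
DEV=/dev/md0p1   # example device from this post

# Pull the number from a line like "agcount = 28".
parse_agcount() {
    awk -F' = ' '$1 == "agcount" {print $2}'
}

report_free_space() {
    agcount=$(xfs_db -r -c 'sb 0' -c 'print agcount' "$DEV" | parse_agcount)
    ag=0
    while [ "$ag" -lt "$agcount" ]; do
        echo "=== allocation group $ag ==="
        xfs_db -r -c "freesp -s -a $ag" "$DEV"
        ag=$((ag + 1))
    done
}
```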
I ran a loop across all the hypervisors’ XFS file systems and came up with nothing particularly helpful. There were vast amounts of free and contiguous space in all allocation groups. For completeness, I ran the same tests inside the VMs, also showing no significant free space fragmentation.
Next up was the analysis of the fragmentation of the disk image files themselves. Since they had been created with a fixed size, shortly after the underlying XFS file system itself, I assumed there could not really be much fragmentation here, either. But as they say…
This is the command used to find the individual extents and their sizes allocated for a particular file on XFS:
% xfs_bmap -v ffc63a70.disk > ffc63a70-bmap-before-defrag.txt
It took a few minutes, despite being run on a pretty fast SSD RAID. While it was going, its resident memory usage ballooned up to about 3 GB. Turns out, it was a good idea to redirect the output to a file:
% wc -l ffc63a70-bmap-before-defrag.txt
34694400 ffc63a70-bmap-before-defrag.txt
Close to 35 million extents just for this single disk image; the .txt file alone was larger than 3 GB! The same general picture turned up for the other VM images across all servers. In total, each hypervisor had around 50 million extents for just a handful of large files. These are the first few lines of one of the text files:
EXT: FILE-OFFSET   BLOCK-RANGE            AG AG-OFFSET      TOTAL
0:   [0..255]:     2339629568..2339629823 24 (512..767)     256
1:   [256..359]:   2339629208..2339629311 24 (152..255)     104
2:   [360..5135]:  2339715400..2339720175 24 (86344..91119) 4776
3:   [5136..5447]: 2339720256..2339720567 24 (91200..91511) 312
4:   [5448..6143]: 2339721496..2339722191 24 (92440..93135) 696
5:   [6144..7495]: 2339723200..2339724551 24 (94144..95495) 1352
6:   [7496..8159]: 2339725560..2339726223 24 (96504..97167) 664
7:   [8160..9167]: 2339727232..2339728239 24 (98176..99183) 1008
Apparently this heavy fragmentation into an immense number of tiny extents is a side effect of how qemu writes to these images. It appears similar to what is described in this 2014 XFS mailing list entry.
Aggressive flushing or direct writes defeat XFS’s fragmentation prevention features. The fact that the image files were allocated en bloc right after the file system’s creation does not matter in this case, because XFS apparently always uses sparse allocations: it does not actually claim all the space immediately, but only dynamically as actual writes happen. Seeing whether we can do anything about this, while at the same time not sacrificing consistency in case of machine crashes, remains a problem to be solved another day.
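The difference between a sparse and a fully preallocated file is easy to see with standard tools. A quick illustration (file names are made up, and whether preallocation helps with qemu images at all is exactly the open question):

```shell
#!/bin/sh
# Sparse file: large apparent size, but no blocks allocated yet.
truncate -s 1G sparse.img
# Preallocated file: blocks reserved up front (needs fs support).
if fallocate -l 1G prealloc.img 2>/dev/null; then
    # First column of ls -ls shows blocks actually allocated on disk.
    ls -ls sparse.img prealloc.img
else
    echo "fallocate not supported on this file system"
    ls -ls sparse.img
fi
```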
With this in mind it’s understandable why the problems began only quite some time after the creation of the disk images. Up until then, XFS had dutifully managed the ever-growing number of extents per file, up to a tipping point where the memory needed to do so started to run out regularly. So even with xfs-guard in place, given the constant write activity inside the virtual machines, it would only have delayed the inevitable.
Defragmenting the disk images
XFS comes with a file system reorganizer tool “xfs_fsr” which can work on individual files (but also whole volumes). It tries to create a (less fragmented) copy of a file and replaces the original with that copy once it is done. This operation temporarily requires enough free space, ideally largely contiguous, to create the duplicate. Luckily, as determined earlier, we had large areas of unfragmented free space available. Moreover, a file cannot be in use while the reorganization takes place. So on a weekend I shut down the relevant VMs, one at a time, making their disk images available to xfs_fsr. This is a sample run for one disk image:
% time xfs_fsr -v ffc63a70.disk | tee ffc63a70.disk-xfs_fsr-output.txt
ffc63a70.disk
extents before:34694398 after:81
xfs_fsr -v ffc63a70.disk  0.13s user 732.62s system 55% cpu 22:01.67 total
For this particular image, of the original 34 million extents, only 81 remained after the defragmentation. I can see how that is easier to track 😏. The numbers for all other VM images were similar; every time, the number of extents dropped by several orders of magnitude. Note that due to the way the reorganizer works, it causes heavy I/O, so you might want to run it during off-hours, if your usage patterns allow it.
Conclusion and next steps
The memory allocation deadlocks, which had started to happen almost daily, have not occurred again. In addition to the xfs-guard watchdog, we are now preparing a script to report the number of extents per disk image to our monitoring system regularly. So even if we can’t find a way to prevent qemu from causing the fragmentation to grow, we can keep an eye on it and schedule defrags before it reaches critical levels again.
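Such a monitoring script does not even need XFS-specific tooling: filefrag from e2fsprogs uses the FIEMAP ioctl and works on XFS as well. A sketch of the idea (paths and the warning threshold are made-up examples):

```shell
#!/bin/sh
# Report extent counts per disk image and warn above a threshold.
THRESHOLD=1000000            # example value, tune to your environment
IMAGE_DIR=/var/lib/vms       # example path

# filefrag prints e.g. "/var/lib/vms/a.disk: 81 extents found".
parse_extents() {
    awk '{print $(NF - 2)}'
}

check_images() {
    for img in "$IMAGE_DIR"/*.disk; do
        extents=$(filefrag "$img" | parse_extents)
        echo "$img: $extents extents"
        if [ "$extents" -gt "$THRESHOLD" ]; then
            echo "WARNING: $img exceeds $THRESHOLD extents, schedule a defrag"
        fi
    done
}
```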
Interestingly, shortly after the VMs were restarted, I re-checked the fragmentation manually. One of the images had already reached over 40,000 extents again. However, a few more samples taken over the following days showed that after the rather steep initial increase, growth seemed to taper off. Still, all the more reason to watch it closely.