I have a massive KVM installation with hundreds of guests running dozens of different OSes, and I have also noticed multiple qcow2 corruption bugs. All my guests use the qcow2 format, and my hosts run vanilla linux 2.6.28 x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors). My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to run kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the version from kvm-72 to fix the BSOD. kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact registry corruption error: http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_... This bug is also fixed by reverting block-qcow2.c to the version from kvm-72. I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the qcow2 performance regression caused by the default writethrough caching policy), but they randomly trigger an even worse bug: the moment I shut down a guest by typing "quit" in the monitor, it sometimes overwrites the first 4kB of the disk image with mostly NUL bytes (!), which completely destroys it. I am familiar with the qcow2 format, and this 4kB block appears to be an L2 table with most entries set to zero. I have had to restore at least 6 or 7 disk images from backup after occurrences of that bug. My intuition tells me this may be the qcow2 code trying to allocate a cluster to write a new L2 table, but not noticing that the allocation failed (represented by a 0 offset), and writing the L2 table at that 0 offset, overwriting the qcow2 header. Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted to its kvm-72 version. Basically qcow2 in kvm-73 or newer is completely unreliable. -marc --
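The failure mode Marc suspects can be illustrated with a toy model. The sketch below is not the qemu code: `alloc_cluster_toy`, `write_l2_table` and `make_toy_image` are hypothetical names invented for illustration, and the "0 means failure" convention is simply Marc's hypothesis taken at face value. The only facts relied on are that a qcow2 image begins with the magic bytes `QFI\xfb` followed by a big-endian 32-bit version number.

```python
import struct

QCOW_MAGIC = b"QFI\xfb"   # first four bytes of a qcow2 image
CLUSTER_SIZE = 4096       # matches the 4kB block described above; illustrative only


def header_looks_intact(first_bytes: bytes) -> bool:
    """True if the buffer starts with the qcow2 magic and a version >= 2."""
    if len(first_bytes) < 8 or first_bytes[:4] != QCOW_MAGIC:
        return False
    (version,) = struct.unpack(">I", first_bytes[4:8])
    return version >= 2


def alloc_cluster_toy(free_offsets: list) -> int:
    """Hypothetical allocator: returns 0 when no cluster is free."""
    return free_offsets.pop() if free_offsets else 0


def write_l2_table(image: bytearray, free_offsets: list,
                   table: bytes, check_alloc: bool) -> int:
    """Write an L2 table into a freshly allocated cluster.

    With check_alloc=False a failed allocation (offset 0) goes unnoticed
    and the table lands on top of the header, as in Marc's hypothesis.
    """
    offset = alloc_cluster_toy(free_offsets)
    if check_alloc and offset == 0:
        raise OSError("cluster allocation failed; refusing to write at offset 0")
    image[offset:offset + len(table)] = table
    return offset


def make_toy_image(clusters: int = 4) -> bytearray:
    """Zero-filled buffer with just enough header to pass the check above."""
    image = bytearray(clusters * CLUSTER_SIZE)
    image[:8] = QCOW_MAGIC + struct.pack(">I", 2)
    return image
```

The same `header_looks_intact` check can be applied to the first 8 bytes read from a real image file as a quick test for this particular kind of damage; it says nothing about corruption elsewhere in the image.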
Hi Marc, You should not take qemu-devel out of the CC list. This is where the bugs need to be fixed; they aren't KVM-specific. I'm quoting your complete mail to forward it to where it belongs. I think the corruption is a completely unrelated bug. I would suspect it was introduced in one of Gleb's patches in December. Adding him to CC. Kevin --
Ow! That's a really serious bug. How many of us have regular hourly backups of our disk images? And how many of us are running databases or mail servers on our VMs, where even restoring from a recent backup is a harmful event? I've not noticed this bug reported by Marc, probably because I nearly always finish a KVM session by killing it, either because I'm testing or because KVM locks up occasionally and needs kill -9 :-( And because I've not used any KVM since kvm-72 in production until recently, only for testing my personal VMs. I must say, _thank goodness_ that the bug I reported occurs at boot time, and caused me to revert the qcow2 code. I'm now running a critical VM on kvm-83 with reverted qcow2. Sure, it's risky as there's no reason to believe kvm-83 is "stable", but there's no reason to believe any other version of KVM is especially stable either - there's no stabilising bug-fix-only branch that I'm aware of. If I hadn't had the boot-time bug which I reported, I could have had unrecoverable corruption instead from Marc's bug. For the time being, I'm going to _strongly_ advise my VM-using professional clients to never, *ever* use qcow2 except for snapshot testing. Unfortunately the other delta/growable formats seem to be even less reliable, because they're not used much, so they should be avoided too. This corruption plus the data integrity/durability issues on host failure are a big deal. Even with kvm-72, I'm nervous about qcow2 now. Just because a bug hasn't caused obvious guest failures doesn't mean it's not happening. Is there a way to restructure the code and/or how it works so it's My intuition says it's important to identify the cause of this, as it might not be qcow2 but the AIO code going awry with a random offset when closing down, e.g. if there's a use-after-free bug. Marc.. this is quite a serious bug you've reported. Is there a reason you didn't report it earlier? -- Jamie --
There's an ad-hoc one w/out formal releases. But...never been closer ;-) http://thread.gmane.org/gmane.comp.emulators.kvm.devel/28179 thanks, -chris --
Because I only started hitting that bug a couple of weeks ago after I am seriously concerned about the general design of qcow2. The code base is more complex than it needs to be, the format itself is susceptible to race conditions causing cluster leaks when updating some internal data structures, it gets easily fragmented, etc. I am considering implementing a new disk image format that supports base images, snapshots (of the guest state), clones (of the disk content); that has a radically simpler design & code base; that is always consistent "on disk"; that is friendly to delta diffing (i.e. space-efficient when used with ZFS snapshots or rsync); and that makes use of checksumming & replication to detect & fix corruption of critical data structures (ideally this should be implemented by the filesystem, unfortunately ZFS is not available everywhere :D). I believe the key to achieving these (seemingly utopian) goals is to represent a disk "image" as a set of sparse files, 1 per snapshot/clone. -marc --
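One possible reading of the "set of sparse files, 1 per snapshot/clone" idea is a chain of layers, where each layer stores only the blocks written while it was active and reads fall through to the parent. The sketch below is a guess at that structure, not Marc's design: the `Layer` class, the `fork` helper and the use of a dict in place of a sparse file are all assumptions made for illustration.

```python
BLOCK_SIZE = 4096  # illustrative block size, not taken from the thread


class Layer:
    """Stands in for one sparse file: only blocks written here are stored."""

    def __init__(self, parent=None):
        self.parent = parent   # the layer this one was forked from, if any
        self.blocks = {}       # block index -> bytes; missing keys are holes

    def read(self, index: int) -> bytes:
        """Return the newest copy of a block, walking up the parent chain."""
        layer = self
        while layer is not None:
            if index in layer.blocks:
                return layer.blocks[index]
            layer = layer.parent
        return bytes(BLOCK_SIZE)  # never written anywhere: reads as zeros

    def write(self, index: int, data: bytes) -> None:
        """Store a block in this layer only; parents are never modified."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        self.blocks[index] = data


def fork(base: Layer) -> Layer:
    """Create a new writable layer on top of `base`.

    In this toy model a snapshot and a clone are the same operation:
    `base` is treated as frozen and all new writes go to the child.
    """
    return Layer(parent=base)
```

In this model two clones forked from the same base share every block neither has overwritten, which is where the space efficiency would come from. It does not address the harder goals listed above (on-disk consistency, checksumming, replication).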
Both qcow2 and vmdk have the ability to keep 'external' snapshots. In addition to what you wrote, qcow2 is missing a journal for its metadata and also performs poorly because of complex metadata and sync calls. We might use the vmdk or VHD format as a base for a future high-performing, safe image format for qemu --
I didn't see any mention of this in QEMU's documentation. One of the most annoying features of qcow2 is "savevm" storing all VM snapshots You'll want to validate VHD carefully. I tested it just yesterday (with kvm-83), and "qemu-img convert" does not correctly unpack my VHD image (from Microsoft Virtual PC) to raw, compared with the unpacked version from MSVPC's own conversion tool. There are some patches which greatly improve the VHD support; I'm not sure if they're in kvm-83. -- Jamie --
I know, but they don't implement one feature I cited: clones, or "writable snapshots", which I would like implemented with support for deduplication. Base images / backing files are too limited because they have to be managed by the end user and there is no deduplication Neither vmdk nor vhd satisfies my requirements: not always consistent on disk, no possibility of detecting/correcting errors, susceptible to fragmentation (affects vmdk, not sure about vhd), and possibly others. Jamie: yes, in an ideal world, the storage virtualization layer could make use of the host's filesystem or block layer snapshotting/cloning features, but in the real world too few OSes implement these. -marc --
When I read it, I thought the code was remarkably compact for what it does, although I agree that the leaks, fragmentation and inconsistency on crashes are serious. From elsewhere it sounds like the refcount You have just described a high quality modern filesystem or database engine; both would certainly be far more complex than qcow2's code. Especially with checksumming and replication :) ZFS isn't everywhere, but it looks like everyone wants to clone ZFS's best features everywhere (but not its worst feature: lots of memory required). You can already do this, if your filesystem supports snapshotting. On Linux hosts, any filesystem can snapshot by using LVM underneath it (although it's not pretty to do). A few experimental Linux filesystems let you snapshot at the filesystem level. A feature you missed in the utopian vision is sharing backing store for equal parts of files between different snapshots _after_ they've been written in separate branches (with the same data), and also among different VMs. It's becoming stylish to put similarity detection in the filesystem somewhere too :-) -- Jamie --
I am not able to reproduce this. After more than a hundred "boot Linux; generate disk I/O; quit" loops, all I've got is an image with 7 leaked blocks and a couple of filesystem corruptions that were fixed by fsck. -- Gleb. --
The type of activity occurring in the guest is most likely an important factor determining the probability of the bug occurring. So you should try running the guest OS I remember having been affected by it: Windows 2003 SP2 x64. And now that I think about it, I don't recall any other guest OS having been a victim of that bug... coincidence? Other factors you might consider when trying to reproduce: the qcow2 images that ended up being corrupted had a backing file (a read-only qcow2 image); NPT was in use; the host filesystem was xfs; my command line was: $ qemu-system-x86_64 -name xxx -monitor stdio -vnc xxx:xxx -hda hda -net nic,macaddr=xx:xx:xx:xx:xx:xx,model=rtl8139 -net tap -boot c -cdrom "" -cpu qemu64 -m 1024 -usbdevice tablet -marc --
And the probability of that bug occurring seems to be less than 1% (I only witnessed 6 or 7 occurrences out of about a thousand shutdown events). Also, contrary to what I said, I am *not* sure whether the "quit" monitor command was used or not. Instead the guests may have been using ACPI to shut themselves down. -marc --
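Taking these figures at face value (6 or 7 hits in about a thousand shutdowns), a short calculation shows why a run of about a hundred loops, like the one Gleb reported, could easily come up empty. The sketch assumes each shutdown is an independent trial with the same failure probability, which is a simplification: as noted earlier in the thread, guest OS and workload may matter a great deal. The function names are made up for this example.

```python
import math


def prob_zero_hits(p: float, trials: int) -> float:
    """Chance of seeing no corruption at all in `trials` independent shutdowns."""
    return (1.0 - p) ** trials


def trials_for_confidence(p: float, confidence: float) -> int:
    """Smallest number of shutdowns giving at least `confidence` chance of one hit."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for hits in (6, 7):            # the two counts quoted above
        p = hits / 1000            # "about a thousand shutdown events"
        print(f"p = {p:.3f}: "
              f"P(no hit in 100 loops) = {prob_zero_hits(p, 100):.2f}, "
              f"loops for a 95% chance of a hit = {trials_for_confidence(p, 0.95)}")
```

Under these assumptions a hundred clean loops is roughly a coin flip, so a failed reproduction at that scale says little either way; several hundred loops would be needed before a clean run becomes strong evidence.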
