Re: qcow2 corruption observed, fixed by reverting old change

Previous thread: [PATCH v3 0/6] ATS capability support for Intel IOMMU by Yu Zhao on Thursday, February 12, 2009 - 5:50 am. (20 messages)

Next thread: [PATCH 0/6] [v2] kvm: fix hot remove assigned device with IOMMU by Han, Weidong on Thursday, February 12, 2009 - 11:57 pm. (1 message)
From: Marc Bevand
Date: Thursday, February 12, 2009 - 11:41 pm

I have got a massive KVM installation with hundreds of guests running dozens of
different OSes, and have also noticed multiple qcow2 corruption bugs. All my
guests are using the qcow2 format, and my hosts are running vanilla linux 2.6.28
x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors).

My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to run
kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the
version from kvm-72 to fix the BSOD.

kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact
registry corruption error:
http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_...
This bug is also fixed by reverting block-qcow2.c to the version from kvm-72.

I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the
qcow2 performance regression caused by the default writethrough caching policy)
but they randomly trigger an even worse bug: the moment I shut down a guest by
typing "quit" in the monitor, the first 4kB of the disk image is sometimes
overwritten with mostly NUL bytes (!), which completely destroys it. I am
familiar with the qcow2 format, and this 4kB block appears to be an L2 table
with most entries set to zero. I have had to restore at least 6 or 7 disk
images from backup after occurrences of that bug. My intuition tells me this
may be the qcow2
code trying to allocate a cluster to write a new L2 table, but not noticing the
allocation failed (represented by a 0 offset), and writing the L2 table at that
0 offset, overwriting the qcow2 header.
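
The failure mode suspected above can be illustrated with a small simulation.
This is only a sketch of the hypothesis, not the actual block-qcow2.c code:
alloc_cluster, write_l2_table and the two new_l2_table_* helpers are
hypothetical names, and the "image" is a bytearray rather than a file. The
only details taken from the real format are the qcow2 magic bytes and the
8-byte big-endian L2 entries.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"   # first four bytes of every qcow2 image
CLUSTER_SIZE = 4096        # matches the 4kB block seen in the damaged images


def alloc_cluster(image, limit):
    """Hypothetical allocator: returns a byte offset, or 0 on failure."""
    offset = len(image)
    if offset + CLUSTER_SIZE > limit:
        return 0  # failure is signalled in-band as offset 0
    image.extend(bytes(CLUSTER_SIZE))
    return offset


def write_l2_table(image, offset, entries):
    """Serialize 8-byte big-endian L2 entries and write them at `offset`."""
    table = struct.pack(">%dQ" % len(entries), *entries)
    image[offset:offset + len(table)] = table


def new_l2_table_unchecked(image, limit):
    """Suspected bug pattern: the 0 return value is never checked."""
    offset = alloc_cluster(image, limit)
    write_l2_table(image, offset, [0] * (CLUSTER_SIZE // 8))
    return offset


def new_l2_table_checked(image, limit):
    """Same operation, but an allocation failure aborts the write."""
    offset = alloc_cluster(image, limit)
    if offset == 0:
        raise IOError("cluster allocation failed")
    write_l2_table(image, offset, [0] * (CLUSTER_SIZE // 8))
    return offset


def header_intact(image):
    """True if the image still starts with the qcow2 magic."""
    return bytes(image[:4]) == QCOW2_MAGIC


# A toy image: one cluster holding only the magic, with no room to grow.
image = bytearray(QCOW2_MAGIC + bytes(CLUSTER_SIZE - 4))
full = len(image)

try:
    new_l2_table_checked(image, limit=full)
except IOError:
    pass
intact_after_checked = header_intact(image)

new_l2_table_unchecked(image, limit=full)
intact_after_unchecked = header_intact(image)
```

In this toy model the checked variant leaves the header alone, while the
unchecked variant replaces the first cluster with an all-zero L2 table, which
is consistent with the damage described above. Whether the real code takes
this path is still unconfirmed.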

Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted
to its kvm-72 version.

Basically qcow2 in kvm-73 or newer is completely unreliable.

-marc

--

From: Kevin Wolf
Date: Friday, February 13, 2009 - 4:16 am

Hi Marc,

You should not take qemu-devel out of the CC list. This is where the
bugs need to be fixed, they aren't KVM specific. I'm quoting your
complete mail to forward it to where it belongs.


I think the corruption is a completely unrelated bug. I would suspect it
was introduced in one of Gleb's patches in December. Adding him to CC.

Kevin
--

From: Jamie Lokier
Date: Friday, February 13, 2009 - 9:23 am

Ow!  That's a really serious bug.  How many of us have regular hourly
backups of our disk images?  And how many of us are running databases
or mail servers on our VMs, where even restoring from a recent backup
is a harmful event?

I've not noticed this bug reported by Marc, probably because I nearly
always finish a KVM session by killing it, either because I'm testing
or because KVM locks up occasionally and needs kill -9 :-(

And because I've not used any KVM since kvm-72 in production until
recently, only for testing my personal VMs.

I must say, _thank goodness_ that the bug I reported occurs at boot
time, and caused me to revert the qcow2 code.  I'm now running a
critical VM on kvm-83 with reverted qcow2.  Sure it's risky as there's
no reason to believe kvm-83 is "stable", but there's no reason to
believe any other version of KVM is especially stable either - there's
no stabilising bug fix only branch that I'm aware of.

If I hadn't had the boot time bug which I reported, I could have
unrecoverable corruption instead from Marc's bug.

For the time being, I'm going to _strongly_ advise my VM-using
professional clients to never, *ever* use qcow2 except for snapshot
testing.

Unfortunately the other delta/growable formats seem to be even less
reliable, because they're not used much, so they should be avoided too.

This corruption plus the data integrity/durability issues on host
failure are a big deal.  Even with kvm-72, I'm nervous about qcow2 now.
Just because a bug hasn't caused obvious guest failures, doesn't mean
it's not happening.

Is there a way to restructure the code and/or how it works so it's

My intuition says it's important to identify the cause of this, as it
might not be qcow2 but the AIO code going awry with a random offset
when closing down, e.g. if there's a use-after-free bug.

Marc..  this is quite a serious bug you've reported.  Is there a
reason you didn't report it earlier?

-- Jamie 
--

From: Chris Wright
Date: Friday, February 13, 2009 - 11:43 am

There's ad-hoc one w/out formal releases.  But...never been closer ;-)

http://thread.gmane.org/gmane.comp.emulators.kvm.devel/28179

thanks,
-chris
--

From: Marc Bevand
Date: Friday, February 13, 2009 - 11:31 pm

Because I only started hitting that bug a couple weeks ago after

I am seriously concerned about the general design of qcow2. The code
base is more complex than it needs to be, the format itself is
susceptible to race conditions causing cluster leaks when updating
some internal datastructures, it gets easily fragmented, etc.

I am considering implementing a new disk image format that supports
base images, snapshots (of the guest state), clones (of the disk
content); that has a radically simpler design & code base; that is
always consistent "on disk"; that is friendly to delta diffing (i.e.
space-efficient when used with ZFS snapshots or rsync); and that makes
use of checksumming & replication to detect & fix corruption of
critical data structures (ideally this should be implemented by the
filesystem, unfortunately ZFS is not available everywhere :D).

I believe the key to achieving these (seemingly utopian) goals is to
represent a disk "image" as a set of sparse files, 1 per
snapshot/clone.
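
A minimal sketch of that idea, under loud assumptions: each Layer below stands
in for one sparse file, with a dict of written blocks playing the role of the
file's allocated extents and missing keys playing the role of holes. The class
and its methods are hypothetical and exist only for illustration; no on-disk
layout is implied.

```python
BLOCK_SIZE = 512
ZERO_BLOCK = bytes(BLOCK_SIZE)


class Layer:
    """One snapshot or clone; `blocks` stands in for a sparse file."""

    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}  # block number -> data; missing keys are "holes"

    def write(self, block_no, data):
        """Writes always land in this layer, never in a parent."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[block_no] = bytes(data)

    def read(self, block_no):
        """Reads fall through holes to the nearest ancestor with data."""
        layer = self
        while layer is not None:
            if block_no in layer.blocks:
                return layer.blocks[block_no]
            layer = layer.parent
        return ZERO_BLOCK  # never written anywhere in the chain


base = Layer()
base.write(0, b"A" * BLOCK_SIZE)

clone_a = Layer(parent=base)
clone_b = Layer(parent=base)
clone_a.write(0, b"B" * BLOCK_SIZE)
```

Here clone_a sees its own block 0, while clone_b and base still see the
original data, and a block nobody has written reads back as zeros.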

-marc
--

From: Dor Laor
Date: Saturday, February 14, 2009 - 3:28 pm

Both qcow2 and vmdk have the ability to keep 'external' snapshots.
In addition to what you wrote, qcow2 is missing a journal for its metadata
and also performs poorly because of complex metadata and sync calls.

We might use the vmdk or VHD format as a base for the future
high-performing, safe image format for qemu.

--

From: Jamie Lokier
Date: Saturday, February 14, 2009 - 7:27 pm

I didn't see any mention of this in QEMU's documentation.  One of the
most annoying features of qcow2 is "savevm" storing all VM snapshots

You'll want to validate VHD carefully.  I tested it just yesterday
(with kvm-83), and "qemu-img convert" does not correctly unpack my VHD
image (from Microsoft Virtual PC) to raw, compared with the unpacked
version from MSVPC's own conversion tool.  There are some patches which
greatly improve the VHD support; I'm not sure if they're in kvm-83.

-- Jamie
--

From: Marc Bevand
Date: Sunday, February 15, 2009 - 12:56 am

I know but they don't implement one feature I cited: clones, or
"writable snapshots", which I would like implemented with support for
deduplication. Base images / backing files are too limited because
they have to be managed by the enduser and there is no deduplication

Neither vmdk nor vhd satisfies my requirements: not always consistent on
disk, no possibility of detecting/correcting errors, susceptible to
fragmentation (affects vmdk, not sure about vhd), and possibly others.

Jamie: yes in an ideal world, the storage virtualization layer could
make use of the host's filesystem or block layer snapshotting/cloning
features, but in the real world too few OSes implement these.

-marc
--

From: Jamie Lokier
Date: Saturday, February 14, 2009 - 7:37 pm

When I read it, I thought the code was remarkably compact for what it
does, although I agree that the leaks, fragmentation and inconsistency
on crashes are serious.  From elsewhere it sounds like the refcount

You have just described a high quality modern filesystem or database
engine; both would certainly be far more complex than qcow2's code.
Especially with checksumming and replication :)

ZFS isn't everywhere, but it looks like everyone wants to clone ZFS's
best features everywhere (but not its worst feature: lots of memory
required).


You can already do this, if your filesystem supports snapshotting.  On
Linux hosts, any filesystem can snapshot by using LVM underneath it
(although it's not pretty to do).  A few experimental Linux
filesystems let you snapshot at the filesystem level.

A feature you missed in the utopian vision is sharing backing store
for equal parts of files between different snapshots _after_ they've
been written in separate branches (with the same data), and also among
different VMs.  It's becoming stylish to put similarity detection in
the filesystem somewhere too :-)
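
One common way to get that kind of sharing is content-addressed storage, where
blocks are keyed by a hash of their data. The sketch below is an illustration
under assumptions, not a description of any existing filesystem: DedupStore
and its methods are hypothetical names, everything lives in memory, and
unreferenced blocks are never reclaimed.

```python
import hashlib

BLOCK_SIZE = 4096


class DedupStore:
    """Toy content-addressed block store shared by several images."""

    def __init__(self):
        self.blocks = {}   # digest -> block data, stored once
        self.images = {}   # image name -> {block number: digest}

    def write(self, image, block_no, data):
        """Store the data under its hash and point the image at it."""
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, bytes(data))
        self.images.setdefault(image, {})[block_no] = digest

    def read(self, image, block_no):
        """Look up the digest for this image's block and return its data."""
        digest = self.images[image][block_no]
        return self.blocks[digest]

    def stored_blocks(self):
        """Number of distinct blocks actually kept."""
        return len(self.blocks)


store = DedupStore()
payload = b"\x42" * BLOCK_SIZE

# Two VMs write the same data independently, at different block numbers.
store.write("vm1", 10, payload)
store.write("vm2", 99, payload)
store.write("vm2", 100, b"\x00" * BLOCK_SIZE)
```

Both images can read their data back, but the identical payload is kept only
once, even though it was written separately by each VM.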

-- Jamie
--

From: Gleb Natapov
Date: Sunday, February 15, 2009 - 3:57 am

I am not able to reproduce this. After more than a hundred "boot linux;
generate disk io; quit" loops, all I've got is an image with 7 leaked blocks
and a couple of filesystem corruptions that were fixed by fsck.

--
			Gleb.
--

From: Marc Bevand
Date: Sunday, February 15, 2009 - 4:46 am

The type of activity occurring in the guest is most likely an important
factor determining the probability of the bug occurring. So you should
try running the guest OS I remember having been affected by it: Windows
2003 SP2 x64.

And now that I think about it, I don't recall any other guest OS
having been a victim of that bug... coincidence?

Other factors you might consider when trying to reproduce: the qcow2
images that ended up being corrupted had a backing file (a read-only
qcow2 image); NPT was in use; the host filesystem was xfs; my command
line was:

$ qemu-system-x86_64 -name xxx -monitor stdio -vnc xxx:xxx -hda hda \
    -net nic,macaddr=xx:xx:xx:xx:xx:xx,model=rtl8139 -net tap -boot c \
    -cdrom "" -cpu qemu64 -m 1024 -usbdevice tablet

-marc
--

From: Marc Bevand
Date: Sunday, February 15, 2009 - 4:54 am

And the probability of that bug occurring seems less than 1% (I only
witnessed 6 or 7 occurrences out of about a thousand shutdown events).
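
As a rough sanity check on those figures, the snippet below estimates how
likely a run of about a hundred boot/quit loops would be to hit the bug at
least once. It assumes every shutdown is independent and fails at the same
rate, which is a simplification, since the rate probably depends on the guest
OS and its activity. The function name is just for illustration.

```python
def chance_of_at_least_one(p, trials):
    """Probability of one or more hits in `trials` independent attempts."""
    return 1.0 - (1.0 - p) ** trials


# 6 or 7 occurrences out of about a thousand shutdowns.
low = chance_of_at_least_one(6 / 1000, 100)
high = chance_of_at_least_one(7 / 1000, 100)

print("chance of a hit in 100 loops: %.0f%% to %.0f%%" % (low * 100, high * 100))
```

Under those assumptions the chance comes out at roughly 45% to 50%, so a
hundred clean loops would not be surprising even if the bug were present.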

Also, contrary to what I said I am *not* sure whether the "quit"
monitor command was used or not. Instead the guests may have been
using ACPI to shut themselves down.

-marc
--
