open-vm-tools guest stats overflow
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
open-vm-tools (Debian) | Fix Released | Unknown | |
open-vm-tools (Ubuntu) | Fix Released | Medium | Christian Ehrhardt |
Bionic | Fix Released | Medium | Unassigned |
Cosmic | Fix Released | Medium | Christian Ehrhardt |
Bug Description
[Impact]
* If running a 32-bit kernel (rare these days, but still in use by some
upgraders until we fully drop it), /proc/vmstat has 32-bit values.
* These values can wrap at 32 bits, and open-vm-tools will not notice
because it assumes only 64-bit values.
* That causes "just" a spike in the reported stats, but because
higher-level consumers such as VM placement algorithms act on those
numbers, this can trigger a mass migration off that node, which in turn
can make everything worse.
* Include the upstream fix for that problem to ensure people are not
affected by it.
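To illustrate the effect described above, here is a minimal sketch of how a wrapped 32-bit counter turns into a spike. It is not the upstream patch: the function names are hypothetical, and the "naive" variant only assumes that a consumer subtracts the samples as unsigned 64-bit values.

```python
# Sketch only: illustrates the failure mode, not the open-vm-tools code.
UINT32_MODULUS = 2**32  # wrap point of a 32-bit counter
UINT64_MODULUS = 2**64  # assumed width used by the stats consumer


def naive_delta(previous: int, current: int) -> int:
    """Unsigned 64-bit subtraction that is unaware of a 32-bit wrap."""
    return (current - previous) % UINT64_MODULUS


def wrap_aware_delta(previous: int, current: int,
                     modulus: int = UINT32_MODULUS) -> int:
    """Difference between two samples of a counter that wraps at `modulus`."""
    return (current - previous) % modulus


def rate_per_second(delta: int, interval_seconds: float) -> float:
    """Turn a counter delta into a per-second rate."""
    return delta / interval_seconds


# The counter was close to 2**32, then wrapped and restarted near zero.
previous_sample, current_sample = 4_294_967_000, 704

print(rate_per_second(naive_delta(previous_sample, current_sample), 20))
print(rate_per_second(wrap_aware_delta(previous_sample, current_sample), 20))
```

The first value printed is the kind of huge rate a placement algorithm could react to; the second is the real increase of 1000 events over the 20 second interval.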
[Test Case]
* This is a lot of effort to verify explicitly, but since the change is
small, code review will in most cases be enough once the test is
understood.
- To trigger the error you need a VMware guest with a 32-bit kernel.
Since i386 is no longer mainstream, the easiest way to get there is to
install from
http://
and then upgrade to Bionic.
- Next, check the stat values. To do so you can use the script
attached to the bug [1].
Run it on the host via:
$ python query_vmguestst
localhost --user root --password <root password>
These numbers should never "go crazy" due to the wraparound.
- Once all this is set up you need to ramp up counters such as
pgfaults to cause a wraparound. To do so, essentially run a lot of
read I/O.
This could be done with:
$ sudo mkdir /data1
$ sudo fio /tmp/seq-read.fio
While the config is:
$ cat /tmp/seq-read.fio
; Read 4 files with aio at different depths
[global]
ioengine=libaio
buffered=0
rw=read
bs=128k
size=128m
directory=/data1
iodepth=32
direct=1
time_based
runtime=60s
[file1]
[file2]
[file3]
[file4]
Obviously 60 seconds is not enough, and it is recommended to tune the
path and disk backing to your needs to run as fast as possible.
- At the same time run on the guest
$ cat /proc/vmstat | grep pgpgin
- At some point the latter value will wrap; without the fix
this makes the stats observed by VMware spike to huge values.
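The guest-side check can also be scripted. The following is a rough sketch, not part of the bug's attached script: all function names are hypothetical, and the wrap point of 2**32 for the displayed value is an assumption that may not hold for every counter.

```python
# Sketch only: helpers for watching a /proc/vmstat counter on the guest.
ASSUMED_MODULUS = 2**32  # assumption: the displayed value wraps here


def parse_vmstat(text: str) -> dict:
    """Parse '/proc/vmstat'-style 'name value' lines into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_counter(name: str = "pgpgin", path: str = "/proc/vmstat") -> int:
    """Read one counter from the running kernel (Linux only)."""
    with open(path) as handle:
        return parse_vmstat(handle.read())[name]


def has_wrapped(previous: int, current: int) -> bool:
    """A monotonically increasing counter that went down has wrapped."""
    return current < previous


def seconds_until_wrap(current: int, rate_per_second: float,
                       modulus: int = ASSUMED_MODULUS) -> float:
    """Estimate the remaining time until the counter reaches `modulus`."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    return (modulus - current) / rate_per_second
```

Sampling `read_counter()` twice a few seconds apart gives a rate that can be fed into `seconds_until_wrap()`, which shows how long the read load has to run before `has_wrapped()` can become true.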
[Regression Potential]
* Worst case the numbers we try to fix would get worse (due to the new
calculation being wrong). But that would only be "as bad as it is now".
Furthermore the code change is rather small.
Also, 64-bit wraparounds are not touched (I wonder why, but let's stick
to the upstream code), which means that on 64-bit systems (= most
systems) this is a no-op, further reducing the risk of a regression.
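The no-op argument can be checked with a small sketch. It uses a generic modulo-based correction, which is an assumption and not necessarily the condition used in the upstream code; the names are hypothetical. As long as the counter never decreases and grows by less than 2**32 between samples, the corrected deltas equal the plain ones.

```python
# Sketch only: a wrap correction changes nothing for non-wrapping samples.
UINT32_MODULUS = 2**32  # assumed wrap point of a 32-bit counter


def naive_deltas(samples):
    """Plain differences between consecutive samples."""
    return [b - a for a, b in zip(samples, samples[1:])]


def wrap_aware_deltas(samples, modulus=UINT32_MODULUS):
    """Differences corrected for a counter that wraps at `modulus`."""
    return [(b - a) % modulus for a, b in zip(samples, samples[1:])]


# Values as a 64-bit kernel could report them: above 2**32, never decreasing.
samples_64bit = [5_000_000_000, 5_000_004_096, 5_000_010_000]

print(naive_deltas(samples_64bit) == wrap_aware_deltas(samples_64bit))
```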
[Other Info]
* Taking the change was suggested by VMware, which owns the tools as well
as most solutions consuming the stats, so we'd like to follow that
request.
---
Reported at Debian as well, see https:/
There is an unhandled overflow issue in open-vm-tools in the code for guest stats reporting. This causes artifacts (spikes) in rate stats, for example "Guest|Page In Rate per second". This issue only affects 32-bit builds of open-vm-tools.
We have a fix for 10.3.x at
https:/
The fix has also been backported to 10.2.5 in a special branch:
https:/
Thanks,
Oliver
Related branches
- Andreas Hasenack: Approve
- Canonical Server: Pending requested
- git-ubuntu developers: Pending requested
- Diff: 76 lines (+54/-0), 3 files modified
debian/changelog (+7/-0)
debian/patches/lp-1793219-fix-stats-overflow.patch (+46/-0)
debian/patches/series (+1/-0)
- Robie Basak: Pending requested
- Canonical Server: Pending requested
- git-ubuntu developers: Pending requested
- Diff: 76 lines (+54/-0), 3 files modified
debian/changelog (+7/-0)
debian/patches/lp-1793219-fix-stats-overflow.patch (+46/-0)
debian/patches/series (+1/-0)
description: updated
Changed in open-vm-tools (Debian): status: Unknown → New
Changed in open-vm-tools (Debian): status: New → Fix Released
Just FYI - I saw this but didn't get to it due to a business trip.
I didn't want to leave the impression it was lost, so I wanted to let you know that.