open-vm-tools guest stats overflow

Bug #1793219 reported by Oliver Kurth on 2018-09-18
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
open-vm-tools (Debian)
New
Unknown
open-vm-tools (Ubuntu)
Medium
 Christian Ehrhardt 
Bionic
Medium
Unassigned
Cosmic
Medium
 Christian Ehrhardt 

Bug Description

[Impact]

 * If running a 32 bit kernel (rare these days but still existing for some
   upgraders until we full drop it) then /proc/vmstat has 32bit values

 * These values can wrap at 32bit and the open-vm-tools will not "realize"
   that as they assume only 64bit values.

 * That causes "just" a spike in the stats being reported, but due to the
   fact that there are higher level e.g. VM placement algorithms at work
   consuming those numbers this can trigger a mass migration off that
   node which in turn can make everything worse.

 * Include the upstream fix to that problem to ensure people are not
   affected by it.

[Test Case]

 * This is a lot of effort to verify explicitly, but since the change is
   small once the test is understood code review will in most cases be
   enough.
   - To trigger the error you'd need a VMWare Guest with 32 bit kernel
     since i386 is no more mainstream the easiest way to get there is to
     install from
     http://releases.ubuntu.com/16.04/ubuntu-16.04.5-server-i386.iso
     And then upgrade to Bionic.
   - then the next thing you'd need to do is to check the stat values
     to do so you can use the script attached to the bug [1]
     Run it on the host via:
     $ python query_vmgueststats.py --vmname <name of the vm> --host
       localhost --user root --password <root password>
     These numbers should never "go crazy" due to the wraparound.
   - Once all this is set up you'd need to ramp up the numbers of e.g.
     pgfaults to cause a wraparound - to do so essentially run a lot of
     read I/O
     This could be done with:
      $ sudo mkdir /data1
      $ sudo fio /tmp/seq-read.fio
     While the config is:
      $ cat /tmp/seq-read.fio
; Read 4 files with aio at different depths
[global]
ioengine=libaio
buffered=0
rw=read
bs=128k
size=128m
directory=/data1
iodepth=32
direct=1
time_based
runtime=60s

[file1]

[file2]

[file3]

[file4]

     Obviously 60 seconds is not enough, and it is recommended to tune the
     path and disk backing to your needs to run as fast as possible.

   - At the same time run on the guest
      $ cat /proc/vmstat | grep pgpgin

   - At some point the numbers of the latter will wrap, without the fix
     this will make the vmware observed stats spike to huge values.

[Regression Potential]

 * Worst case the numbers we try to fix would get worse (due to the new
   calculation being wrong). But that would only be "as bad as it is now".
   Furthermore the code change is rather small.
   Also 64bit wraparounds are not touched (I wonder why but lets stick to
   the upstream code) but that means on 64bit systems (=most systems) this
   is a no-op further reducing the risk for an regression.

[Other Info]

 * taking the change was suggested by VMware who owns the tools as well as
   most solutions consuming the stats, so we'd like to follow that
   request.

[1]: https://bugs.launchpad.net/ubuntu/+source/open-vm-tools/+bug/1793219/+attachment/5193417/+files/query_vmgueststats.py

---

Reported at Debian as well, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=909146 :

There is an unhandled overflow issue in open-vm-tools in the code for guest stats reporting. This cause artifacts (spikes) in rate stats, for example "Guest|Page In Rate per second". This issue only affects 32 bit builds of open-vm-tools.

We have a fix for 10.3.x at
https://github.com/vmware/open-vm-tools/commit/c7a186e204cdff46b5e02bcb5208ef8979eaf261

The fix has also been backported to 10.2.5 in a special branch:
https://github.com/vmware/open-vm-tools/tree/stable-10.2.5-stat-overflow-fix

Thanks,
Oliver

Related branches

Just FYI - I saw this but didn't get to it due to a business trip.
I didn't want to leave the impression it was lost, so I wanted to let you know that.

Hi Oliver,
I started to work on this for the upcoming release for now.
The change in general looks correct and rather safe to me, so I can go on for the development version for sure.

To address the issue in former releases we have to follow the SRU process [1] which requires some elements I need your help. Especially the "steps to reproduce" in this case are rather hard for me to come up with.

Did you come up with a reliable way to trigger such a wraparound to test this?
And if so, where "on the host side" would I check for these spikes.

[1]: https://wiki.ubuntu.com/StableReleaseUpdates

Changed in open-vm-tools (Ubuntu Bionic):
status: New → Confirmed
Changed in open-vm-tools (Ubuntu Cosmic):
status: New → Triaged
assignee: nobody →  Christian Ehrhardt  (paelzer)
importance: Undecided → Medium
Changed in open-vm-tools (Ubuntu Bionic):
importance: Undecided → Medium
Changed in open-vm-tools (Ubuntu Cosmic):
status: Triaged → In Progress

A PPA with the fix is building at [1], it is meant to augment the MP review, but whoever wants feel free to check this.

Packaging MP prepared and filed at [2]

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3434
[2]: https://code.launchpad.net/~paelzer/ubuntu/+source/open-vm-tools/+git/open-vm-tools/+merge/355530

The fix for Cosmic is uploaded for cosmic, due to the Beta ISO Freeze being in place is might tkae slightly longer than usual to build [1] and migrate.

For Bionic I really need your Help oliver to Describe steps to reproduce in some way - could you outline that a bit?

[1]: https://launchpad.net/ubuntu/+source/open-vm-tools/2:10.3.0-0ubuntu3

Oliver Kurth (okurth-1) wrote :

If I understand correctly, Bionic and Cosmic do not support 32 bit, is that right? So this can only be tested in xenial with a 32 bit VM.

I am not familiar enough with the issue to know how to reproduce this - I'll ask.

Oliver Kurth (okurth-1) wrote :

Okay, informed myself. The issue is hard to reproduce. A customer had this issue about once per week, it depends on how much memory paging in/out is occurring. The trouble though is that because these overflows create such huge numbers that statistics for the whole cluster get polluted. That is why we are pushing this fix.

Attached is a script that can be run on the ESX server. Usage:

python query_vmgueststats.py --vmname <name of the vm> --host localhost --user root --password <root password>

It will periodically query guest stats and print them out. If those numbers make sense, this is all to be expected for a successful test.

i386 is available up to the latest release, just no Desktop ISO images to slowly fade it out.
You can and should be able to test it with a Bionic/Cosmic.

To be able to follow your usual deployment path with an ISO you can as well use a Xenial 32bit ISO and "do-release-upgrade" to Bionic from there.

But TBH the commit states that the values would be 32bit even on 64bit kernels.
https://github.com/vmware/open-vm-tools/commit/c7a186e204cdff46b5e02bcb5208ef8979eaf261

I haven't checked if that is true, the fix is valid no matter what.
We only would like the extra testability for the SRU process.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package open-vm-tools - 2:10.3.0-0ubuntu3

---------------
open-vm-tools (2:10.3.0-0ubuntu3) cosmic; urgency=medium

  * d/p/lp-1793219-fix-stats-overflow.patch: fix potential overflow of 32 bit
    /proc values (LP: #1793219)

 -- Christian Ehrhardt <email address hidden> Mon, 24 Sep 2018 11:21:11 +0200

Changed in open-vm-tools (Ubuntu Cosmic):
status: In Progress → Fix Released
Oliver Kurth (okurth-1) wrote :

Here is a more detailed explanation of the bug:

Linux kernel defines /proc/vmstat::pgpgin as type unsigned long, so it is uint32 on 32 bit Linux, uint64 on 64 bit Linux.

VMware Tools reads /proc/vmstat::pgpgin and saves it as uint64 type on both 32 & 64 bit Linux.

When 64 bit kernel /proc/vmstat::pgpgin value overflows, VMware Tools can correctly calculates the difference by doing uint64 type subtraction.

When 32 bit kernel /proc/vmstat::pgpgin value overflows, the difference should be calculated by doing uint32 type subtraction, not uint64 type subtraction which would result in a fake huge difference. This is the bug that we have fixed.

Depends on Linux system memory usage, it could take many days for kernel /proc/vmstat::pgpgin value to overflow, so this bug is hard to reproduce.

description: updated

SRU template prepared and an MP for Bionic is up fro review.

Robie Basak (racb) wrote :

++ if (currentStat->value < previousStat->value &&
++ previousStat->value <= MAX_UINT32) {
++ valueDelta = (uint32)(currentStat->value) -
++ (uint32)(previousStat->value);

I'm not sure I follow how this is correct. The casting seems dubious, but acceptable since previousStat->value fits in 32 bits (we just checked it's less than MAX_UINT32) and currentStat->value has also been verified to be less than that.

But if this has happened because currentStat->value wrapped around 32 bits, then surely the correct valueDelta is currentStat->value + (MAX_UINT32 - previousStat->value)? If previousStat->value was MAX_UINT32 for example, and currentStat->value is now 0, then the valueDelta should be 1. But (0 - MAX_UINT32) will surely get promoted and then end up being assigned to a double as a negative number.

What am I missing?

Hello Oliver, or anyone else affected,

Accepted open-vm-tools into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/open-vm-tools/2:10.3.0-0ubuntu1~18.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in open-vm-tools (Ubuntu Bionic):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-bionic

I tested the proposed vmtools in a vsphere in general, but as outlined in the testing section it is hard.
I wasn't able to trigger the bug before the upgrade - so not seeing it after isn't too much of a hard confirm. But I already outlined that problem in the SRU template and was giving it a general health check that was ok.

Also I agree to Robie that the upstream code is better than it was before, but not quite perfect.
But we want to sync what we got from upstream and not create our own version.

I'll mark it as verified under the circumstances, but it is up to the SRU Team to let this a bit longer in -proposed if you think that is the right thing to do here.

@Oliver - you could take a look at Rbasaks great comment and consider rechecking and maybe fixing future upstream if it is an issue.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Changed in open-vm-tools (Debian):
status: Unknown → New
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.