libvirtError: operation failed: cannot read cputime for domain

Bug #1372670 reported by Joe Gordon
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Eduardo Costa

Bug Description

2014-09-22 15:09:59.534 ERROR nova.compute.manager [req-74866bd8-5382-4354-89ca-7683a013d99c ServerDiskConfigTestJSON-218625483 ServerDiskConfigTestJSON-1404511620] [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] Setting instance vm_state to ERROR
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] Traceback (most recent call last):
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/compute/manager.py", line 6054, in _error_out_instance_on_exception
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] yield
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/compute/manager.py", line 3740, in resize_instance
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] timeout, retry_interval)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5838, in migrate_disk_and_power_off
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] self.power_off(instance, timeout, retry_interval)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2474, in power_off
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] self._clean_shutdown(instance, timeout, retry_interval)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2434, in _clean_shutdown
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] (state, _max_mem, _mem, _cpus, _t) = dom.info()
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] result = proxy_call(self._autowrap, f, *args, **kwargs)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] rv = execute(f, *args, **kwargs)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] six.reraise(c, e, tb)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] rv = meth(*args, **kwargs)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1068, in info
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] libvirtError: operation failed: cannot read cputime for domain
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738]

http://logs.openstack.org/71/123071/1/gate/gate-tempest-dsvm-postgres-full/7369ae8/logs/screen-n-cpu.txt.gz?level=TRACE#_2014-09-22_15_09_59_534

I am seeing this stack trace in some nova-compute logs. It looks like it shows up even in passing jobs, but I think that is because tempest doesn't always fail when a 'nova delete' fails.

Tags: libvirt
Revision history for this message
Sean Dague (sdague) wrote :

Is there a proposed next step here?

Changed in nova:
status: New → Incomplete
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

libvirtd.txt says

2014-09-22 15:09:59.532+0000: 17850: warning : qemuGetProcessInfo:1306 : cannot parse process status data
2014-09-22 15:09:59.532+0000: 17850: error : qemuDomainGetInfo:2509 : operation failed: cannot read cputime for domain

and it also says on the same thread earlier

2014-09-22 15:07:28.627+0000: 17850: warning : AppArmorSetFDLabel:919 : could not find path for descriptor /proc/self/fd/27, skipping

which seems to map to a missing /proc/self/fd/27

http://code.metager.de/source/xref/lib/virt/src/qemu/qemu_driver.c#1341

Revision history for this message
Joe Gordon (jogo) wrote :

The next step is to have someone familiar with libvirt triage this. IMHO, stack traces should always be treated as bugs.

tags: added: libvirt
Changed in nova:
status: Incomplete → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

message:"operation failed: cannot read cputime for domain" AND tags:"libvirt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwib3BlcmF0aW9uIGZhaWxlZDogY2Fubm90IHJlYWQgY3B1dGltZSBmb3IgZG9tYWluXCIgQU5EIHRhZ3M6XCJsaWJ2aXJ0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MTE1MDI2NTgyNjEsIm1vZGUiOiJ0cmVuZCIsImFuYWx5emVfZmllbGQiOiJidWlsZF9zdGF0dXMifQ==

20 hits in 7 days, 80% of them in failed runs.

This is maybe a race on delete, i.e. we can't read the cputime because the domain XML (instance) is already deleted? What is it stack-tracing on in the code?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh, it's a race on resize, so the old domain XML is probably deleted and it's trying to read that, and kaboom. We should have danpb take a look. Maybe this is trying to do some kind of recovery on failure, and that's why it doesn't always result in a failed tempest run.

Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Solly Ross (sross-7) wrote :

Yeah, it looks like some sort of race condition. In qemu_driver.c in the libvirt source we have:

    if (!virDomainObjIsActive(vm)) {
        info->cpuTime = 0;
    } else {
        if (qemuGetProcessInfo(&(info->cpuTime), NULL, NULL, vm->pid, 0) < 0) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("cannot read cputime for domain"));
            goto cleanup;
        }
    }

This calls qemuGetProcessInfo. Now here's the interesting part: essentially, qemuGetProcessInfo tries to open /proc/{pid}/stat (or /proc/{pid}/task/{tid}/stat for a specific thread). If it fails to asprintf the pid into the path, the function fails (although I doubt that is what's happening here). Interestingly enough, if it fails to open the path, it doesn't error out; it just returns all zeros. The only other place it can error out is if fscanf fails to parse the file's contents. That implies that we're able to open the path for reading, but not able to parse it.
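
To make that concrete, here is a rough Python sketch of the logic just described (libvirt's real code is C in qemu_driver.c; the helper name and the field handling are simplified illustrations, not libvirt code):

    def read_cputime_ticks(pid):
        # The race: open() succeeds while the QEMU process is exiting,
        # but the subsequent read yields nothing parseable, which is the
        # equivalent of the fscanf() failure path in libvirt.
        with open("/proc/%d/stat" % pid) as f:
            fields = f.read().split()  # naive split; comm may contain spaces
        if len(fields) < 15:
            raise RuntimeError("cannot parse process status data")
        utime, stime = int(fields[13]), int(fields[14])  # in clock ticks
        return utime + stime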

Are there any libvirt warnings before the stack trace?

Revision history for this message
Joe Gordon (jogo) wrote :

I didn't see any libvirt warnings in the nova logs prior to the stack trace, but here are the libvirt logs as well:

http://logs.openstack.org/71/123071/1/gate/gate-tempest-dsvm-postgres-full/7369ae8/logs/libvirtd.txt.gz

Revision history for this message
Joe Gordon (jogo) wrote :

2014-10-08 16:23:21.083+0000: 12577: error : qemuMonitorIO:656 : internal error: End of file from monitor
2014-10-08 16:23:28.465+0000: 12579: warning : qemuGetProcessInfo:1306 : cannot parse process status data
2014-10-08 16:23:28.465+0000: 12579: error : qemuDomainGetInfo:2509 : operation failed: cannot read cputime for domain

http://logs.openstack.org/00/122100/10/gate/gate-tempest-dsvm-neutron-full/0e71cd2/logs/libvirtd.txt.gz

Changed in nova:
importance: Low → High
Revision history for this message
Daniel Berrange (berrange) wrote :

That sequence of log messages shows that the QEMU process is exiting while we are reading the CPU time. That's not necessarily a bug - it all depends on whether we are *expecting* the QEMU process to be exiting or not. In general, any libvirt API call on a domain can fail in this way if the guest shuts down while the call is being performed.
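
In calling code, that observation suggests a defensive pattern along these lines (a minimal sketch against the libvirt Python bindings; the helper name is illustrative, and note that this particular bug surfaces as VIR_ERR_OPERATION_FAILED rather than VIR_ERR_NO_DOMAIN):

    import libvirt

    def get_domain_state(dom):
        # Any per-domain libvirt call can race with the guest exiting,
        # so treat "the domain disappeared" as a normal outcome rather
        # than a hard error.
        try:
            state, _max_mem, _mem, _cpus, _cputime = dom.info()
            return state
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return libvirt.VIR_DOMAIN_SHUTOFF  # guest already gone
            raise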

Revision history for this message
Eduardo Costa (ecosta) wrote :

Created a bug report for libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=1169055
As said before, it seems to be a race condition between fopen() and fscanf() in libvirt. I have proposed a patch as well.

Revision history for this message
Eduardo Costa (ecosta) wrote :

Patch applied, libvirt bug closed. Now we just need to wait for the next libvirt release...

http://libvirt.org/git/?p=libvirt.git;a=commit;h=ff018e686a8a412255bc34d3dc558a1bcf74fac5

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is there anything we can do in the nova code to detect this and try to work around or ignore it?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/151953

Changed in nova:
assignee: nobody → Eduardo Costa (ecosta)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/151953
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=32c10850ae733c524e403592771b02491a96e67d
Submitter: Jenkins
Branch: master

commit 32c10850ae733c524e403592771b02491a96e67d
Author: Eduardo Costa <email address hidden>
Date: Sun Feb 1 19:49:31 2015 -0200

    Workaround for race condition in libvirt

    Method "info" is affected by a race condition in libvirt.
    Methods "vcpus" and "memoryStats" are also affected, but their
    errors are handled gracefully by present code.

    This patch detects the race and then performs a retry on error.

    New code should avoid calling libvirt.virDomain.info directly.
    Instead, method Host.get_domain_info() should be called.

    A full fix for this bug would involve upgrading libvirt to
    version >=1.2.11.

    Change-Id: I4e3baceabdbc843635d07652abf80d39d58d9e06
    Partial-Bug: #1372670
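
The retry approach the commit message describes looks roughly like this (a simplified sketch, not the verbatim nova patch; the merged change routes callers through Host.get_domain_info() as noted above):

    import libvirt

    def get_domain_info(virt_dom):
        # On libvirt < 1.2.11, dom.info() can race with the QEMU process
        # exiting and fail with "cannot read cputime for domain".
        try:
            return virt_dom.info()
        except libvirt.libvirtError as e:
            if (e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED and
                    'cannot read cputime' in (e.get_error_message() or '')):
                # Retry once; by now the domain has either settled into a
                # readable state or is truly gone (the retry then raises).
                return virt_dom.info()
            raise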

Eduardo Costa (ecosta)
Changed in nova:
status: In Progress → Fix Released