libvirtError: operation failed: cannot read cputime for domain

Bug #1372670 reported by Joe Gordon
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Eduardo Costa

Bug Description

2014-09-22 15:09:59.534 ERROR nova.compute.manager [req-74866bd8-5382-4354-89ca-7683a013d99c ServerDiskConfigTestJSON-218625483 ServerDiskConfigTestJSON-1404511620] [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] Setting instance vm_state to ERROR
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] Traceback (most recent call last):
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/compute/manager.py", line 6054, in _error_out_instance_on_exception
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] yield
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/compute/manager.py", line 3740, in resize_instance
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] timeout, retry_interval)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5838, in migrate_disk_and_power_off
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] self.power_off(instance, timeout, retry_interval)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2474, in power_off
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] self._clean_shutdown(instance, timeout, retry_interval)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2434, in _clean_shutdown
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] (state, _max_mem, _mem, _cpus, _t) = dom.info()
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] result = proxy_call(self._autowrap, f, *args, **kwargs)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] rv = execute(f, *args, **kwargs)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] six.reraise(c, e, tb)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] rv = meth(*args, **kwargs)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1068, in info
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738] libvirtError: operation failed: cannot read cputime for domain
2014-09-22 15:09:59.534 25333 TRACE nova.compute.manager [instance: c09099c1-5dde-4ba9-8a8e-94ff75309738]

http://logs.openstack.org/71/123071/1/gate/gate-tempest-dsvm-postgres-full/7369ae8/logs/screen-n-cpu.txt.gz?level=TRACE#_2014-09-22_15_09_59_534

I am seeing this stack trace in some nova-compute logs. It looks like it shows up even in passing jobs, but I think that is because tempest doesn't always fail when a 'nova delete' fails.

Tags: libvirt
Revision history for this message
Sean Dague (sdague) wrote :

Is there a proposed next step here?

Changed in nova:
status: New → Incomplete
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

libvirtd.txt says

2014-09-22 15:09:59.532+0000: 17850: warning : qemuGetProcessInfo:1306 : cannot parse process status data
2014-09-22 15:09:59.532+0000: 17850: error : qemuDomainGetInfo:2509 : operation failed: cannot read cputime for domain

and it also says on the same thread earlier

2014-09-22 15:07:28.627+0000: 17850: warning : AppArmorSetFDLabel:919 : could not find path for descriptor /proc/self/fd/27, skipping

which seems to map to a missing /proc/self/fd/27

http://code.metager.de/source/xref/lib/virt/src/qemu/qemu_driver.c#1341

Revision history for this message
Joe Gordon (jogo) wrote :

The next step is to have someone familiar with libvirt triage this. IMHO, stack traces should always be treated as bugs.

tags: added: libvirt
Changed in nova:
status: Incomplete → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

message:"operation failed: cannot read cputime for domain" AND tags:"libvirt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwib3BlcmF0aW9uIGZhaWxlZDogY2Fubm90IHJlYWQgY3B1dGltZSBmb3IgZG9tYWluXCIgQU5EIHRhZ3M6XCJsaWJ2aXJ0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MTE1MDI2NTgyNjEsIm1vZGUiOiJ0cmVuZCIsImFuYWx5emVfZmllbGQiOiJidWlsZF9zdGF0dXMifQ==

20 hits in 7 days, 80% of them in failed runs.

This is maybe a race on delete, i.e. we can't read the cputime because the domain XML (instance) is already deleted? What is it stack-tracing on in the code?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh, it's a race on resize, so the old domain XML is probably deleted and it's trying to read that, and kaboom. We should have danpb take a look. Maybe this is trying to do some kind of recovery on failure, and that's why it doesn't always result in a failed tempest run.

Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Solly Ross (sross-7) wrote :

Yeah, it looks like some sort of race condition. In qemu_driver.c in the libvirt source we have:

    if (!virDomainObjIsActive(vm)) {
        info->cpuTime = 0;
    } else {
        if (qemuGetProcessInfo(&(info->cpuTime), NULL, NULL, vm->pid, 0) < 0) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("cannot read cputime for domain"));
            goto cleanup;
        }
    }

This calls qemuGetProcessInfo. Now here's the interesting part: essentially, qemuGetProcessInfo tries to open /proc/{pid}/stat (or /proc/{pid}/task/{tid}/stat for a specific thread). If it fails to asprintf the pid into the path, the function fails (although I doubt that is what's happening here). Interestingly enough, if it fails to open the path, it doesn't error out; it just returns all zeros. The only other place it can error out is if fscanf fails to parse the file's contents. That implies that we're able to open the path for reading, but not able to parse it.
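
To make that concrete, here is a rough Python sketch of the logic just described (libvirt's real code is C in qemu_driver.c; the helper name and the field handling are simplified illustrations, not libvirt code):

    def read_cputime_ticks(pid):
        # The race: open() succeeds while the QEMU process is exiting,
        # but the subsequent read yields nothing parseable, which is the
        # equivalent of the fscanf() failure path in libvirt.
        with open("/proc/%d/stat" % pid) as f:
            fields = f.read().split()  # naive split; comm may contain spaces
        if len(fields) < 15:
            raise RuntimeError("cannot parse process status data")
        utime, stime = int(fields[13]), int(fields[14])  # in clock ticks
        return utime + stime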

Are there any libvirt warnings before the stack trace?

Revision history for this message
Joe Gordon (jogo) wrote :

I didn't see any libvirt warnings in the nova logs prior to the stack trace, but here are the libvirt logs as well:

http://logs.openstack.org/71/123071/1/gate/gate-tempest-dsvm-postgres-full/7369ae8/logs/libvirtd.txt.gz

Revision history for this message
Joe Gordon (jogo) wrote :

2014-10-08 16:23:21.083+0000: 12577: error : qemuMonitorIO:656 : internal error: End of file from monitor
2014-10-08 16:23:28.465+0000: 12579: warning : qemuGetProcessInfo:1306 : cannot parse process status data
2014-10-08 16:23:28.465+0000: 12579: error : qemuDomainGetInfo:2509 : operation failed: cannot read cputime for domain

http://logs.openstack.org/00/122100/10/gate/gate-tempest-dsvm-neutron-full/0e71cd2/logs/libvirtd.txt.gz

Changed in nova:
importance: Low → High
Revision history for this message
Daniel Berrange (berrange) wrote :

That sequence of log messages shows that the QEMU process is exiting while we are reading the CPU time. That's not necessarily a bug - it all depends on whether we are *expecting* the QEMU process to be exiting or not. In general, any libvirt API call on a domain can fail in this way if the guest shuts down while the call is being performed.
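
In calling code, that observation suggests a defensive pattern along these lines (a minimal sketch against the libvirt Python bindings; the helper name is illustrative, and note that this particular bug surfaces as VIR_ERR_OPERATION_FAILED rather than VIR_ERR_NO_DOMAIN):

    import libvirt

    def get_domain_state(dom):
        # Any per-domain libvirt call can race with the guest exiting,
        # so treat "the domain disappeared" as a normal outcome rather
        # than a hard error.
        try:
            state, _max_mem, _mem, _cpus, _cputime = dom.info()
            return state
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return libvirt.VIR_DOMAIN_SHUTOFF  # guest already gone
            raise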

Revision history for this message
Eduardo Costa (ecosta) wrote :

Created a bug report for libvirt: https://bugzilla.redhat.com/show_bug.cgi?id=1169055
As said before, it seems to be a race condition between fopen() and fscanf() in libvirt. I have proposed a patch as well.

Revision history for this message
Eduardo Costa (ecosta) wrote :

Patch applied, libvirt bug closed. Now we just need to wait for the next libvirt release...

http://libvirt.org/git/?p=libvirt.git;a=commit;h=ff018e686a8a412255bc34d3dc558a1bcf74fac5

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is there anything we can do in the nova code to detect this and try to work around or ignore it?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/151953

Changed in nova:
assignee: nobody → Eduardo Costa (ecosta)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/151953
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=32c10850ae733c524e403592771b02491a96e67d
Submitter: Jenkins
Branch: master

commit 32c10850ae733c524e403592771b02491a96e67d
Author: Eduardo Costa <email address hidden>
Date: Sun Feb 1 19:49:31 2015 -0200

    Workaround for race condition in libvirt

    Method "info" is affected by a race condition in libvirt.
    Methods "vcpus" and "memoryStats" are also affected, but their
    errors are handled gracefully by present code.

    This patch detects the race and then performs a retry on error.

    New code should avoid calling libvirt.virDomain.info directly.
    Instead, method Host.get_domain_info() should be called.

    A full fix for this bug would involve upgrading libvirt to
    version >=1.2.11.

    Change-Id: I4e3baceabdbc843635d07652abf80d39d58d9e06
    Partial-Bug: #1372670
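
The retry approach the commit message describes looks roughly like this (a simplified sketch, not the verbatim nova patch; the merged change routes callers through Host.get_domain_info() as noted above):

    import libvirt

    def get_domain_info(virt_dom):
        # On libvirt < 1.2.11, dom.info() can race with the QEMU process
        # exiting and fail with "cannot read cputime for domain".
        try:
            return virt_dom.info()
        except libvirt.libvirtError as e:
            if (e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED and
                    'cannot read cputime' in (e.get_error_message() or '')):
                # Retry once; by now the domain has either settled into a
                # readable state or is truly gone (the retry then raises).
                return virt_dom.info()
            raise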

Eduardo Costa (ecosta)
Changed in nova:
status: In Progress → Fix Released