libvirt: "Instance failed to shutdown in 60 seconds." in the gate

Bug #1829896 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

I noticed today that when shutting down an instance, e.g. for a snapshot test, nova sometimes waits the full 60 seconds for a graceful shutdown before just destroying the guest:

http://logs.openstack.org/56/656656/12/check/nova-multi-cell/6a403ba/compute1/logs/screen-n-cpu.txt.gz#_May_16_19_47_26_017509

May 16 19:47:26.017509 ubuntu-bionic-rax-dfw-0006201136 nova-compute[17635]: INFO nova.virt.libvirt.driver [None req-bde576dc-5de2-4113-982d-025c65b97e8d tempest-MigrationsAdminTest-391674513 tempest-MigrationsAdminTest-391674513] [instance: 21c45b79-b0f4-48a8-8949-9f480beb5864] Instance failed to shutdown in 60 seconds.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Instance%20failed%20to%20shutdown%20in%2060%20seconds.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

This could contribute to overall job/test timeouts, since a lot of tempest test wait operations use the build_timeout config, which is set to 196 seconds. I'm not sure why these guests aren't stopping; they should stop promptly since they shouldn't be doing anything. Looking at the affected branches in logstash, it seems this started in Stein.

This is with libvirt 4.0.0-1ubuntu8.10 and qemu 1:2.11+dfsg-1ubuntu7.13.

I don't see any errors in the libvirtd logs, the shutdown request returns with code 0.
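
For reference, at the libvirt level the behaviour described above is roughly the following (a manual approximation with virsh, not nova's actual code; the domain name is illustrative):

  # ask the guest to shut down via ACPI; as the libvirtd log shows,
  # the request itself is accepted and returns 0
  $> virsh shutdown instance-00000001
  # nova then polls the domain state for up to shutdown_timeout (60 seconds here)
  $> virsh domstate instance-00000001
  running
  # ...and falls back to a hard power-off once the timeout expires
  $> virsh destroy instance-00000001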

Tags: libvirt gate
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Hmm, I'm completely puzzled about the root cause as well. And indeed, the libvirtd log[1] is utterly boring, with the shutdown request returning "0":

    2019-05-16 19:46:25.777+0000: 10502: debug : virThreadJobClear:121 : Thread 10502 (virNetServerHandleJob) finished job remoteDispatchDomainShutdown with ret=0

And I'm not even sure if I can reproduce this problem locally to track it down.

Nothing in the syslog: http://logs.openstack.org/56/656656/12/check/nova-multi-cell/6a403ba/compute1/logs/syslog.txt.gz

Maybe some clues in the guest boot log? Do we capture it? Let's look...okay, there is this compressed, system-wide journal log[2], maybe that has it? Let's extract it and examine:

  $> gunzip devstack.journal.xz.gz
  $> /lib/systemd/systemd-journal-remote <(xzcat ./devstack.journal.xz) -o output.journal
  Finishing after writing 6333 entries

`grep`ing around `sudo journalctl --file ./output.journal` didn't yield anything interesting either. It's also still not clear where I can see a guest serial console log (from boot time).
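
To be concrete, the poking around was along these lines (the search terms here are illustrative, not an exact record of what was tried):

  $> sudo journalctl --file ./output.journal --no-pager | grep -iE 'shutdown|acpi|power'
  $> sudo journalctl --file ./output.journal -u libvirtd --no-pager | tail -n 100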

[1] http://logs.openstack.org/56/656656/12/check/nova-multi-cell/6a403ba/compute1/logs/libvirt/libvirtd_log.txt.gz
[2] http://logs.openstack.org/56/656656/12/check/nova-multi-cell/6a403ba/compute1/logs/devstack.journal.xz.gz

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe we should push a patch that dumps the guest console log when we time out on the graceful shutdown, just as a WIP/debug patch, to see what's going on inside the guest.
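
In the meantime, anyone holding a failing node can grab the console log by hand; something like the following should work (the path is the devstack default instances directory and the UUID is the one from the log message above, so treat both as illustrative):

  $> sudo cat /opt/stack/data/nova/instances/21c45b79-b0f4-48a8-8949-9f480beb5864/console.log
  # or via the API
  $> openstack console log show 21c45b79-b0f4-48a8-8949-9f480beb5864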

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/660766

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a guest console log from a shutdown timeout in the gate:

http://paste.openstack.org/show/752002/

Revision history for this message
Chris Dent (cdent) wrote :

Any chance the looping to talk to the metadata server is blocking being able to respond to the poweroff in a timely and graceful fashion?

I don't have much to support that assertion other than that's where it is in the logs, and it will keep retrying up to 20 times. When I was working on etcd-compute, if I didn't have the metadata service turned on, booting some VMs would effectively block on that loop.

Revision history for this message
Matt Riedemann (mriedem) wrote :

> Any chance the looping to talk to the metadata server is blocking being able to respond to the poweroff in a timely and graceful fashion?

Maybe, I'm not sure. I thought cloud-init would eventually time out if it couldn't get a response from the metadata API, like after 10 seconds or something, but clearly it is retrying. So yeah, if it retries 20 times with 10 seconds between attempts, that's longer than even the build_timeout config we have in tempest (196 seconds), so it would run up against the power off.

I wonder if there is a way to configure that retry loop in cloud-init for our gate images...
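
If it is configurable, it would presumably be through datasource settings dropped into the image, e.g. a file like /etc/cloud/cloud.cfg.d/99-metadata-timeouts.cfg containing something along these lines (untested; the option names and values are an assumption to be checked against the cloud-init docs):

  datasource:
    OpenStack:
      max_wait: 30
      timeout: 5
      retries: 3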

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a devstack patch to configure nova by default to have shutdown_timeout=0 in gate runs:

https://review.opendev.org/670591

This will arguably mask the issue in the gate, but it's probably something we should have done to improve gate performance when graceful shutdowns were added years ago.
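
For anyone who wants the same behaviour locally, the change boils down to nova.conf on the computes ending up with the following (see the review above for the actual devstack implementation):

  [DEFAULT]
  shutdown_timeout = 0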

Revision history for this message
Matt Riedemann (mriedem) wrote :

Interestingly, I see hits in logstash on the tempest-full* jobs, which configure devstack with FORCE_CONFIG_DRIVE=true. That means the guest should have a config drive attached and shouldn't need to go to the metadata API at all. Maybe there is some regression in cloud-init?
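
One thing worth checking from a held node (or a console log) would be whether the config drive is actually attached and readable in the guest, e.g.:

  # nova labels the config drive "config-2"
  $> sudo blkid -t LABEL=config-2
  $> sudo mount -o ro /dev/disk/by-label/config-2 /mnt && ls /mnt/openstack/latest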

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/660766

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It still happened 7 times in the last 7 days, so I'm marking this as Confirmed.

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium