"Timed out waiting for thing ... to become ACTIVE" causes tempest-dsvm-* failures

Bug #1254890 reported by Davanum Srinivas (DIMS) on 2013-11-25
114
This bug affects 23 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Undecided
Unassigned
Havana
Undecided
Unassigned
neutron
Undecided
Unassigned
tempest
Low
Unassigned
affects: neutron → nova
Changed in nova:
importance: Undecided → High

Looking at
http://logs.openstack.org/08/59108/1/gate/gate-tempest-dsvm-neutron-large-ops/694acc7/logs/

http://paste.openstack.org/show/54362/ - "Waiting for" messages stop at 14:17:38.668 (should be one every second). Looks like stuck in call_until_true loop. then it wakes up after 5 mins after the sleep and decides to get out of the loop w/o checking one last time if the condition is true. curiously server becomes "ACTIVE" at "2013-12-03T14:22:07Z"

Fix proposed to branch: master
Review: https://review.openstack.org/59769

Changed in tempest:
assignee: nobody → Davanum Srinivas (DIMS) (dims-v)
status: New → In Progress
Changed in nova:
status: New → Triaged
Sean Dague (sdague) on 2013-12-12
Changed in tempest:
importance: Undecided → Low
Alan Pevec (apevec) on 2013-12-13
summary: - "Timed out waiting for thing" causes neutron-large-ops failures
+ "Timed out waiting for thing" causes tempest-dsvm-neutron-* failures
Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Sean Dague (sdague) wrote :

This is currently the 2nd highest reset problem after the SSH bug - http://status.openstack.org/elastic-recheck/

Changed in nova:
importance: High → Critical
Changed in tempest:
assignee: Davanum Srinivas (DIMS) (dims-v) → nobody
status: In Progress → Confirmed

I think i see at least one pattern.

1) Looking at log [1]

   Found this line:
   Details: Timed out waiting for thing 2da86cc2-0c74-4712-8e7c-36280d231279 to become ACTIVE

2) Looking at the n-cpu-log, find the request id [2]

   2013-12-18 19:29:21.909 AUDIT nova.compute.manager [req-3333ca13-880e-4df7-84bf-35b68dd4490a demo demo] [instance: 2da86cc2-0c74-4712-8e7c-36280d231279] Starting instance...

3) Find the last instance of the request id
2013-12-18 19:29:22.792 DEBUG nova.openstack.common.processutils [req-3333ca13-880e-4df7-84bf-35b68dd4490a demo demo] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf mount /dev/nbd6 /tmp/openstack-vfs-localfs_CB_RP execute /opt/stack/new/nova/nova/openstack/common/processutils.py:147

Could mount be hung? given that /dev/nbd6 is being reused (see [3])?

[1] http://logs.openstack.org/30/62530/2/gate/gate-tempest-dsvm-neutron-pg/2e085b3/console.html
[2] http://logs.openstack.org/30/62530/2/gate/gate-tempest-dsvm-neutron-pg/2e085b3/logs/screen-n-cpu.txt.gz
[3] http://logs.openstack.org/30/62530/2/gate/gate-tempest-dsvm-neutron-pg/2e085b3/logs/screen-n-cpu.txt.gz#_2013-12-18_19_28_32_684

Other logs that has the same issue:
http://logs.openstack.org/43/59743/12/check/check-tempest-dsvm-neutron-pg/689a38c/logs/screen-n-cpu.txt.gz#_2013-12-18_15_11_50_960
http://logs.openstack.org/85/61085/2/check/check-tempest-dsvm-neutron/bb95e64/logs/screen-n-cpu.txt.gz#_2013-12-18_12_14_07_163

Changed in neutron:
status: New → Invalid
Eugene Nikanorov (enikanorov) wrote :

I can confirm DIMS findings.
The instance being spawned got stuck on mounting the fs

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → nobody

I think i'll reopen this bug for the stale nbd connection problem and leave this bug as catch all for other related "Timed out" issues.

https://bugs.launchpad.net/nova/+bug/973413

Eugene Nikanorov (enikanorov) wrote :

I'm putting it back to neutron. Most of the failures with this bug were seen with neutron-pg or neutron-(pg)-isolated jobs.
Still it has something to do with neutron, may be in indirect way.

Changed in neutron:
status: Invalid → Incomplete
assignee: nobody → Eugene Nikanorov (enikanorov)

I have been investigating neutron's connection as well, since I had an exceptionally high failure rate with neutron jobs.
In neutron logs there is nothing which might point to a failure leading to an instance not being spawn. More precisely, I have been looking for neutron API calls not returning or returning with error codes which might have caused the thread processing network info asynchronously to never end. However, my research was inconclusive.

Failure vs build names [1] shows a neutron job joint 1st in the highest ranked failing jobs. I am not sure however whether these numbers can confirm a skew in the failure rate towards neutron.

[1] http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiRGV0YWlsczogVGltZWQgb3V0IHdhaXRpbmcgZm9yIHRoaW5nXCIgQU5EIG1lc3NhZ2U6XCJ0byBiZWNvbWVcIiBBTkQgIChtZXNzYWdlOlwiQUNUSVZFXCIgT1IgbWVzc2FnZTpcImluLXVzZVwiIE9SIG1lc3NhZ2U6XCJhdmFpbGFibGVcIilcbiIsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50Iiwib2Zmc2V0IjowLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJtb2RlIjoic2NvcmUiLCJhbmFseXplX2ZpZWxkIjoiYnVpbGRfbmFtZSJ9

Eugene Nikanorov (enikanorov) wrote :

Almost each and every failure with 'waiting for thing' that I've seen is related to postgress jobs. I think that fact is worth digging into.

By the way, I have also checked sys logs for the nbd devices, unsuccessfull unmounting of which is the direct cause of the issue.
Every unmount operation seem to result in the same logs being produced, but for one of them unmounting fails with EBUSY in user mode.

Eugene,

That umount failure seems to go away with this patch - https://review.openstack.org/#/c/63304/ - please do take a look.

"check experimental" now runs additional jobs (https://review.openstack.org/#/c/63810/). The jobs show 63304 is not good (i.e. "umount -l" and "qemu-nbd -n" do not help). We need to find another way.

Eugene Nikanorov (enikanorov) wrote :

I'm thinking, can we purely simulate the issue by mounting/unmounting fs with qemu-nbd? I mean without nova or devstack at all?

It could be helpfull to know what state this ndb came into once unmounting has failed. And what needs to be done to reuse it.

It could also make sense perform some sanity check before trying to mount it. Say, we mount it in a separate thread and wait for few seconds. If mounting did not succeed, kill the thread and find another device.
What do you think?

Changed in neutron:
status: Incomplete → Invalid
assignee: Eugene Nikanorov (enikanorov) → nobody

Reviewed: https://review.openstack.org/64383
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd3f96e91581f465a52d10a212eb51b147dc38b5
Submitter: Jenkins
Branch: master

commit dd3f96e91581f465a52d10a212eb51b147dc38b5
Author: Davanum Srinivas <email address hidden>
Date: Sun Dec 29 19:18:04 2013 -0500

    Fix for qemu-nbd hang

    NBD device once used seem to run into intermittent trouble when
    used with mount repeatedly. Adding code to explicitly flush the
    device buffers using 'blockdev --flushbufs'.

    Closes-Bug: #973413
    Partial-Bug: #1254890
    Change-Id: I2b7053b9a069d6e82f6f6baf9ad480efa4388d91

Changed in nova:
milestone: none → icehouse-2

I think the status should be reverted from Triaged to Confirmed, since we no longer seem to have a solid suggested fix or root cause for this.

Eugene Nikanorov (enikanorov) wrote :

Jay, but we also don't see it appearing as well. Do you think we need to keep it in opened state?

Changed in nova:
status: Triaged → Confirmed
summary: - "Timed out waiting for thing" causes tempest-dsvm-neutron-* failures
+ "Timed out waiting for thing" causes tempest-dsvm-* failures
description: updated
Changed in nova:
assignee: nobody → Russell Bryant (russellb)
Thierry Carrez (ttx) on 2014-01-22
Changed in nova:
milestone: icehouse-2 → icehouse-3

Reviewed: https://review.openstack.org/66740
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a0891ad0af316f5423c106e11ca2af7b17b76dd3
Submitter: Jenkins
Branch: master

commit a0891ad0af316f5423c106e11ca2af7b17b76dd3
Author: Davanum Srinivas <email address hidden>
Date: Tue Jan 14 19:29:46 2014 -0500

    Additional check for qemu-nbd hang

    /sys/block/*device*/pid check is not enough. I see that the unix
    socket used by the device my be stuck as well, so let's add another
    check for the path to the unix socket for the device as well to
    figure out if the device is free. Complain loud and clear that the
    qemu-nbd is leaking resources.

    Change-Id: I28cedffba7a9915ef6f7888989e40e4d0de475c6
    Closes-Bug: #973413
    Partial-Bug: #1254890

Changed in tempest:
assignee: nobody → Ken'ichi Ohmichi (oomichi)

I investigated this problem from http://logs.openstack.org/36/69236/2/check/check-tempest-dsvm-full/8820082/ .

At 2014-01-26 22:04:03.329, the following traceback is written on screen-n-cpu.txt.gz.

Traceback (most recent call last):
  File "/opt/stack/new/nova/nova/compute/manager.py", line 3844, in _attach_volume
    encryption=encryption)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1166, in attach_volume
    disk_info)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1124, in volume_driver_method
    return method(connection_info, *args, **kwargs)
  File "/opt/stack/new/nova/nova/openstack/common/lockutils.py", line 249, in inner
    return f(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/volume.py", line 305, in connect_volume
    % (host_device))
NovaException: iSCSI device not found at /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d-lun-1

However "iscsiadm --login" run without any errors before the above traceback:
2014-01-26 22:03:47.496 iscsiadm -m node -T iqn.2010-10.org.openstack:volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d -p 127.0.0.1:3260 --login

So I cannot find the cause why /dev/disk/by-path/ link did not appear.

Changed in tempest:
assignee: Ken'ichi Ohmichi (oomichi) → nobody

I found out on friday that every time an hang is observed causing bug 1254890 there is a kernel crash.
This happens at least on neutron jobs and and is related to namespace operations.

This is the kernel crash dump from the logs: http://paste.openstack.org/show/61869/

Masayuki Igawa (igawa) wrote :

I've also investigated this bug. I've noticed that many same kernel messages in the syslog[1].

Jan 26 22:07:03 localhost kernel: [ 2800.881458] connection4:0: detected conn error (1020)
Jan 26 22:07:06 localhost kernel: [ 2803.885615] connection4:0: detected conn error (1019)
Jan 26 22:07:09 localhost kernel: [ 2806.889353] connection4:0: detected conn error (1019)
  :
over 60 times.(connection4:0 only. other connections appeared only one time)

I think "4" is the volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d's sid because of this log:
 "2014-01-26 22:03:48.840 DEBUG nova.virt.libvirt.volume [req-a142bdbd-df70-4c07-8df9-0b5b27970925 tempest.scenario.manager-tempest-1690504423-user tempest.scenario.manager-tempest-1690504423-tenant] iscsiadm ('--rescan',): stdout=Rescanning session [sid: 4, target: iqn.2010-10.org.openstack:volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d, portal: 127.0.0.1,3260]"[2]

However, I don't know why this message appeared many times...

[1] http://logs.openstack.org/36/69236/2/check/check-tempest-dsvm-full/8820082/logs/syslog.txt.gz
[2] http://logs.openstack.org/36/69236/2/check/check-tempest-dsvm-full/8820082/logs/screen-n-cpu.txt.gz#_2014-01-26_22_03_49_006

Russell Bryant (russellb) wrote :

Filed a new bug to split out the cases of waiting for a volume to become in-use/active from waiting for an instance to become ACTIVE.

https://bugs.launchpad.net/cinder/+bug/1273292

description: updated
summary: - "Timed out waiting for thing" causes tempest-dsvm-* failures
+ "Timed out waiting for thing ... to become ACTIVE" causes tempest-dsvm-*
+ failures
Russell Bryant (russellb) wrote :

Based on the kernel findings, I'm looking into getting the kernel upgraded on our test nodes.

Right now they run Ubuntu Precise, but there are newer kernels available: https://wiki.ubuntu.com/Kernel/LTSEnablementStack

I think we should try getting linux-generic-lts-saucy installed.

description: updated
tags: added: gate-failure
Mark McLoughlin (markmc) wrote :

See also bug #1277494 - in this case, a thread appears to get stuck in libvirt driver's to_xml() function

To see if this is the issue with a given failure do:

   grep '\(Start\|End\) to_xml' screen-n-cpu.txt.gz

if the last result from this grep is 'Start to_xml()', it could be the same problem

Would be good to figure out whether this is just a once off underlying cause of a timeout, or more common

Joe Gordon (jogo) on 2014-02-11
tags: added: testing
Changed in nova:
assignee: Russell Bryant (russellb) → nobody
importance: Critical → High
milestone: icehouse-3 → none
Dolph Mathews (dolph) wrote :

Was the error message for this revised from "ACTIVE" to "available", or are those different states? e.g.

2014-04-09 16:14:00.987 | Traceback (most recent call last):
2014-04-09 16:14:00.988 | File "tempest/test.py", line 122, in wrapper
2014-04-09 16:14:00.988 | return f(self, *func_args, **func_kwargs)
2014-04-09 16:14:00.988 | File "tempest/scenario/test_minimum_basic.py", line 122, in test_minimum_basic_scenario
2014-04-09 16:14:00.988 | self.cinder_create()
2014-04-09 16:14:00.988 | File "tempest/scenario/test_minimum_basic.py", line 64, in cinder_create
2014-04-09 16:14:00.988 | self.volume = self.create_volume()
2014-04-09 16:14:00.988 | File "tempest/scenario/manager.py", line 345, in create_volume
2014-04-09 16:14:00.988 | self.status_timeout(client.volumes, volume.id, 'available')
2014-04-09 16:14:00.988 | File "tempest/scenario/manager.py", line 184, in status_timeout
2014-04-09 16:14:00.988 | not_found_exception=not_found_exception)
2014-04-09 16:14:00.988 | File "tempest/scenario/manager.py", line 247, in _status_timeout
2014-04-09 16:14:00.989 | raise exceptions.TimeoutException(message)
2014-04-09 16:14:00.989 | TimeoutException: Request timed out
2014-04-09 16:14:00.989 | Details: Timed out waiting for thing 6b99ebc1-9f80-4a38-86c8-7fe606126bd7 to become available

Source: http://logs.openstack.org/21/86321/9/check/check-grenade-dsvm/deb2e1e/console.html

Reviewed: https://review.openstack.org/67662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1681b957bfa236c69dc233bc7661fff9b5113ad0
Submitter: Jenkins
Branch: stable/havana

commit 1681b957bfa236c69dc233bc7661fff9b5113ad0
Author: Davanum Srinivas <email address hidden>
Date: Sun Dec 29 19:18:04 2013 -0500

    Fix for qemu-nbd hang

    NBD device once used seem to run into intermittent trouble when
    used with mount repeatedly. Adding code to explicitly flush the
    device buffers using 'blockdev --flushbufs'.

    NOTE(dmllr): Also merged regression fix:

        Consolidate the blockdev related filters

        Seeing a "/usr/local/bin/nova-rootwrap: Unauthorized command" error
        in the logs when "blockdev --flushbufs" is executed because of the
        existing blockdev in compute.filters. We need to merge both into a
        single RegExpFilter

        Change-Id: Ic323ff00e5c23786a6e376e67a4ad08f92708aef

    (cherry-picked from dd3f96e91581f465a52d10a212eb51b147dc38b5)
    (cherry-picked from 31b2791e6be1768c410de5fa32283fcb637bda57)

    Closes-Bug: #973413
    Partial-Bug: #1254890
    Change-Id: I2b7053b9a069d6e82f6f6baf9ad480efa4388d91

tags: added: in-stable-havana

I believe this has been fixed with upgrades to trusty and newer libvirt. please reopen if needed.

Changed in nova:
status: Confirmed → Fix Committed
Changed in tempest:
status: Confirmed → Fix Committed
Changed in tempest:
status: Fix Committed → Fix Released
Joe Gordon (jogo) wrote :

We are still seeing hits for this in elastic-recheck

Joe Gordon (jogo) wrote :

May now be a duplicate of 1357476

Changed in nova:
importance: High → Critical
status: Fix Committed → Confirmed
Joe Gordon (jogo) wrote :

marked as critical because 22 hits in elastic-recheck in 24 hours

Joe Gordon (jogo) on 2014-09-17
Changed in nova:
milestone: none → juno-rc1
Joe Gordon (jogo) wrote :

No hits in the gate queue in a few days looks like this may be fixed?

Changed in nova:
importance: Critical → Undecided
status: Confirmed → Incomplete
John Garbutt (johngarbutt) wrote :

Removing from juno-rc1 as I don't think this should block the juno release.

Changed in nova:
milestone: juno-rc1 → none
Matt Riedemann (mriedem) on 2014-10-16
Changed in nova:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers