OpenStack Compute (nova)

"Timed out waiting for thing ... to become ACTIVE" causes tempest-dsvm-* failures

Bug #1254890 reported by Davanum Srinivas (DIMS) on 2013-11-25

114

This bug affects 23 people

	Status	Importance	Assigned to	Milestone
OpenStack Compute (nova)	Invalid	Undecided	Unassigned
Havana	Fix Released	Undecided	Unassigned	OpenStack Compute (nova) 2013.2.4
neutron	Invalid	Undecided	Unassigned
tempest	Fix Released	Low	Unassigned

Bug Description

Separate out bug from:
https://bugs.launchpad.net/neutron/+bug/1250168/comments/23

Logstash query from elastic-recheck:
message:"Details: Timed out waiting for thing" AND message:"to become ACTIVE"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRGV0YWlsczogVGltZWQgb3V0IHdhaXRpbmcgZm9yIHRoaW5nXCIgQU5EIG1lc3NhZ2U6XCJ0byBiZWNvbWUgQUNUSVZFXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA4NDc1MzM2MDR9

See original description

Tags:

Davanum Srinivas (DIMS) (dims-v) on 2013-11-25

affects:

neutron → nova

Davanum Srinivas (DIMS) (dims-v) on 2013-11-25

Changed in nova:
importance:	Undecided → High

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-12-03:

Looking at
http://logs.openstack.org/08/59108/1/gate/gate-tempest-dsvm-neutron-large-ops/694acc7/logs/

http://paste.openstack.org/show/54362/ - "Waiting for" messages stop at 14:17:38.668 (should be one every second). Looks like stuck in call_until_true loop. then it wakes up after 5 mins after the sleep and decides to get out of the loop w/o checking one last time if the condition is true. curiously server becomes "ACTIVE" at "2013-12-03T14:22:07Z"

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-12-03: Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/59769

Changed in tempest:
assignee:	nobody → Davanum Srinivas (DIMS) (dims-v)
status:	New → In Progress

Revision history for this message

Matt Riedemann (mriedem) wrote on 2013-12-03: Re: "Timed out waiting for thing" causes neutron-large-ops failures

Some more investigation in bug 1232303.

Abhishek Chanda (abhishek-i) on 2013-12-10

Changed in nova:
status:	New → Triaged

Sean Dague (sdague) on 2013-12-12

Changed in tempest:
importance:	Undecided → Low

Alan Pevec (apevec) on 2013-12-13

summary:

- "Timed out waiting for thing" causes neutron-large-ops failures
+ "Timed out waiting for thing" causes tempest-dsvm-neutron-* failures

Revision history for this message

Anita Kuno (anteaya) wrote on 2013-12-14: Re: "Timed out waiting for thing" causes tempest-dsvm-neutron-* failures

http://bit.ly/1gwADji is the logstash url I am using

Eugene Nikanorov (enikanorov) on 2013-12-16

Changed in neutron:
assignee:	nobody → Eugene Nikanorov (enikanorov)

Revision history for this message

John Dickinson (notmyname) wrote on 2013-12-17:

seems to also cause gate-tempest-dsvm-full to fail

http://logs.openstack.org/32/58332/2/gate/gate-tempest-dsvm-full/25f0684/console.html

Revision history for this message

Sean Dague (sdague) wrote on 2013-12-18:

This is currently the 2nd highest reset problem after the SSH bug - http://status.openstack.org/elastic-recheck/

Changed in nova:
importance:	High → Critical

Davanum Srinivas (DIMS) (dims-v) on 2013-12-18

Changed in tempest:
assignee:	Davanum Srinivas (DIMS) (dims-v) → nobody
status:	In Progress → Confirmed

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-12-18:

I think i see at least one pattern.

1) Looking at log [1]

Found this line:
Details: Timed out waiting for thing 2da86cc2-0c74-4712-8e7c-36280d231279 to become ACTIVE

2) Looking at the n-cpu-log, find the request id [2]

2013-12-18 19:29:21.909 AUDIT nova.compute.manager [req-3333ca13-880e-4df7-84bf-35b68dd4490a demo demo] [instance: 2da86cc2-0c74-4712-8e7c-36280d231279] Starting instance...

3) Find the last instance of the request id
2013-12-18 19:29:22.792 DEBUG nova.openstack.common.processutils [req-3333ca13-880e-4df7-84bf-35b68dd4490a demo demo] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf mount /dev/nbd6 /tmp/openstack-vfs-localfs_CB_RP execute /opt/stack/new/nova/nova/openstack/common/processutils.py:147

Could mount be hung? given that /dev/nbd6 is being reused (see [3])?

[1] http://logs.openstack.org/30/62530/2/gate/gate-tempest-dsvm-neutron-pg/2e085b3/console.html
[2] http://logs.openstack.org/30/62530/2/gate/gate-tempest-dsvm-neutron-pg/2e085b3/logs/screen-n-cpu.txt.gz
[3] http://logs.openstack.org/30/62530/2/gate/gate-tempest-dsvm-neutron-pg/2e085b3/logs/screen-n-cpu.txt.gz#_2013-12-18_19_28_32_684

Other logs that has the same issue:
http://logs.openstack.org/43/59743/12/check/check-tempest-dsvm-neutron-pg/689a38c/logs/screen-n-cpu.txt.gz#_2013-12-18_15_11_50_960
http://logs.openstack.org/85/61085/2/check/check-tempest-dsvm-neutron/bb95e64/logs/screen-n-cpu.txt.gz#_2013-12-18_12_14_07_163

Mark McClain (markmcclain) on 2013-12-19

Changed in neutron:
status:	New → Invalid

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2013-12-19:

I can confirm DIMS findings.
The instance being spawned got stuck on mounting the fs

Eugene Nikanorov (enikanorov) on 2013-12-19

Changed in neutron:
assignee:	Eugene Nikanorov (enikanorov) → nobody

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-12-20:

looks like users are seeing it too
https://ask.openstack.org/en/question/6808/new-vm-instance-stuck-in-spawning-state/

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-12-20:

#10

I think i'll reopen this bug for the stale nbd connection problem and leave this bug as catch all for other related "Timed out" issues.

https://bugs.launchpad.net/nova/+bug/973413

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2013-12-21:

#11

I'm putting it back to neutron. Most of the failures with this bug were seen with neutron-pg or neutron-(pg)-isolated jobs.
Still it has something to do with neutron, may be in indirect way.

Changed in neutron:
status:	Invalid → Incomplete
assignee:	nobody → Eugene Nikanorov (enikanorov)

Revision history for this message

Salvatore Orlando (salvatore-orlando) wrote on 2013-12-22:

#12

I have been investigating neutron's connection as well, since I had an exceptionally high failure rate with neutron jobs.
In neutron logs there is nothing which might point to a failure leading to an instance not being spawn. More precisely, I have been looking for neutron API calls not returning or returning with error codes which might have caused the thread processing network info asynchronously to never end. However, my research was inconclusive.

Failure vs build names [1] shows a neutron job joint 1st in the highest ranked failing jobs. I am not sure however whether these numbers can confirm a skew in the failure rate towards neutron.

[1] http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiRGV0YWlsczogVGltZWQgb3V0IHdhaXRpbmcgZm9yIHRoaW5nXCIgQU5EIG1lc3NhZ2U6XCJ0byBiZWNvbWVcIiBBTkQgIChtZXNzYWdlOlwiQUNUSVZFXCIgT1IgbWVzc2FnZTpcImluLXVzZVwiIE9SIG1lc3NhZ2U6XCJhdmFpbGFibGVcIilcbiIsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50Iiwib2Zmc2V0IjowLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJtb2RlIjoic2NvcmUiLCJhbmFseXplX2ZpZWxkIjoiYnVpbGRfbmFtZSJ9

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2013-12-22:

#13

Almost each and every failure with 'waiting for thing' that I've seen is related to postgress jobs. I think that fact is worth digging into.

By the way, I have also checked sys logs for the nbd devices, unsuccessfull unmounting of which is the direct cause of the issue.
Every unmount operation seem to result in the same logs being produced, but for one of them unmounting fails with EBUSY in user mode.

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-12-22:

#14

Eugene,

That umount failure seems to go away with this patch - https://review.openstack.org/#/c/63304/ - please do take a look.

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-12-24:

#15

"check experimental" now runs additional jobs (https://review.openstack.org/#/c/63810/). The jobs show 63304 is not good (i.e. "umount -l" and "qemu-nbd -n" do not help). We need to find another way.

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2013-12-24:

#16

I'm thinking, can we purely simulate the issue by mounting/unmounting fs with qemu-nbd? I mean without nova or devstack at all?

It could be helpfull to know what state this ndb came into once unmounting has failed. And what needs to be done to reuse it.

It could also make sense perform some sanity check before trying to mount it. Say, we mount it in a separate thread and wait for few seconds. If mounting did not succeed, kill the thread and find another device.
What do you think?

Eugene Nikanorov (enikanorov) on 2013-12-30

Changed in neutron:
status:	Incomplete → Invalid
assignee:	Eugene Nikanorov (enikanorov) → nobody

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-12-31: Fix merged to nova (master)

#17

Reviewed: https://review.openstack.org/64383
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd3f96e91581f465a52d10a212eb51b147dc38b5
Submitter: Jenkins
Branch: master

commit dd3f96e91581f465a52d10a212eb51b147dc38b5
Author: Davanum Srinivas <email address hidden>
Date: Sun Dec 29 19:18:04 2013 -0500

Fix for qemu-nbd hang

    NBD device once used seem to run into intermittent trouble when
    used with mount repeatedly. Adding code to explicitly flush the
    device buffers using 'blockdev --flushbufs'.

    Closes-Bug: #973413
    Partial-Bug: #1254890
    Change-Id: I2b7053b9a069d6e82f6f6baf9ad480efa4388d91

Russell Bryant (russellb) on 2014-01-07

Changed in nova:
milestone:	none → icehouse-2

Revision history for this message

Jay Pipes (jaypipes) wrote on 2014-01-09: Re: "Timed out waiting for thing" causes tempest-dsvm-neutron-* failures

#18

I think the status should be reverted from Triaged to Confirmed, since we no longer seem to have a solid suggested fix or root cause for this.

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2014-01-09:

#19

Jay, but we also don't see it appearing as well. Do you think we need to keep it in opened state?

Russell Bryant (russellb) on 2014-01-09

Changed in nova:
status:	Triaged → Confirmed

Russell Bryant (russellb) on 2014-01-09

summary:	- "Timed out waiting for thing" causes tempest-dsvm-neutron-* failures + "Timed out waiting for thing" causes tempest-dsvm-* failures
description:	updated

Revision history for this message

Russell Bryant (russellb) wrote on 2014-01-13: Re: "Timed out waiting for thing" causes tempest-dsvm-* failures

#20

https://etherpad.openstack.org/p/nova-gate-issue-tracking

I think this issue will effectively be resolved by https://review.openstack.org/#/c/65805/

Changed in nova:
assignee:	nobody → Russell Bryant (russellb)

Revision history for this message

Russell Bryant (russellb) wrote on 2014-01-16:

#21

example sysstat showing that the test node is maxed out - http://logs.openstack.org/81/66081/1/gate/gate-tempest-dsvm-postgres-full/57dd449/logs/screen-sysstat.txt.gz

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-01-18: Fix proposed to nova (stable/havana)

#22

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/67662

Thierry Carrez (ttx) on 2014-01-22

Changed in nova:
milestone:	icehouse-2 → icehouse-3

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-01-25: Fix merged to nova (master)

#23

Reviewed: https://review.openstack.org/66740
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a0891ad0af316f5423c106e11ca2af7b17b76dd3
Submitter: Jenkins
Branch: master

commit a0891ad0af316f5423c106e11ca2af7b17b76dd3
Author: Davanum Srinivas <email address hidden>
Date: Tue Jan 14 19:29:46 2014 -0500

Additional check for qemu-nbd hang

    /sys/block/*device*/pid check is not enough. I see that the unix
    socket used by the device my be stuck as well, so let's add another
    check for the path to the unix socket for the device as well to
    figure out if the device is free. Complain loud and clear that the
    qemu-nbd is leaking resources.

    Change-Id: I28cedffba7a9915ef6f7888989e40e4d0de475c6
    Closes-Bug: #973413
    Partial-Bug: #1254890

Ken'ichi Ohmichi (oomichi) on 2014-01-27

Changed in tempest:
assignee:	nobody → Ken'ichi Ohmichi (oomichi)

Revision history for this message

Ken'ichi Ohmichi (oomichi) wrote on 2014-01-27: Re: "Timed out waiting for thing" causes tempest-dsvm-* failures

#24

I investigated this problem from http://logs.openstack.org/36/69236/2/check/check-tempest-dsvm-full/8820082/ .

At 2014-01-26 22:04:03.329, the following traceback is written on screen-n-cpu.txt.gz.

Traceback (most recent call last):
  File "/opt/stack/new/nova/nova/compute/manager.py", line 3844, in _attach_volume
    encryption=encryption)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1166, in attach_volume
    disk_info)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1124, in volume_driver_method
    return method(connection_info, *args, **kwargs)
  File "/opt/stack/new/nova/nova/openstack/common/lockutils.py", line 249, in inner
    return f(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/volume.py", line 305, in connect_volume
    % (host_device))
NovaException: iSCSI device not found at /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d-lun-1

However "iscsiadm --login" run without any errors before the above traceback:
2014-01-26 22:03:47.496 iscsiadm -m node -T iqn.2010-10.org.openstack:volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d -p 127.0.0.1:3260 --login

So I cannot find the cause why /dev/disk/by-path/ link did not appear.

Changed in tempest:
assignee:	Ken'ichi Ohmichi (oomichi) → nobody

Revision history for this message

Salvatore Orlando (salvatore-orlando) wrote on 2014-01-27:

#25

I found out on friday that every time an hang is observed causing bug 1254890 there is a kernel crash.
This happens at least on neutron jobs and and is related to namespace operations.

This is the kernel crash dump from the logs: http://paste.openstack.org/show/61869/

Revision history for this message

Masayuki Igawa (igawa) wrote on 2014-01-27:

#26

I've also investigated this bug. I've noticed that many same kernel messages in the syslog[1].

Jan 26 22:07:03 localhost kernel: [ 2800.881458] connection4:0: detected conn error (1020)
Jan 26 22:07:06 localhost kernel: [ 2803.885615] connection4:0: detected conn error (1019)
Jan 26 22:07:09 localhost kernel: [ 2806.889353] connection4:0: detected conn error (1019)
:
over 60 times.(connection4:0 only. other connections appeared only one time)

I think "4" is the volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d's sid because of this log:
"2014-01-26 22:03:48.840 DEBUG nova.virt.libvirt.volume [req-a142bdbd-df70-4c07-8df9-0b5b27970925 tempest.scenario.manager-tempest-1690504423-user tempest.scenario.manager-tempest-1690504423-tenant] iscsiadm ('--rescan',): stdout=Rescanning session [sid: 4, target: iqn.2010-10.org.openstack:volume-6f8b43b1-5c3b-4919-8dc2-7041e0966f7d, portal: 127.0.0.1,3260]"[2]

However, I don't know why this message appeared many times...

[1] http://logs.openstack.org/36/69236/2/check/check-tempest-dsvm-full/8820082/logs/syslog.txt.gz
[2] http://logs.openstack.org/36/69236/2/check/check-tempest-dsvm-full/8820082/logs/screen-n-cpu.txt.gz#_2014-01-26_22_03_49_006

Revision history for this message

Russell Bryant (russellb) wrote on 2014-01-27:

#27

Filed a new bug to split out the cases of waiting for a volume to become in-use/active from waiting for an instance to become ACTIVE.

https://bugs.launchpad.net/cinder/+bug/1273292

Russell Bryant (russellb) on 2014-01-27

description:	updated
summary:	- "Timed out waiting for thing" causes tempest-dsvm-* failures + "Timed out waiting for thing ... to become ACTIVE" causes tempest-dsvm-* + failures

Revision history for this message

Russell Bryant (russellb) wrote on 2014-01-27:

#28

Based on the kernel findings, I'm looking into getting the kernel upgraded on our test nodes.

Right now they run Ubuntu Precise, but there are newer kernels available: https://wiki.ubuntu.com/Kernel/LTSEnablementStack

I think we should try getting linux-generic-lts-saucy installed.

description:

updated

Christopher Yeoh (cyeoh-0) on 2014-01-28

tags:

added: gate-failure

Revision history for this message

Mark McLoughlin (markmc) wrote on 2014-02-07:

#29

See also bug #1277494 - in this case, a thread appears to get stuck in libvirt driver's to_xml() function

To see if this is the issue with a given failure do:

grep '\(Start\|End\) to_xml' screen-n-cpu.txt.gz

if the last result from this grep is 'Start to_xml()', it could be the same problem

Would be good to figure out whether this is just a once off underlying cause of a timeout, or more common

Joe Gordon (jogo) on 2014-02-11

tags:

added: testing

Russell Bryant (russellb) on 2014-02-21

Changed in nova:
assignee:	Russell Bryant (russellb) → nobody
importance:	Critical → High
milestone:	icehouse-3 → none

Revision history for this message

Dolph Mathews (dolph) wrote on 2014-04-09:

#30

Was the error message for this revised from "ACTIVE" to "available", or are those different states? e.g.

2014-04-09 16:14:00.987 | Traceback (most recent call last):
2014-04-09 16:14:00.988 | File "tempest/test.py", line 122, in wrapper
2014-04-09 16:14:00.988 | return f(self, *func_args, **func_kwargs)
2014-04-09 16:14:00.988 | File "tempest/scenario/test_minimum_basic.py", line 122, in test_minimum_basic_scenario
2014-04-09 16:14:00.988 | self.cinder_create()
2014-04-09 16:14:00.988 | File "tempest/scenario/test_minimum_basic.py", line 64, in cinder_create
2014-04-09 16:14:00.988 | self.volume = self.create_volume()
2014-04-09 16:14:00.988 | File "tempest/scenario/manager.py", line 345, in create_volume
2014-04-09 16:14:00.988 | self.status_timeout(client.volumes, volume.id, 'available')
2014-04-09 16:14:00.988 | File "tempest/scenario/manager.py", line 184, in status_timeout
2014-04-09 16:14:00.988 | not_found_exception=not_found_exception)
2014-04-09 16:14:00.988 | File "tempest/scenario/manager.py", line 247, in _status_timeout
2014-04-09 16:14:00.989 | raise exceptions.TimeoutException(message)
2014-04-09 16:14:00.989 | TimeoutException: Request timed out
2014-04-09 16:14:00.989 | Details: Timed out waiting for thing 6b99ebc1-9f80-4a38-86c8-7fe606126bd7 to become available

Source: http://logs.openstack.org/21/86321/9/check/check-grenade-dsvm/deb2e1e/console.html

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-07-31: Fix merged to nova (stable/havana)

#31

Reviewed: https://review.openstack.org/67662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1681b957bfa236c69dc233bc7661fff9b5113ad0
Submitter: Jenkins
Branch: stable/havana

commit 1681b957bfa236c69dc233bc7661fff9b5113ad0
Author: Davanum Srinivas <email address hidden>
Date: Sun Dec 29 19:18:04 2013 -0500

Fix for qemu-nbd hang

    NBD device once used seem to run into intermittent trouble when
    used with mount repeatedly. Adding code to explicitly flush the
    device buffers using 'blockdev --flushbufs'.

NOTE(dmllr): Also merged regression fix:

Consolidate the blockdev related filters

        Seeing a "/usr/local/bin/nova-rootwrap: Unauthorized command" error
        in the logs when "blockdev --flushbufs" is executed because of the
        existing blockdev in compute.filters. We need to merge both into a
        single RegExpFilter

Change-Id: Ic323ff00e5c23786a6e376e67a4ad08f92708aef

(cherry-picked from dd3f96e91581f465a52d10a212eb51b147dc38b5)
(cherry-picked from 31b2791e6be1768c410de5fa32283fcb637bda57)

    Closes-Bug: #973413
    Partial-Bug: #1254890
    Change-Id: I2b7053b9a069d6e82f6f6baf9ad480efa4388d91

tags:

added: in-stable-havana

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2014-09-06:

#32

I believe this has been fixed with upgrades to trusty and newer libvirt. please reopen if needed.

Changed in nova:
status:	Confirmed → Fix Committed
Changed in tempest:
status:	Confirmed → Fix Committed

Matthew Treinish (treinish) on 2014-09-09

Changed in tempest:
status:	Fix Committed → Fix Released

Revision history for this message

Joe Gordon (jogo) wrote on 2014-09-11:

#33

We are still seeing hits for this in elastic-recheck

Revision history for this message

Joe Gordon (jogo) wrote on 2014-09-11:

#34

May now be a duplicate of 1357476

Changed in nova:
importance:	High → Critical
status:	Fix Committed → Confirmed

Revision history for this message

Joe Gordon (jogo) wrote on 2014-09-11:

#35

marked as critical because 22 hits in elastic-recheck in 24 hours

Joe Gordon (jogo) on 2014-09-17

Changed in nova:
milestone:	none → juno-rc1

Revision history for this message

Joe Gordon (jogo) wrote on 2014-09-18:

#36

No hits in the gate queue in a few days looks like this may be fixed?

Changed in nova:
importance:	Critical → Undecided
status:	Confirmed → Incomplete

Revision history for this message

John Garbutt (johngarbutt) wrote on 2014-09-18:

#37

Removing from juno-rc1 as I don't think this should block the juno release.

Changed in nova:
milestone:	juno-rc1 → none

Matt Riedemann (mriedem) on 2014-10-16

Changed in nova:
status:	Incomplete → Invalid

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.