test_baremetal_server_ops fails with timeout waiting for provision_state to change

Bug #1393099 reported by Jeremy Stanley
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Core Infrastructure
Confirmed
Undecided
Unassigned
OpenStack-Gate
Fix Released
Undecided
Unassigned

Bug Description

Traceback (most recent call last):
  File "tempest/test.py", line 113, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_baremetal_basic_ops.py", line 124, in test_baremetal_server_ops
    self.boot_instance()
  File "tempest/scenario/manager.py", line 1133, in boot_instance
    timeout=CONF.baremetal.active_timeout)
  File "tempest/scenario/manager.py", line 1061, in wait_provisioning_state
    target_states=state, timeout=timeout)
  File "tempest/scenario/manager.py", line 1056, in _node_state_timeout
    raise exceptions.TimeoutException(msg)
TimeoutException: Request timed out
Details: Timed out waiting for node 830ac6d1-49a7-45ee-b15d-219f816cb6cc to reach provision_state state(s) ['active']

http://logs.openstack.org/28/134728/1/check/check-tempest-dsvm-ironic-pxe_ssh-nv/6e9daa1/console.html#_2014-11-15_22_39_45_432

Tags: gate-failure
Jeremy Stanley (fungi)
Changed in openstack-ci:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to elastic-recheck (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/134738

Revision history for this message
Jeremy Stanley (fungi) wrote :

26 job runs hit that query in the past 48 hours, all were ironic-based devstack jobs and 100% ended in job failure.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

There's either an issue with the qemu baremetal node coming up after power on, or it is coming up an unable to netboot. In either case, we're not able to see any of the early boot messages in the console log [1], so its hard to say. I did some quick tracing of the neutron port update and it doesn't look like we're racing anywhere around write out of DHCP and restart of dnsmasq.

It's worth noting I've been seeing similar mystery failures while keeping an eye on the *-tempest-dsvm-ironic-parallel jobs, which exercises the ironic boot path but many more times and in parallel.

[1] http://logs.openstack.org/28/134728/1/check/check-tempest-dsvm-ironic-pxe_ssh-nv/6e9daa1/logs/ironic-bm-logs/baremetalbrbm_0_console.txt.gz

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Looking closer, the VM is coming up, DHCP'ing but never hitting the TFTP server for its deploy kernel/ramdisk

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to elastic-recheck (master)

Reviewed: https://review.openstack.org/134738
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=13241cbb0a0ed821d04825b6631a2d42b0026aaf
Submitter: Jenkins
Branch: master

commit 13241cbb0a0ed821d04825b6631a2d42b0026aaf
Author: Jeremy Stanley <email address hidden>
Date: Sat Nov 15 23:19:08 2014 +0000

    Add query for test_baremetal_server_ops timeout

    Change-Id: Ieac80bbc37113f028d74c4d6e5570fca0ba84805
    Related-Bug: #1393099

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

https://review.openstack.org/#/c/135068/ adds sgabios to log early boot bios stuff to serial console, hopefully giving us some info from guest about whats going on with the failing tftp get

Changed in openstack-ci:
status: Incomplete → Confirmed
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

With early boot debug logging happening, I see the failed node boots, gets its Neutron managed DHCP IP, attempts to TFTP from HOST_IP, but cannot reach it. From what I can tell there is no race happening with Neutron: DHCP config is written out correctly (as evidenced by the machine booting, receiving IP and TFTP boot address, but unable to reach back to HOST_IP. I wonder if there is an OVS race or bug triggering under the hood...

net0: 52:54:00:70:f0:8e using 82540em on PCI00:04.0 (open)
  [Link:up, TX:0 TXE:0 RX:0 RXE:0]
Configuring (net0 52:54:00:70:f0:8e)...... ok
net0: 10.1.0.5/255.255.240.0 gw 10.1.0.1
Next server: 10.0.1.160
Filename: pxelinux.0
10.0.1.160/pxelinux.0............... Connection timed out (http://ipxe.or
No more network devices

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to system-config (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/135694

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to project-config (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/136439

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on project-config (master)

Change abandoned by Spencer Krum (<email address hidden>) on branch: master
Review: https://review.openstack.org/136439
Reason: We're doing a different thing.

Jeremy Stanley (fungi)
Changed in openstack-gate:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to system-config (master)

Reviewed: https://review.openstack.org/135694
Committed: https://git.openstack.org/cgit/openstack-infra/system-config/commit/?id=6cc7b4fd0749ddaafa6ddc07c53906248702f883
Submitter: Jenkins
Branch: master

commit 6cc7b4fd0749ddaafa6ddc07c53906248702f883
Author: Adam Gandelman <email address hidden>
Date: Wed Nov 19 10:47:36 2014 -0800

    Add iptables rule to log dropped packets

    Adds a rule to the bottom of the openstack-INPUT chain to log any
    packets that make it this far and are on their way to being rejected.
    This may help us find out why Devstack Ironic VM nodes occasionally cannot
    reach the host's TFTP server.

    Change-Id: I3980284a93f50ac8db16769f6ad87382aa55e898
    Related-bug: #1393099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack-gate (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/257637

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack-gate (master)

Reviewed: https://review.openstack.org/257637
Committed: https://git.openstack.org/cgit/openstack-infra/devstack-gate/commit/?id=4d9b178a87cc86569f60d271ad2b9a68ac50bdfd
Submitter: Jenkins
Branch: master

commit 4d9b178a87cc86569f60d271ad2b9a68ac50bdfd
Author: Jim Rollenhagen <email address hidden>
Date: Mon Dec 14 16:56:11 2015 -0800

    Bump ironic timeout way up

    Ironic builds on slower hosts tend to time out while PXE booting the
    deploy ramdisk while transferring the image, producing great havoc on
    the gate. We're investigating various approaches to fixing this (for
    example, switching kvm's network driver didn't seem to help much), but
    in the meantime let's be good citizens of gate resources and let things
    run in full, even if PXE booting is slow.

    This bumps the ironic timeout to 600, which should be more than enough
    time for pxe booting to happen on slow hosts. It bumps both the value
    that Tempest needs and the value ironic uses to time out a deploy.

    As tempest defaults to 500 seconds for OS_TEST_TIMEOUT in .testr.conf,
    we also need to bump that. To do so, this refactors the code that
    launches tox to be able to inject OS_TEST_TIMEOUT, if it's set. We need
    to inject that in the command directly, since tox is launched with sudo
    and environment variables are not carried into that process.

    Change-Id: I79c2b026b3f0bac0b96edef0418546bc4e480e3c
    Related-Bug: #1393099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack-gate (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/259035

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack-gate (master)

Reviewed: https://review.openstack.org/259035
Committed: https://git.openstack.org/cgit/openstack-infra/devstack-gate/commit/?id=729ce7986050f377f3e6ff697725057dd525c5a9
Submitter: Jenkins
Branch: master

commit 729ce7986050f377f3e6ff697725057dd525c5a9
Author: Jim Rollenhagen <email address hidden>
Date: Thu Dec 17 06:09:54 2015 -0800

    Bump ironic test timeout to 900

    We're still failing occassionally; data tells us that when we get a slow
    host, it's failing during boot on the last cleaning phase.

    We have found that it's mostly userspace taking so long at boot, so
    we're going down the path of making that faster.

    Bump the timeout while we work on that.

    Change-Id: I9287811d7cdc10e00b3e59c4e6adef8bbe233d2b
    Related-Bug: #1393099

Matt Riedemann (mriedem)
Changed in openstack-gate:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.