ironic jobs sometimes fail with node callback timeout

Bug #1408067 reported by Matt Riedemann
Affects: Ironic
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

This was on stable/juno in the gate:

http://logs.openstack.org/60/144760/1/gate/gate-grenade-dsvm-ironic-sideways/5d08c02/console.html#_2015-01-06_18_54_34_703

Details: (ServerAddressesTestJSON:setUpClass) Server 4e4b8db2-9b2c-470b-a1d9-028696199413 failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: spawning.

The last time that instance uuid shows up in the nova-compute logs is here:

http://logs.openstack.org/60/144760/1/gate/gate-grenade-dsvm-ironic-sideways/5d08c02/logs/new/screen-n-cpu.txt.gz#_2015-01-06_18_54_36_732

And there seem to be a ton of these:

2015-01-06 18:54:36.732 21016 DEBUG nova.virt.ironic.driver [-] [instance: 4e4b8db2-9b2c-470b-a1d9-028696199413] Still waiting for ironic node 1e0a7d23-b2b7-4fc4-b2d4-c98ad059ec8f to become ACTIVE: power_state="power on", target_power_state=None, provision_state="wait call-back", target_provision_state="deploy complete" _log_ironic_polling /opt/stack/new/nova/nova/virt/ironic/driver.py:150

Looking at the ironic conductor logs, the only time I see the instance_uuid is here:

http://logs.openstack.org/60/144760/1/gate/gate-grenade-dsvm-ironic-sideways/5d08c02/logs/new/screen-ir-cond.txt.gz#_2015-01-06_18_38_38_708

Shortly after that the node is powered on:

http://logs.openstack.org/60/144760/1/gate/gate-grenade-dsvm-ironic-sideways/5d08c02/logs/new/screen-ir-cond.txt.gz#_2015-01-06_18_39_02_570

There are a ton of rpc timeout debug messages in the ironic conductor log; I don't know whether that's normal or not.

I'm not familiar with the ironic code, but it looks like something is hung here. It seems we need some warning/info logging to trace when we think we're hung or hitting a timeout, because right now there are no errors in the nova or ironic logs, so I can't fingerprint anything in logstash to track this race.
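For illustration only (this is not nova's actual driver code, and the function and parameter names here are made up): a minimal sketch of the kind of timeout-aware polling being asked for, which logs at WARNING once the wait looks hung instead of only emitting DEBUG lines, so logstash would have something to fingerprint:

    import logging
    import time

    LOG = logging.getLogger(__name__)

    def wait_for_active(node_uuid, get_provision_state,
                        poll_interval=2, warn_after=240):
        # get_provision_state, poll_interval and warn_after are hypothetical
        # names for this sketch, not real nova/ironic options.
        start = time.time()
        warned = False
        while get_provision_state(node_uuid) != 'active':
            elapsed = time.time() - start
            if elapsed > warn_after and not warned:
                # The WARNING-level breadcrumb this bug asks for: a single
                # fingerprintable line instead of an endless DEBUG stream.
                LOG.warning('Node %s still not ACTIVE after %d seconds',
                            node_uuid, elapsed)
                warned = True
            LOG.debug('Still waiting for node %s to become ACTIVE', node_uuid)
            time.sleep(poll_interval)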

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ah there is a kernel panic in the baremetal logs:

http://logs.openstack.org/60/144760/1/gate/gate-grenade-dsvm-ironic-sideways/5d08c02/logs/ironic-bm-logs/baremetalbrbm_2_console.txt.gz

[ 422.148000] Kernel panic - not syncing: Fatal exception in interrupt

Revision history for this message
Matt Riedemann (mriedem) wrote :

From discussion in IRC: deploy_callback_timeout is at its default of 30 minutes in the gate run, while the tempest server build timeout is 5 minutes. If we lowered deploy_callback_timeout to something like 4 minutes, hitting a deploy timeout in ironic would give us a more specific error to fingerprint, but if it's too low we could be shooting ourselves in the foot.
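For reference, a sketch of the two timeouts being compared, assuming the standard option names (ironic's [conductor] deploy_callback_timeout, default 1800 seconds, and tempest's [compute] build_timeout, default 300 seconds); the 240-second value is just the 4-minute figure from this discussion, not necessarily what the gate ends up using:

    # ironic.conf
    [conductor]
    # Seconds to wait for the deploy ramdisk to call back before the
    # conductor fails the deploy (default 1800, i.e. 30 minutes).
    deploy_callback_timeout = 240

    # tempest.conf
    [compute]
    # Seconds tempest waits for a server to reach ACTIVE (default 300).
    build_timeout = 300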

It'd be nice to get the baremetal logs indexed by logstash, because then we could fingerprint on the kernel panic in those logs, which is the root error.

Changed in ironic:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

The following patches should set timeouts in such a way that we end up failing on an ironic-conductor timeout, rather than a generic client/tempest timeout, giving us something to fingerprint and track:

https://review.openstack.org/#/c/145326/ (devstack)
https://review.openstack.org/#/c/145327/ (devstack-gate)

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

17:50:00 adam_g | jroll, i think its happening periodically, i have some devstack patches up to make the failure result in an ironic error, so we have something to fingerprint and track
17:50:22 adam_g | ill link on the bug

summary: - gate-grenade-dsvm-ironic-sideways fails with instance build timeout
+ ironic jobs sometimes fail with instance build timeout
summary: - ironic jobs sometimes fail with instance build timeout
+ ironic jobs sometimes fail with node callback timeout
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

So the timeouts were adjusted so that the callback timeout is now 240 seconds (4 minutes). I'm wondering if this is too aggressive for IPA, or if there's a bug in the ramdisk. It looks like the time between power-on and callback is pretty variable, and it sometimes passes very close to the 4-minute timeout.

Revision history for this message
aeva black (tenbrae) wrote :

Just ran into some failures that look related to this.

One was due to deletion taking over 5 minutes:
  http://logs.openstack.org/36/172036/2/gate/gate-tempest-dsvm-ironic-agent_ssh/14006ac/console.html#_2015-04-13_18_56_21_743

Another was due to a kernel panic in the node (vm) leading to deploy timeout:
  http://logs.openstack.org/15/172715/2/check/check-tempest-dsvm-ironic-pxe_ssh/efef2e0/logs/ironic-bm-logs/baremetalbrbm_0_console.txt.gz

There are probably other causes of this bug. Let's start collating the failures here and then filing more specific reports as we identify discrete causes.

Changed in ironic:
importance: Medium → High
Revision history for this message
Arun S A G (sagarun) wrote :
Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

Arun, here "No valid hosts was found" is solely due to the fact that the job creates only one VM to pose as Ironic node. The node failed to provision by timeout, and nova scheduler tried to re-schedule the instance to another node, but there is only one of them.

In this particular gate run, the boot log from the VM posing as the Ironic node is quite truncated, so I would suspect a very slow node, such that the provisioning timed out before the bootstrap image could even start up properly.
http://logs.openstack.org/21/257521/1/check/gate-tempest-dsvm-ironic-agent_ssh-src/01360b0/logs/ironic-bm-logs/baremetalbrbm_0_console.txt.gz

Revision history for this message
Matt Riedemann (mriedem) wrote :

We haven't seen this in a while, so I'm marking it invalid; we can re-open it if it comes back.

Changed in ironic:
status: Triaged → Invalid