[systests] Nodes fail to enter provisioning state after snapshot revert

Bug #1260211 reported by Vladimir Kuklin
This bug affects 2 people
Affects             Status     Importance  Assigned to      Milestone
Fuel for OpenStack  Confirmed  Low         Vladimir Kuklin

Bug Description

Murano deployment failed because Puppet did not even start. Investigation shows that the provisioning phase never started: all nodes remained in bootstrap mode. There are two log archives:
the first one says that the nodes failed to reboot after the provisioning phase, even though they had not even been rebooted from bootstrap into the system installer;
the second one shows that the provisioning phase itself was abandoned by Nailgun.

Vladimir Kuklin (vkuklin) wrote :

2013-12-12T02:03:20 err: [8923] Error occured while trying to reboot: node-1
2013-12-12T02:03:20 debug: [8923] Reboot task status: node: node-2 status: [1386805749.712502, "Power management (reboot)", "failed", []]
2013-12-12T02:03:20 err: [8923] Error occured while trying to reboot: node-2
2013-12-12T02:03:20 debug: [8923] Reboot task status: node: node-3 status: [1386805749.712502, "Power management (reboot)", "failed", []]
2013-12-12T02:03:20 err: [8923] Error occured while trying to reboot: node-3
2013-12-12T02:03:20 debug: [8923] Reboot task status: node: node-4 status: [1386813795.484784, "Power management (reboot)", "running", []]
2013-12-12T02:03:20 debug: [8923] Reboot task status: node: node-5 status: [1386813795.484784, "Power management (reboot)", "running", []]
2013-12-12T02:03:25 debug: [8923] Reboot task status: node: node-4 status: [1386813795.484784, "Power management (reboot)", "complete", []]
2013-12-12T02:03:25 debug: [8923] Successfully rebooted: node-4
2013-12-12T02:03:25 debug: [8923] Reboot task status: node: node-5 status: [1386813795.484784, "Power management (reboot)", "complete", []]
2013-12-12T02:03:25 debug: [8923] Successfully rebooted: node-5
2013-12-12T02:03:31 err: [8923] Nodes failed to reboot: ["node-1", "node-2", "node-3"]

Vladimir Kuklin (vkuklin) wrote :
summary: - Murano deployment failed: puppet did not even start
+ Nodes fail to enter provisioning state
description: updated
Evgeniy L (rustyrobot)
Changed in fuel:
assignee: nobody → Evgeniy L (rustyrobot)
status: New → Confirmed
Evgeniy L (rustyrobot) wrote : Re: Nodes fail to enter provisioning state

Second log archive, "fail_deploy_murano_simple-2013_12_12__02_24_48.tar.gz":

You have offline nodes for some reason:
2013-12-12 02:23:08.098 INFO [7fc4e220b700] (notifier) Notification: topic: error message: Node 'Untitled (B3:1C)' has gone away
2013-12-12 02:23:08.110 INFO [7fc4e220b700] (notifier) Notification: topic: error message: Node 'Untitled (8F:A8)' has gone away
2013-12-12 02:23:08.131 INFO [7fc4e220b700] (notifier) Notification: topic: error message: Node 'Untitled (CD:7E)' has gone away
2013-12-12 02:23:08.142 INFO [7fc4e220b700] (notifier) Notification: topic: error message: Node 'Untitled (7F:F4)' has gone away
2013-12-12 02:23:08.156 INFO [7fc4e220b700] (notifier) Notification: topic: error message: Node 'Untitled (E1:62)' has gone away

They didn't have time to come back online, so you needed to retry the deployment after ~30 seconds.
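
A minimal sketch of the "wait, then retry" idea, not how the system tests actually do it. `get_offline_nodes` is a hypothetical callable standing in for whatever query returns the nodes still reported as offline, and the timeout and interval values are only assumptions based on the ~30 seconds mentioned above.

```python
import time


def wait_until_online(get_offline_nodes, timeout=60.0, interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until no node is offline or the timeout expires.

    get_offline_nodes is a hypothetical callable returning a list of
    node names that are still offline. Returns True if all nodes came
    back in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        offline = get_offline_nodes()
        if not offline:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test could call this right after the snapshot revert and start the deployment only when it returns True.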

Evgeniy L (rustyrobot) wrote :

I've looked at the first case, "fail_deploy_murano_simple-2013_12_12__02_06_31.tar.gz".

It looks like we have a problem with the Cobbler RPC:

Thu Dec 12 02:03:16 2013 - ERROR | ### TASK FAILED ###
Thu Dec 12 02:03:16 2013 - INFO | Exception occured: <class 'cobbler.cexceptions.CX'>
Thu Dec 12 02:03:16 2013 - INFO | Exception value: 'invalid token: Z3/b4/lc3QSoI/34Whdcu9cYz+rHRn3bsw=='

I saw this error three times in the logs, which equals the number of failed nodes.
It may be due to this commit: https://github.com/stackforge/fuel-astute/commit/3055cb18099db0037a37edaa45850cfdeecec03e

Vladimir S, what do you think about this?
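
For anyone who wants to repeat the comparison, here is a rough sketch of how the two logs could be cross-checked. The sample strings are only the excerpts quoted in this report (so the Cobbler excerpt contains a single error, while the full log reportedly contains three), and the helper names are made up for illustration.

```python
import json
import re

# Excerpts copied from this report; the full archives contain more lines.
COBBLER_LOG = """\
Thu Dec 12 02:03:16 2013 - ERROR | ### TASK FAILED ###
Thu Dec 12 02:03:16 2013 - INFO | Exception occured: <class 'cobbler.cexceptions.CX'>
Thu Dec 12 02:03:16 2013 - INFO | Exception value: 'invalid token: Z3/b4/lc3QSoI/34Whdcu9cYz+rHRn3bsw=='
"""
ASTUTE_LOG = """\
2013-12-12T02:03:31 err: [8923] Nodes failed to reboot: ["node-1", "node-2", "node-3"]
"""


def count_invalid_token_errors(cobbler_log):
    """Count 'invalid token' exception lines in a Cobbler log."""
    return len(re.findall(r"Exception value: 'invalid token:", cobbler_log))


def failed_nodes(astute_log):
    """Return the node names from the 'Nodes failed to reboot' line."""
    match = re.search(r"Nodes failed to reboot: (\[.*\])", astute_log)
    return json.loads(match.group(1)) if match else []


print(count_invalid_token_errors(COBBLER_LOG), failed_nodes(ASTUTE_LOG))
```

If the count over the full Cobbler log matches `len(failed_nodes(...))`, that supports the idea that each failed reboot corresponds to one rejected token.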

Changed in fuel:
assignee: Evgeniy L (rustyrobot) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

As I can see in the logs, the system was set up and sent the task at 23:49, but the answer from the nodes was received more than 2 hours later (02:03).

Wed Dec 11 23:49:09 2013 - INFO | REMOTE start_task(Power management (reboot)); event_id(2013-12-11_234909_power); user(?)
Thu Dec 12 02:03:15 2013 - INFO | authenticate; ['cobbler', True]
Thu Dec 12 02:03:15 2013 - DEBUG | REMOTE expiring token; user(<DIRECT>)

The reboot tasks added at 02:03 succeeded, but the tasks from 23:49 failed for an understandable reason (the token had expired).

I think the problem is not in Astute (at least not entirely), but in the snapshot system, which takes a snapshot at an unexpected time in the middle of Astute execution, when some nodes have already been sent to reboot and some have not.
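
To make the gap concrete, here is a small sketch that computes it from the two Cobbler log lines above and decodes the epoch timestamps from the Astute reboot task statuses. It assumes the log clocks are in UTC, which the matching values suggest but the report does not state; the function names are illustrative only.

```python
from datetime import datetime, timezone

LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_log_time(line):
    """Parse the timestamp that precedes ' - ' in a Cobbler log line."""
    return datetime.strptime(line.split(" - ", 1)[0], LOG_TIME_FORMAT)


def epoch_to_utc(timestamp):
    """Convert an epoch timestamp from a task status to a UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


task_sent = parse_log_time(
    "Wed Dec 11 23:49:09 2013 - INFO | REMOTE start_task(Power management (reboot)); "
    "event_id(2013-12-11_234909_power); user(?)"
)
authenticated = parse_log_time(
    "Thu Dec 12 02:03:15 2013 - INFO | authenticate; ['cobbler', True]"
)
gap = authenticated - task_sent
print(gap)  # 2:14:06

# Timestamps taken from the "Reboot task status" lines in the Astute log.
print(epoch_to_utc(1386805749.712502).strftime("%H:%M:%S"))  # failed tasks: 23:49:09
print(epoch_to_utc(1386813795.484784).strftime("%H:%M:%S"))  # completed tasks: 02:03:15
```

The failed tasks carry the pre-snapshot timestamp and the completed ones carry the post-revert timestamp, which is consistent with the token having expired in between.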

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Tatyana (tatyana-leontovich)
Evgeniy L (rustyrobot) wrote :

So, do you have any idea how we can fix it?
I heard that we can hack Cobbler to increase the token expiration timeout (but that may cause security problems).
Also, why didn't this fix help: https://github.com/stackforge/fuel-astute/commit/3055cb18099db0037a37edaa45850cfdeecec03e ?
Is it because the snapshot was made after the token was requested but before the action was synced?

Should we try to fix it in the current release (4.0)?

Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Tatyana (tatyana-leontovich) → Vladimir Kuklin (vkuklin)
summary: - Nodes fail to enter provisioning state
+ [systests] Nodes fail to enter provisioning state after snapshot
+ provision
summary: - [systests] Nodes fail to enter provisioning state after snapshot
- provision
+ [systests] Nodes fail to enter provisioning state after snapshot revert
Mike Scherbakov (mihgen) wrote :

After a conversation with Igor, moved to 4.1.

Changed in fuel:
milestone: 4.0 → 4.1
Changed in fuel:
importance: Critical → Medium
importance: Medium → Low
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: 4.1 → 5.0
Matthew Mosesohn (raytrac3r) wrote :

Is this a duplicate of the fixed bug https://bugs.launchpad.net/fuel/+bug/1256006 ?

The correct action is for Astute to fetch a new Cobbler auth token if the current session is invalid.
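
A rough sketch of that retry logic, written in Python for illustration only (Astute itself is Ruby, and this is not its actual code). `login` and `action` are hypothetical callables, and matching on the text "invalid token" in the fault string is an assumption based on the error message quoted earlier in this report.

```python
import xmlrpc.client


def call_with_token_refresh(login, action, token=None):
    """Run action(token); on an 'invalid token' fault, log in again and retry once.

    login  -- hypothetical callable returning a fresh token string
    action -- hypothetical callable that performs the RPC with a token
    Returns (result, token) so the caller can reuse the valid token.
    """
    if token is None:
        token = login()
    try:
        return action(token), token
    except xmlrpc.client.Fault as fault:
        # Assumption: an expired session surfaces as a fault whose message
        # contains "invalid token", as in the Cobbler log above.
        if "invalid token" not in fault.faultString:
            raise
        token = login()
        return action(token), token
```

With a real server, `login` would wrap the Cobbler XML-RPC login call and `action` whichever power management call is being issued; a second failure after the fresh login is deliberately not retried and propagates to the caller.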
