Nodes failed to reboot for HA env

Bug #1316761 reported by Tatyanka
This bug affects 2 people

Affects             Status        Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Released  High        Vladimir Sharshov
4.1.x               Fix Released  High        Vladimir Sharshov

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_0_swarm/job/master_fuelmain.system_test.centos.services/32/console

Test fail_deploy_ceilometer_ha_multirole-2014_05_06__13_33_29 failed with deployment status "error". The logs say that the nodes failed to reboot.
34cd745b-694e-4d9d-9cb1-2d7e938b7d0c (provision) status is set to error
2014-05-06 13:29:12.612 INFO [7f9152651700] (helpers) Task 34cd745b-694e-4d9d-9cb1-2d7e938b7d0c (provision) progress is set to 100
2014-05-06 13:29:12.612 INFO [7f9152651700] (helpers) Task 34cd745b-694e-4d9d-9cb1-2d7e938b7d0c (provision) message is set to Nodes failed to reboot: ["node-1", "node-2", "node-3", "node-4", "node-5"]

#############
In cobbler:

Tue May 6 13:29:05 2014 - ERROR | ### TASK FAILED ###
Tue May 6 13:29:05 2014 - ERROR | ### TASK FAILED ###
Tue May 6 13:29:05 2014 - INFO | Exception Info:
  File "/usr/lib/python2.6/site-packages/cobbler/remote.py", line 89, in run
    rc = self._run(self)
   File "/usr/lib/python2.6/site-packages/cobbler/remote.py", line 259, in runner
    self.remote.power_system(object_id,self.options.get("power",""),token,logger=self.logger)
   File "/usr/lib/python2.6/site-packages/cobbler/remote.py", line 1921, in power_system
    rc=self.api.reboot(obj, user=None, password=None, logger=logger)
   File "/usr/lib/python2.6/site-packages/cobbler/api.py", line 1031, in reboot
    self.power_off(system, user, password, logger=logger)
   File "/usr/lib/python2.6/site-packages/cobbler/api.py", line 1025, in power_off
    return action_power.PowerTool(self._config,system,self,user,password,logger=logger).power("off")
   File "/usr/lib/python2.6/site-packages/cobbler/action_power.py", line 123, in power
    utils.die(self.logger,"command failed (rc=%s), please validate the physical setup and cobbler config" % rc)
   File "/usr/lib/python2.6/site-packages/cobbler/utils.py", line 134, in die
    raise CX(msg)

Tags: system-tests
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: Tatyana (tatyana-leontovich) → Fuel Library Team (fuel-library)
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Could not reproduce using Friday's ISO. I will try to reproduce using ISO #6.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

#6 - no problem.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: Incomplete → New
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

We have strange data about node-2: at the same time it is both provisioning and deploying.

2014-05-12T02:51:27 debug: [397] Reboot task status: node: node-2 status: [1399863058.117772, "Power management (reboot)", "running", []]
2014-05-12T02:51:28 debug: [401] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>39, "status"=>"deploying", "role"=>"controller"}]}
 ....
2014-05-12T02:51:37 debug: [397] Reboot task status: node: node-2 status: [1399863058.117772, "Power management (reboot)", "failed", []]
2014-05-12T02:51:37 err: [397] Error occured while trying to reboot: node-2
2014-05-12T02:51:38 debug: [401] Got progress for nodes: [{"uid"=>"2", "progress"=>39}]
2014-05-12T02:51:38 debug: [401] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>39, "status"=>"deploying", "role"=>"controller"}]}

Also, the cobbler log does not contain any information about the original "PowerTool" error from this bug.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I think the problem is here. A worker lost its connection to RabbitMQ, RabbitMQ returned the task to the queue, and another worker picked it up. We got two identical tasks running in parallel:

2014-05-12T02:50:44 warning: [401] Trying to reconnect to message broker...
2014-05-12T02:50:45 debug: [397] Process message from worker queue:
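
To illustrate this failure mode: with AMQP manual acknowledgements, a message that has been delivered but not yet acked is returned to the queue when the consumer's connection closes, so another worker receives the same provisioning task. Below is a minimal Python sketch of that consumer pattern, assuming a pika-based client and an illustrative queue name; Astute itself is Ruby, so this is not the actual code.

import pika

# Illustrative names only; Astute's real queue and handler differ.
PARAMS = pika.ConnectionParameters(host="localhost", heartbeat=30)


def run_provisioning(body):
    # Placeholder for the long-running work (power-cycling nodes, etc.).
    print("provisioning:", body)


def handle_task(ch, method, properties, body):
    run_provisioning(body)
    # If the connection drops before this ack is sent, RabbitMQ requeues
    # the message and delivers it to another worker, so the same task
    # ends up running twice in parallel.
    ch.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(PARAMS)
channel = connection.channel()
channel.queue_declare(queue="provision_tasks", durable=True)
channel.basic_consume(queue="provision_tasks", on_message_callback=handle_task)
channel.start_consuming()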

Changed in fuel:
status: New → Triaged
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Assigned to Fuel-QA: needs reproducing.

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Fuel QA Team (fuel-qa)
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Egor Kotko (ykotko)
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Egor Kotko (ykotko) → Vladimir Sharshov (vsharshov)
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

Reproduced on #205 again.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

The most interesting details are here. Astute tried to report progress via RabbitMQ, but in the RabbitMQ log we have heartbeat info at 06:46:27 and then nothing until a new error message at 10:28:25; there are no messages between these two events. Also, as we can see, no report actually reached Nailgun.

Nailgun:
2014-05-15 10:33:27.873 INFO [7f6c59846700] (receiver) RPC method provision_resp received: {"status": "error", "progress": 100, "task_uuid": "0d094871-2b3a-4766-aa3b-64449f983756", "error": "Nodes failed to reboot: [\"node-1\", \"node-2\", \"node-3\", \"node-4\", \"node-5\", \"node-6\"]"}

Astute:
2014-05-15T06:38:18 debug: [397] Process message from worker queue: "[{\"args\": {\"task_uuid\": \"0d094871-2b3a-4766-aa3b-64449f983756\"...
2014-05-15T10:28:22 debug: [397] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"2", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"3", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"4", "progress"=>100, "status"=>"provisioning"}, {"uid"=>"5", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"6", "progress"=>100, "status"=>"provisioning"}]}

2014-05-15T10:28:25 warning: [397] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [412] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [417] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [403] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [420] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [408] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [399] Trying to reconnect to message broker...

2014-05-15T10:28:25 debug: [397] Process message from worker queue: "[{\"args\": {\"task_uuid\": \"0d094871-2b3a-4766-aa3b-64449f983756\",

RabbitMQ:

=INFO REPORT==== 15-May-2014::06:46:27 ===
closing STOMP connection <0.977.0> (172.17.42.1:50879 -> 172.17.0.3:61613)

=INFO REPORT==== 15-May-2014::06:46:27 ===
accepting STOMP connection <0.1188.0> (172.17.42.1:50881 -> 172.17.0.3:61613)

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.349.0> (172.17.42.1:35980 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.346.0> (172.17.42.1:35979 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.299.0> (172.17.42.1:35975 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.337.0> (172.17.42.1:35976 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.352.0> (172.17.42.1:35981 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.340.0> (172.17.42.1:35977 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.343.0> (172.17.42.1:35978 -> 172.17.0.3:5672):
{heartbeat_timeout,running}
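
For context on the {heartbeat_timeout,running} errors above: when an AMQP client negotiates a heartbeat interval but then blocks for hours inside a single task without servicing the connection, the broker stops seeing heartbeats and closes the connection from its side, which matches the long silent gap between 06:38 and 10:28. A minimal Python sketch of that failure mode, assuming a pika BlockingConnection with an illustrative heartbeat value (again, not Astute's Ruby code):

import time
import pika

# The 60-second heartbeat is an assumption for illustration.
PARAMS = pika.ConnectionParameters(host="localhost", heartbeat=60)


def handle(ch, method, properties, body):
    # Simulate hours of provisioning work that never returns control to
    # the connection's I/O loop, so no heartbeats reach the broker.
    time.sleep(600)
    # By the time we try to ack, the broker has already closed the
    # connection with {heartbeat_timeout,running}.
    ch.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(PARAMS)
channel = connection.channel()
channel.queue_declare(queue="provision_tasks", durable=True)
channel.basic_consume(queue="provision_tasks", on_message_callback=handle)
channel.start_consuming()  # eventually raises a connection error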

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

At the moment I think the problem is in the system tests. @alekseyk-ru, do you hit this problem when you deploy a cluster manually?

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

@vsharshov, I deployed the clusters using the UI.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

The initial reason for this failure is connection drops inside the master node. We will address this particular issue (the ability to handle broken connections) in 5.1, and Vladimir S. will create a separate bug regarding the connection drops.

Changed in fuel:
milestone: 5.0 → 5.1
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Separate bug filed for the RabbitMQ problem: https://bugs.launchpad.net/fuel/+bug/1321276

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Tatyana, regarding https://bugs.launchpad.net/fuel/+bug/1316761/comments/9 with the failed power reset, this issue looks like the provisioning process broke. The nodes were rebooted from bootstrap, but there are no logs of them starting the install.
PowerTool in Cobbler calls fence_ssh, which reports a useless error when it can't connect.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Matt, in Tatyana's logs it is the same problem with RabbitMQ: one of the workers that was already performing the provision operation lost its connection. After that, the task was run again (it is just a coincidence that the same worker took the task again).

2014-05-15T10:28:22 debug: [397] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"2", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"3", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"4", "progress"=>100, "status"=>"provisioning"}, {"uid"=>"5", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"6", "progress"=>100, "status"=>"provisioning"}]}
2014-05-15T10:28:25 warning: [397] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [412] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [417] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [403] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [420] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [408] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [399] Trying to reconnect to message broker...
2014-05-15T10:28:25 debug: [397] Process message from worker queue: "[{\"args\": {\"task_uuid\": \"0d094871-2b3a-4766-aa3b-64449f983756\", \"provisioning_info\"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94849

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/94849
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=55f9687e635ea3247aff23cae82f6b44b371cc3d
Submitter: Jenkins
Branch: master

commit 55f9687e635ea3247aff23cae82f6b44b371cc3d
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 22 16:26:52 2014 +0400

    Tolerant behavior in case of RabbitMQ connection problem

    - do not requeue task if we have problem with RabbitMQ;
    - prevent throttle if only one worker exist;
    - more correct message about publishing report to Nailgun.

    Notice: after this change Nailgun will not fail task if
    Astute perform unexpected exit because we do not requeue
    task. In this case we will again see dangling deploy
    instead of incorrect error.

    Special thanks to Evgeniy L for help.

    Change-Id: Ic2e771a6133eb0f95e478417f30c1398647c2597
    Closes-Bug: #1316761
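
A rough Python sketch of the "do not requeue" idea from this commit, assuming a pika consumer (the actual fix lives in fuel-astute's Ruby code): acknowledge the task up front so a later broker problem cannot push the same task back onto the queue. The trade-off, as the commit notice says, is a dangling deploy instead of a duplicated, incorrectly failed one if the worker dies.

import pika

PARAMS = pika.ConnectionParameters(host="localhost", heartbeat=30)


def run_provisioning(body):
    # Placeholder for the long-running provisioning run, which also
    # publishes progress reports and can therefore hit AMQP errors.
    print("provisioning:", body)


def handle_task(ch, method, properties, body):
    # Ack immediately: the broker can no longer redeliver this task to
    # another worker if our connection drops later on.
    ch.basic_ack(delivery_tag=method.delivery_tag)
    try:
        run_provisioning(body)
    except pika.exceptions.AMQPError:
        # Broker trouble mid-task: log and reconnect, but never requeue,
        # which is what produced two parallel copies of the same task.
        pass


connection = pika.BlockingConnection(PARAMS)
channel = connection.channel()
channel.queue_declare(queue="provision_tasks", durable=True)
channel.basic_consume(queue="provision_tasks", on_message_callback=handle_task)
channel.start_consuming()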

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/98332

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/4.1)

Reviewed: https://review.openstack.org/98332
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=bdc747cd70721874d1e8af36181fd4179cce4fce
Submitter: Jenkins
Branch: stable/4.1

commit bdc747cd70721874d1e8af36181fd4179cce4fce
Author: Vladimir Sharshov <email address hidden>
Date: Fri Jun 6 12:44:41 2014 +0400

    Tolerant behavior in case of RabbitMQ connection problem

    - do not requeue task if we have problem with RabbitMQ;
    - prevent throttle if only one worker exist.

    Notice: after this change Nailgun will not fail task if
    Astute perform unexpected exit because we do not requeue
    task. In this case we will again see dangling deploy
    instead of incorrect error.

    Special thanks to Evgeniy L for help.
    Backport from 5.0

    Change-Id: Ic2e771a6133eb0f95e478417f30c1398647c2597
    Closes-Bug: #1316761

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/106898

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/106898
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=53136d8df65bebcea8cead7310eb85021612b8d3
Submitter: Jenkins
Branch: stable/5.0

commit 53136d8df65bebcea8cead7310eb85021612b8d3
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 22 16:26:52 2014 +0400

    Tolerant behavior in case of RabbitMQ connection problem

    - do not requeue task if we have problem with RabbitMQ;
    - prevent throttle if only one worker exist;
    - more correct message about publishing report to Nailgun.

    Notice: after this change Nailgun will not fail task if
    Astute perform unexpected exit because we do not requeue
    task. In this case we will again see dangling deploy
    instead of incorrect error.

    Special thanks to Evgeniy L for help.

    Change-Id: Ic2e771a6133eb0f95e478417f30c1398647c2597
    Closes-Bug: #1316761
    (cherry picked from commit 55f9687e635ea3247aff23cae82f6b44b371cc3d)

Changed in fuel:
status: Fix Committed → Fix Released