Nodes failed to reboot for HA env

Bug #1316761 reported by Tatyanka on 2014-05-06
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
High
Vladimir Sharshov
4.1.x
High
Vladimir Sharshov

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_0_swarm/job/master_fuelmain.system_test.centos.services/32/console

Test fail_deploy_ceilometer_ha_multirole-2014_05_06__13_33_29 failed with deployment status error. Logs say that nodes failed to reboot.
34cd745b-694e-4d9d-9cb1-2d7e938b7d0c (provision) status is set to error
2014-05-06 13:29:12.612 INFO [7f9152651700] (helpers) Task 34cd745b-694e-4d9d-9cb1-2d7e938b7d0c (provision) progress is set to 100
2014-05-06 13:29:12.612 INFO [7f9152651700] (helpers) Task 34cd745b-694e-4d9d-9cb1-2d7e938b7d0c (provision) message is set to Nodes failed to reboot: ["node-1", "node-2", "node-3", "node-4", "node-5"]

#############
In cobbler:

Tue May 6 13:29:05 2014 - ERROR | ### TASK FAILED ###
Tue May 6 13:29:05 2014 - ERROR | ### TASK FAILED ###
Tue May 6 13:29:05 2014 - INFO | Exception Info:
  File "/usr/lib/python2.6/site-packages/cobbler/remote.py", line 89, in run
    rc = self._run(self)
   File "/usr/lib/python2.6/site-packages/cobbler/remote.py", line 259, in runner
    self.remote.power_system(object_id,self.options.get("power",""),token,logger=self.logger)
   File "/usr/lib/python2.6/site-packages/cobbler/remote.py", line 1921, in power_system
    rc=self.api.reboot(obj, user=None, password=None, logger=logger)
   File "/usr/lib/python2.6/site-packages/cobbler/api.py", line 1031, in reboot
    self.power_off(system, user, password, logger=logger)
   File "/usr/lib/python2.6/site-packages/cobbler/api.py", line 1025, in power_off
    return action_power.PowerTool(self._config,system,self,user,password,logger=logger).power("off")
   File "/usr/lib/python2.6/site-packages/cobbler/action_power.py", line 123, in power
    utils.die(self.logger,"command failed (rc=%s), please validate the physical setup and cobbler config" % rc)
   File "/usr/lib/python2.6/site-packages/cobbler/utils.py", line 134, in die
    raise CX(msg)

Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: Tatyana (tatyana-leontovich) → Fuel Library Team (fuel-library)
Mike Scherbakov (mihgen) on 2014-05-07
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Sharshov (vsharshov)
Vladimir Sharshov (vsharshov) wrote :

Could not reproduce using friday ISO. I will try to reproduce using iso #6.

Changed in fuel:
status: New → Incomplete
Vladimir Sharshov (vsharshov) wrote :

#6 - no problem.

Vladimir Sharshov (vsharshov) wrote :

We have strange data about node-2. At same time it provisioning and deploying.

2014-05-12T02:51:27 debug: [397] Reboot task status: node: node-2 status: [1399863058.117772, "Power management (reboot)", "running", []]
2014-05-12T02:51:28 debug: [401] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>39, "status"=>"deploying", "role"=>"controller"}]}
 ....
2014-05-12T02:51:37 debug: [397] Reboot task status: node: node-2 status: [1399863058.117772, "Power management (reboot)", "failed", []]
2014-05-12T02:51:37 err: [397] Error occured while trying to reboot: node-2
2014-05-12T02:51:38 debug: [401] Got progress for nodes: [{"uid"=>"2", "progress"=>39}]
2014-05-12T02:51:38 debug: [401] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"2", "progress"=>39, "status"=>"deploying", "role"=>"controller"}]}

Also cobbler log do not contain any information about original error of this bug about "PowerTool" error.

Vladimir Sharshov (vsharshov) wrote :

I think problem here. Worker lost connect to RabbitMQ. RabbitMQ return task to queue and another worker run it. We get 2 same parallel task:

2014-05-12T02:50:44 warning: [401] Trying to reconnect to message broker...
2014-05-12T02:50:45 debug: [397] Process message from worker queue:

Changed in fuel:
status: New → Triaged
Mike Scherbakov (mihgen) wrote :

Assigned to Fuel-QA: needs reproducing.

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Fuel QA Team (fuel-qa)
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Egor Kotko (ykotko)
Tatyanka (tatyana-leontovich) wrote :
Mike Scherbakov (mihgen) on 2014-05-15
Changed in fuel:
assignee: Egor Kotko (ykotko) → Vladimir Sharshov (vsharshov)
Aleksey Kasatkin (alekseyk-ru) wrote :

On #205 again.

Vladimir Sharshov (vsharshov) wrote :

Most interesting details here. Astute try to report about a progress using RabbitMQ, but in RMQ log we have info about heartbeats on 06:46:27 and new message about error on 10:28:25. No messages between this two events. Also as we can see no report really was received to Nailgun.

Nailgun:
2014-05-15 10:33:27.873 INFO [7f6c59846700] (receiver) RPC method provision_resp received: {"status": "error", "progress": 100, "task_uuid": "0d094871-2b3a-4766-aa3b-64449f983756", "error": "Nodes failed to reboot: [\"node-1\", \"node-2\", \"node-3\", \"node-4\", \"node-5\", \"node-6\"]"}

Astute:
2014-05-15T06:38:18 debug: [397] Process message from worker queue: "[{\"args\": {\"task_uuid\": \"0d094871-2b3a-4766-aa3b-64449f983756\"...
2014-05-15T10:28:22 debug: [397] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"2", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"3", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"4", "progress"=>100, "status"=>"provisioning"}, {"uid"=>"5", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"6", "progress"=>100, "status"=>"provisioning"}]}

2014-05-15T10:28:25 warning: [397] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [412] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [417] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [403] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [420] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [408] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [399] Trying to reconnect to message broker...

2014-05-15T10:28:25 debug: [397] Process message from worker queue: "[{\"args\": {\"task_uuid\": \"0d094871-2b3a-4766-aa3b-64449f983756\",

RabbitMQ:

=INFO REPORT==== 15-May-2014::06:46:27 ===
closing STOMP connection <0.977.0> (172.17.42.1:50879 -> 172.17.0.3:61613)

=INFO REPORT==== 15-May-2014::06:46:27 ===
accepting STOMP connection <0.1188.0> (172.17.42.1:50881 -> 172.17.0.3:61613)

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.349.0> (172.17.42.1:35980 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.346.0> (172.17.42.1:35979 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.299.0> (172.17.42.1:35975 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.337.0> (172.17.42.1:35976 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.352.0> (172.17.42.1:35981 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.340.0> (172.17.42.1:35977 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 15-May-2014::10:28:25 ===
closing AMQP connection <0.343.0> (172.17.42.1:35978 -> 172.17.0.3:5672):
{heartbeat_timeout,running}

Vladimir Sharshov (vsharshov) wrote :

At now moment i think that problem in system tests. @alekseyk-ru, does this problem you have then deploy cluster manually?

Aleksey Kasatkin (alekseyk-ru) wrote :

@vsharshov, I deployed the clusters using UI.

Mike Scherbakov (mihgen) wrote :

Initial reason for this failure is connection drops inside of master node. We will address this particular issue (ability to handle broken connections) in 5.1, and Vladimir S. will create a separate bug regarding connections drops.

Changed in fuel:
milestone: 5.0 → 5.1
Vladimir Sharshov (vsharshov) wrote :

Separate problem with RabbitMQ to this bug: https://bugs.launchpad.net/fuel/+bug/1321276

Matthew Mosesohn (raytrac3r) wrote :

Tatyana, regarding https://bugs.launchpad.net/fuel/+bug/1316761/comments/9 with failed power reset, this issue looks like the provisioning process broke. The nodes were rebooted from bootstrap but then have no logs of them starting install.
PowerTool in Cobbler calls fence_ssh, which reports a useless error when it can't connect.

Vladimir Sharshov (vsharshov) wrote :

Matt, in Tatyana logs same problem with RabbitMQ . When one of worker which already perform provision operation miss connection. After it try to run task again (just a coincidence that the current worker take this task again).

2014-05-15T10:28:22 debug: [397] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"1", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"2", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"3", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"4", "progress"=>100, "status"=>"provisioning"}, {"uid"=>"5", "progress"=>100, "status"=>"provisioned"}, {"uid"=>"6", "progress"=>100, "status"=>"provisioning"}]}
2014-05-15T10:28:25 warning: [397] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [412] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [417] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [403] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [420] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [408] Trying to reconnect to message broker...
2014-05-15T10:28:25 warning: [399] Trying to reconnect to message broker...
2014-05-15T10:28:25 debug: [397] Process message from worker queue: "[{\"args\": {\"task_uuid\": \"0d094871-2b3a-4766-aa3b-64449f983756\", \"provisioning_info\"

Fix proposed to branch: master
Review: https://review.openstack.org/94849

Changed in fuel:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/94849
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=55f9687e635ea3247aff23cae82f6b44b371cc3d
Submitter: Jenkins
Branch: master

commit 55f9687e635ea3247aff23cae82f6b44b371cc3d
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 22 16:26:52 2014 +0400

    Tolerant behavior in case of RabbitMQ connection problem

    - do not requeue task if we have problem with RabbitMQ;
    - prevent throttle if only one worker exist;
    - more correct message about publishing report to Nailgun.

    Notice: after this change Nailgun will not fail task if
    Astute perform unexpected exit because we do not requeue
    task. In this case we will again see dangling deploy
    instead of incorrect error.

    Special thanks to Evgeniy L for help.

    Change-Id: Ic2e771a6133eb0f95e478417f30c1398647c2597
    Closes-Bug: #1316761

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/98332
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=bdc747cd70721874d1e8af36181fd4179cce4fce
Submitter: Jenkins
Branch: stable/4.1

commit bdc747cd70721874d1e8af36181fd4179cce4fce
Author: Vladimir Sharshov <email address hidden>
Date: Fri Jun 6 12:44:41 2014 +0400

    Tolerant behavior in case of RabbitMQ connection problem

    - do not requeue task if we have problem with RabbitMQ;
    - prevent throttle if only one worker exist.

    Notice: after this change Nailgun will not fail task if
    Astute perform unexpected exit because we do not requeue
    task. In this case we will again see dangling deploy
    instead of incorrect error.

    Special thanks to Evgeniy L for help.
    Backport from 5.0

    Change-Id: Ic2e771a6133eb0f95e478417f30c1398647c2597
    Closes-Bug: #1316761

Reviewed: https://review.openstack.org/106898
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=53136d8df65bebcea8cead7310eb85021612b8d3
Submitter: Jenkins
Branch: stable/5.0

commit 53136d8df65bebcea8cead7310eb85021612b8d3
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 22 16:26:52 2014 +0400

    Tolerant behavior in case of RabbitMQ connection problem

    - do not requeue task if we have problem with RabbitMQ;
    - prevent throttle if only one worker exist;
    - more correct message about publishing report to Nailgun.

    Notice: after this change Nailgun will not fail task if
    Astute perform unexpected exit because we do not requeue
    task. In this case we will again see dangling deploy
    instead of incorrect error.

    Special thanks to Evgeniy L for help.

    Change-Id: Ic2e771a6133eb0f95e478417f30c1398647c2597
    Closes-Bug: #1316761
    (cherry picked from commit 55f9687e635ea3247aff23cae82f6b44b371cc3d)

Changed in fuel:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers