Simultaneously adding and removing nodes fails on pre-deployment hook 'yum clean all'

Bug #1446202 reported by Andrey Sledzinskiy
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Dima Shulyak
Milestone:

Bug Description

{"build_id": "2015-04-17_15-24-00",
"ostf_sha": "4bab9b975ace8d9a305d6e0f112b734de587f847",
"build_number": "321",
"release_versions": {
"2014.2-6.1": {
"VERSION": {
"build_id": "2015-04-17_15-24-00",
"ostf_sha": "4bab9b975ace8d9a305d6e0f112b734de587f847",
"build_number": "321",
"api": "1.0",
"nailgun_sha": "939e5780cd0f7b1af3afd2926eda30f81bfc3e3f",
"openstack_version": "2014.2-6.1",
"production": "docker",
"python-fuelclient_sha": "0698062e9b044becf07bf9918fa16613aa3d93ad",
"astute_sha": "bf1751a4fe0d912325e3b4af629126a59c1b2b51",
"feature_groups":
["mirantis"],
"release": "6.1",
"fuelmain_sha": "5981d230e9484c196022a027c5c1600e36b17a72",
"fuellib_sha": "65617981bef34ea96b85d2d389cc037c304516e5"}}},
"auth_required": true,
"api": "1.0",
"nailgun_sha": "939e5780cd0f7b1af3afd2926eda30f81bfc3e3f",
"openstack_version": "2014.2-6.1",
"production": "docker",
"python-fuelclient_sha": "0698062e9b044becf07bf9918fa16613aa3d93ad",
"astute_sha": "bf1751a4fe0d912325e3b4af629126a59c1b2b51",
"feature_groups":
["mirantis"],
"release": "6.1",
"fuelmain_sha": "5981d230e9484c196022a027c5c1600e36b17a72",
"fuellib_sha": "65617981bef34ea96b85d2d389cc037c304516e5"}

Steps:
1. Create the following cluster: CentOS, HA, flat nova-network, 3 controllers, 2 computes
2. Deploy the cluster
3. After deployment, delete 1 controller+cinder node and add 1 new controller+cinder node
4. Start re-deployment

Actual result: re-deployment failed on the pre-deployment hook:
2015-04-20 04:22:10 ERR

[734] Error running RPC method granular_deploy: Failed to execute hook .
---
priority: 800
type: shell
uids:
- '1'
- '2'
- '5'
- '4'
- '7'
- '6'
parameters:
  retries: 3
  cmd: yum clean all
  cwd: "/"
  timeout: 180
  interval: 1
, trace:
["/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/nailgun_hooks.rb:54:in `block in process'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/nailgun_hooks.rb:26:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/nailgun_hooks.rb:26:in `process'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/deployment_engine/granular_deployment.rb:203:in `pre_deployment_actions'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/deployment_engine.rb:32:in `deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/orchestrator.rb:133:in `deploy_cluster'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/orchestrator.rb:56:in `granular_deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/dispatcher.rb:111:in `granular_deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/server.rb:142:in `dispatch_message'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/server.rb:103:in `block in dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/task_queue.rb:64:in `call'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/task_queue.rb:64:in `block in each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/server.rb:101:in `each_with_index'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/server.rb:101:in `dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.0.0/lib/astute/server/server.rb:85:in `block in perform_main_job'"]

Logs are attached
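
For reference, the failing task above is a plain shell hook driven by the retries/interval/timeout parameters shown in the error. A minimal local sketch of how such parameters could drive execution (the run_shell_hook helper is illustrative only; Astute actually dispatches the command over MCollective to each target node):

import subprocess
import time

def run_shell_hook(cmd, cwd="/", timeout=180, retries=3, interval=1):
    """Run `cmd`, retrying up to `retries` extra times on a non-zero exit."""
    for attempt in range(retries + 1):
        try:
            result = subprocess.run(
                cmd, shell=True, cwd=cwd, timeout=timeout,
                capture_output=True, text=True,
            )
        except subprocess.TimeoutExpired:
            result = None
        if result is not None and result.returncode == 0:
            return result
        time.sleep(interval)
    raise RuntimeError("Failed to execute hook: %s" % cmd)

On a bootstrap node yum does not exist, so the shell returns exit code 127 ("command not found") and every retry fails, as seen later in this report.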

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Dmitry Ilyin (idv1985)
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Oleksiy Molchanov (omolchanov)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

It seems to be related to orchestration.

2015-04-20T04:22:06 warning: [734] Problem while performing cmd. Try to repeat: 0 attempt
2015-04-20T04:22:07 debug: [734] 5b56cd90-0969-45aa-8544-1a5a67bc5934: MC agent 'execute_shell_command', method 'execute', results: {:sender=>"7", :statuscode=>0, :statusmsg=>"OK", :data=>{:stdout=>"", :ex
it_code=>127, :stderr=>"sh: yum: command not found\n"}}

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel Python Team (fuel-python)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Sebastian Kalinowski (prmtl)
Revision history for this message
Sebastian Kalinowski (prmtl) wrote :

I haven't reproduced it so far; I'll try again later.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Sebastian Kalinowski (prmtl) → Fuel Python Team (fuel-python)
Revision history for this message
Evgeniy L (rustyrobot) wrote :

>> sh: yum: command not found

This probably means the nodes were not provisioned and were rebooted afterwards. In Nailgun we can see that the node was not sent for provisioning.

Andrey, was it reproduced with system tests or manually?
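
Evgeniy's reading of the exit code 127 can be illustrated with a small sketch; the Node dataclass and has_operating_system helper below are hypothetical simplifications, not Nailgun's actual model:

from dataclasses import dataclass

@dataclass
class Node:
    uid: str
    status: str   # e.g. 'discover' (bootstrap image), 'provisioned', 'ready'

def has_operating_system(node):
    """Bootstrap nodes ('discover') are running the ramdisk image and have no yum."""
    return node.status in ("provisioned", "ready", "error")

nodes = [Node("1", "ready"), Node("7", "discover")]
runnable = [n for n in nodes if has_operating_system(n)]
assert [n.uid for n in runnable] == ["1"]  # node 7 would hit 'yum: command not found'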

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

+1. This means that we ran these tasks on 'bootstrap' nodes. It also looks like this bug with controllers is connected to https://bugs.launchpad.net/fuel/+bug/1394188

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

After further observation, I think this bug is most similar to https://bugs.launchpad.net/fuel/+bug/1450051

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
Revision history for this message
Sebastian Kalinowski (prmtl) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/181424

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Guys, it looks like the problem is elsewhere. I've checked the logic, and the original cause appears to be this:

2015-04-20 04:17:14.036 INFO [7f2b1f8ae700] (receiver) RPC method remove_nodes_resp received: {"status": "ready", "progress": 100, "task_uuid": "5c7cfc4a-91ae-449b-8fb4-7745b2998839", "nodes": [{"uid": "7"}, {"uid": "3"}]}
2015-04-20 04:17:15.119 ERROR [7f2b1f8ae700] (receiverd) Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/nailgun/rpc/receiverd.py", line 52, in consume_msg
    callback(**body["args"])
  File "/usr/lib/python2.6/site-packages/nailgun/rpc/receiver.py", line 90, in remove_nodes_resp
    objects.NodeCollection.lock_for_update(locked_nodes).all()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2300, in all
    return list(self)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2412, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2427, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute
    return meth(self, multiparams, params)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
    context)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception
    exc_info
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
    context)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute
    cursor.execute(statement, parameters)
DBAPIError: (TransactionRollbackError) deadlock detected
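
The lock_for_update call in this traceback takes row-level locks, and two transactions locking the same rows in different orders is the classic way to hit "deadlock detected". A minimal SQLAlchemy sketch of the usual mitigation, locking rows in a fixed id order (the Node model below is a stand-in, not Nailgun's real one):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Node(Base):          # stand-in for Nailgun's node model
    __tablename__ = "nodes"
    id = Column(Integer, primary_key=True)
    status = Column(String)

def lock_nodes_for_update(session, node_ids):
    """SELECT ... FOR UPDATE on the given nodes, always ordered by id."""
    return (session.query(Node)
            .filter(Node.id.in_(node_ids))
            .order_by(Node.id)          # fixed ordering prevents lock cycles
            .with_for_update()
            .all())

engine = create_engine("sqlite://")     # SQLite ignores FOR UPDATE; demo only
Base.metadata.create_all(engine)
with Session(engine) as session:
    locked = lock_nodes_for_update(session, [7, 3])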

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Alexander Kislitsky (akislitsky)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Vladimir Sharshov (<email address hidden>) on branch: master
Review: https://review.openstack.org/181424
Reason: I think I was trying to find the problem in the wrong place.
https://bugs.launchpad.net/fuel/+bug/1446202/comments/9

Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

The cause of the error is not the deadlock, but the wrong node list passed to the deployment task.

Changed in fuel:
assignee: Alexander Kislitsky (akislitsky) → Vladimir Sharshov (vsharshov)
Dmitry Pyzhov (dpyzhov)
tags: added: tricky
tags: added: module-astute module-tasks
removed: granular
tags: added: module-nailgun
removed: module-astute
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This fix https://review.openstack.org/#/c/183081/ can partially address one of the consequences when a removed node is still present in the list. But if the node has already come back online, we still have the problem of a wrong node list.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Update: this fix should handle the consequences in both cases if the problem node is not critical. In our case the problem node is a Cinder node with fail_if_error -> false.

However, we still have not found the original source of the problem. Also, this related fix will not help in the case of a critical role: mongo, primary-controller, and so on.
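
To illustrate the critical vs. non-critical distinction mentioned here, a small hypothetical sketch of how a fail_if_error-style flag could decide whether a single node's failure aborts the whole deployment (the function and the role set are assumptions, not Astute's code):

CRITICAL_ROLES = {"primary-controller", "mongo"}   # assumed set, per the comment above

def handle_node_failure(node_uid, roles, fail_if_error=None):
    critical = (fail_if_error if fail_if_error is not None
                else bool(CRITICAL_ROLES & set(roles)))
    if critical:
        raise RuntimeError(
            "Deployment aborted: critical node %s (%s) failed" % (node_uid, roles))
    # Non-critical (e.g. a cinder node with fail_if_error=False): record the
    # error and let the rest of the deployment continue.
    return {"uid": node_uid, "status": "error", "roles": roles}

print(handle_node_failure("7", ["cinder"], fail_if_error=False))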

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/184115

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Dima Shulyak (dshulyak)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/184245
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=ea7ea8ffff4b3f0f54ea21d6ef6ca9a01fe2f7aa
Submitter: Jenkins
Branch: master

commit ea7ea8ffff4b3f0f54ea21d6ef6ca9a01fe2f7aa
Author: Dmitry Shulyak <email address hidden>
Date: Tue May 19 16:23:40 2015 +0300

    Do not include pending deletion node in upgradable nodes

    Previously pending for deletion nodes were included in deployment
    if they were selected for update, by deploying other roles

    You can observe one example of this behaviour in test

    Closes-Bug: 1446202
    Change-Id: Ic80e1421eea6a3227fc8d890bfdbab933da72fbc
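
A hedged sketch of the idea described in the commit message above (not the actual fuel-web change): nodes already marked for deletion should not be pulled back into the set of nodes selected for a role update.

from dataclasses import dataclass, field

@dataclass
class Node:
    uid: str
    roles: list = field(default_factory=list)
    pending_deletion: bool = False

def nodes_to_update(cluster_nodes, affected_roles):
    """Nodes carrying an affected role, excluding pending-deletion nodes."""
    return [n for n in cluster_nodes
            if not n.pending_deletion and set(n.roles) & set(affected_roles)]

cluster = [
    Node("1", ["controller"]),
    Node("3", ["controller", "cinder"], pending_deletion=True),  # being removed
    Node("8", ["controller", "cinder"]),                         # replacement node
]
assert [n.uid for n in nodes_to_update(cluster, ["controller"])] == ["1", "8"]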

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

http://jenkins-product.srt.mirantis.net:8080/view/custom_iso/job/6.1.custom_system_test/119/ (OK)
It is red because of a communication problem between the tests and Jenkins, but the tests themselves finished successfully.
