[10.0-mitaka]bvt2: Rabbitmq starts too long due to OCF script false-negative generic error on secondary nodes

Bug #1599479 reported by Alexey Golubev
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Alex Schultz
Mitaka
Invalid
High
Alex Schultz

Bug Description

Failed Jenkins job: https://product-ci.infra.mirantis.net/job/10.0-mitaka.main.ubuntu.bvt_2/58

From sys_test.log:
2016-07-06 11:28:56,051 - DEBUG __init__.py:61 -- Done: task_wait with result: {u'status': u'error', u'name': u'deploy', u'cluster': 1, u'result': {}, u'progress': 100, u'message': u'Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/2] Stopping the deployment process!', u'id': 5, u'uuid': u'e956a4e3-0d59-499a-adf5-d17a5031c201'}
2016-07-06 11:28:56,057 - ERROR __init__.py:66 -- assert_task_success raised: AssertionError("Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/2] Stopping the deployment process!'",)
Traceback: Traceback (most recent call last):
  File "/home/jenkins/workspace/10.0-mitaka.main.ubuntu.bvt_2/fuelweb_test/__init__.py", line 59, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-mitaka.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 330, in assert_task_success
    task["name"], task['status'], 'ready', _message(task)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/asserts.py", line 55, in assert_equal
    raise ASSERTION_ERROR(message)
AssertionError: Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/2] Stopping the deployment process!'

summary: - [10.0-mitaka]bwt2: Failed tasks: Task[primary-rabbitmq/2]
+ [10.0-mitaka]bvt2: Failed tasks: Task[primary-rabbitmq/2]
tags: added: bvt-failure
Roman Vyalov (r0mikiam)
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → High
milestone: none → 10.0
Revision history for this message
Maksim Malchuk (mmalchuk) wrote : Re: [10.0-mitaka]bvt2: Failed tasks: Task[primary-rabbitmq/2]

snapshot again without logs, but maybe env exists, will ask #infra for access

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Maksim Malchuk (mmalchuk)
status: New → Confirmed
tags: added: area-library
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

primary-rabbitmq task failed by timeout, starting rabbitmq-server takes 22 minutes, but we have the time out for the task set to 900sec (15 min).

task timeouts were decreased in the https://review.openstack.org/#/c/320281/

as a quick solution I propose to revert back the timeout value for primary-rabbitmq task back

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/338372

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Vladimir Kuklin (vkuklin) wrote : Re: [10.0-mitaka]bvt2: Failed tasks: Task[primary-rabbitmq/2]

The root cause is that monitor command on the nodes that are not yet running rabbitmq return incorrect exit code 1 instead of 7 which leads to a cascade of post-stop notifications. Sometimes these post-stop notifications interfere with regular start sequence leading to this bug

2016-07-06T11:17:42.193913+00:00 warning: warning: Action 23 (p_rabbitmq-server_monitor_0) on node-4.test.domain.local failed (target: 7 vs. rc: 1): Error
2016-07-06T11:17:42.194349+00:00 warning: warning: Action 23 (p_rabbitmq-server_monitor_0) on node-4.test.domain.local failed (target: 7 vs. rc: 1): Error
2016-07-06T11:17:42.197117+00:00 warning: warning: Action 22 (p_rabbitmq-server_monitor_0) on node-3.test.domain.local failed (target: 7 vs. rc: 1): Error
2016-07-06T11:17:42.197469+00:00 warning: warning: Action 22 (p_rabbitmq-server_monitor_0) on node-3.test.domain.local failed (target: 7 vs. rc: 1): Error

summary: - [10.0-mitaka]bvt2: Failed tasks: Task[primary-rabbitmq/2]
+ [10.0-mitaka]bvt2: Rabbitmq starts too long due to OCF script false-
+ negative generic error on secondary nodes
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please explain the high impact and "too long" criterion

Changed in fuel:
importance: High → Medium
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Normally it doesn't start for 22 minutes, this should be the env specific issues, never observed at standard installs. Need more reproduces outside of CI/swarm nodes.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

So, another issue is that why we are running monitor commands for services not designated to run on these nodes. This could be an issue with our pacemaker configuration.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Bogdan, it is obvious according to lrmd.log that pacemaker gets RC 1 for nodes that should not run rabbitmq at all on these nodes. So it monitors nodes and sees that services get GENERIC_ERR instead of NOT_RUNNING.

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Bogdan, it was High because of BVT failure.

Changed in fuel:
importance: Medium → High
tags: added: tricky
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

> Bogdan, it is obvious according to lrmd.log that pacemaker gets RC 1 for nodes that should not run rabbitmq at all on these nodes

what do you mean? It is a clone and shall be running on all nodes in the cluster

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Bogdan, primary-rabbitmq started only on primary controller, in this time other controllers doesn't have the rabbitmq-server running. But OCF script tries to stop rabbitmq-server on the other controllers.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/338372
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b856aec87533ecefa7c3e6b8a9ce2c11a305f0df
Submitter: Jenkins
Branch: master

commit b856aec87533ecefa7c3e6b8a9ce2c11a305f0df
Author: Maksim Malchuk <email address hidden>
Date: Wed Jul 6 18:56:44 2016 +0300

    Increase the timeout for primary-rabbitmq task

    Sometimes starting rabbitmq-server on the primary controller takes
    more than 15 minutes, the timeouts were decreased in the [0] fix for
    all tasks. This change increases the timeout to 1800 (30 min).

    [0] https://review.openstack.org/#/c/320281/

    Change-Id: Ia653d63601e0898a0d6ab850ec391a3898df470a
    Related-Bug: #1599479
    Signed-off-by: Maksim Malchuk <email address hidden>

Revision history for this message
Alex Schultz (alex-schultz) wrote :

The lrmd.log points to the fact that the fuel rabbitmq ocf script is put in place but since we're just doing a source of the upstream script, it's erroring on that because it's not installed yet.

2016-07-06T11:17:06.488826+00:00 notice: notice: p_rabbitmq-server_monitor_0:32505:stderr [ /usr/lib/ocf/resource.d/fuel/rabbitmq-server: line 30: /usr/lib/ocf/resource.d/rabbitmq/rabbitmq-server-ha: No such file or directory ]

This seems to be an issue with the way we're leveraging the upstream script.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/339216

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Alex Schultz (alex-schultz)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/339238

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/339216
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f60e93dfe6ac842e0b3a0b3d451d98c77196454d
Submitter: Jenkins
Branch: master

commit f60e93dfe6ac842e0b3a0b3d451d98c77196454d
Author: Alex Schultz <email address hidden>
Date: Thu Jul 7 14:06:28 2016 -0600

    Check if upstream rabbitmq ocf script is installed

    This change adds a check to the fuel rabbitmq ocf script to ensure we
    are returning the correct return code. Because the rabbitmq cluster
    configuration may exist on all nodes prior to the rabbitmq-server
    package, we get into a state where the rabbitmq-server script is
    partially available. But since we are sourcing the upstream script, when
    it fails it returns 1 which is being interpreted as the wrong error.
    This change returns not running if the script is not available since
    clearly we can't be running if the package is not installed yet.

    Change-Id: Idbc2a9ded39a47e06183793ac4a63115f93c9ba6
    Closes-Bug: #1599479

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/339238
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=13797086b026def18b27338004fd835bd7697d5a
Submitter: Jenkins
Branch: master

commit 13797086b026def18b27338004fd835bd7697d5a
Author: Alex Schultz <email address hidden>
Date: Thu Jul 7 21:13:40 2016 +0000

    Revert "Increase the timeout for primary-rabbitmq task"

    This reverts commit b856aec87533ecefa7c3e6b8a9ce2c11a305f0df.

    Change-Id: I66356a179f40f1a5ad94feb99b6c6b20f9be5eb8
    Related-Bug: #1599479

Revision history for this message
Alex Schultz (alex-schultz) wrote :

The fix still was missing the include for ocf_log so now when this occurs the error is:
2016-07-12T23:36:30.625253+00:00 notice: notice: p_rabbitmq-server_notify_0:6243:stderr [ /usr/lib/ocf/resource.d/fuel/rabbitmq-server: line 33: ocf_log: command not found ]

Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/341826

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/341826
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=058cd5bf0f24d1f0590a4eba6dddec23ade003ba
Submitter: Jenkins
Branch: master

commit 058cd5bf0f24d1f0590a4eba6dddec23ade003ba
Author: Alex Schultz <email address hidden>
Date: Wed Jul 13 16:19:40 2016 -0600

    Include ocf-shellfuncs for rabbitmq-server

    If the upstream script is not available, we were using the ocf_log
    command but missed the inclusion of the ocf-shellfuncs that defined it.
    This pulls in ocf-shellfuncs so we will not improperly error when the
    timing condition occurs.

    Change-Id: If058cb1e7eaafbdc962ba17d4d730924cc086cc5
    Closes-Bug: #1599479

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Invalid for Mitaka unless we backport https://review.openstack.org/#/c/317912/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0rc1

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0

This issue was fixed in the openstack/fuel-library 10.0.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.