[10.0][BVT] Deployment process stops after failure of task rabbitmq on one of the nodes

Bug #1626933 reported by Roman Podoliaka
This bug affects 2 people
Affects: Mirantis OpenStack (status tracked in 10.0.x)

Series   Status          Importance   Assigned to   Milestone
10.0.x   Fix Committed   High         MOS Oslo
9.x      Confirmed       Medium       MOS Oslo

Bug Description

A recent run of the 10.0 BVT job ( https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.bvt_2/702/ ) failed with the following error:

2016-09-23 00:26:05 DEBUG [16227] Cluster[]: Process node: Node[3]
2016-09-23 00:26:05 DEBUG [16227] Node[3]: Node 3: task rabbitmq, task status running
2016-09-23 00:26:05 ERROR [16227] Node 3(rabbitmq) status: error
2016-09-23 00:26:05 DEBUG [16227] Node 3 has failed to deploy. There is no more retries for puppet run.
2016-09-23 00:26:05 DEBUG [16227] {"nodes"=>[{"status"=>"error", "error_type"=>"deploy", "uid"=>"3", "role"=>"rabbitmq"}]}
2016-09-23 00:26:05 DEBUG [16227] Task time summary: rabbitmq with status failed on node 3 took 00:12:54

After that, the deployment goes into the error state:

2016-09-23 00:29:45 INFO [16227] Cluster[]: All nodes are finished. Failed tasks: Task[rabbitmq/3] Stopping the deployment process!

The task itself fails with:

/usr/lib/ruby/vendor_ruby/puppet/util/command_line.rb:92:in `execute'
/usr/bin/puppet:8:in `<main>'
2016-09-23 00:24:58 +0000 /Stage[main]/Osnailyfacter::Rabbitmq::Rabbitmq/Rabbitmq_user[nova]/password (err): change from to <new password> failed: Execution of '/usr/sbin/rabbitmqctl eval rabbit_access_control:check_user_pass_login(list_to_binary("nova"), list_to_binary("2ffclb6vSLuwp0TDGkJIWyof")).' returned 70: Error: {badarg,[{rabbit_misc,dirty_read,1,[]},
                {erlang,whereis,1,[]},
                {lists,foldl,3,[]},
                {erlang,whereis,1,[]},
                {rpc,'-handle_call_call/6-fun-0-',5,[]},
                {erlang,whereis,1,[]}]}
2016-09-23 00:26:03 +0000 /Stage[main]/Osnailyfacter::Rabbitmq::Rabbitmq/Rabbitmq_user_permissions[nova@/] (notice): Dependency Rabbitmq_user[nova] has failures: true
2016-09-23 00:26:03 +0000 /Stage[main]/Osnailyfacter::Rabbitmq::Rabbitmq/Rabbitmq_user_permissions[nova@/] (warning): Skipping because of failed dependencies
2016-09-23 00:26:05 +0000 Puppet (notice): Finished catalog run in 89.28 seconds

Tags: area-oslo
Roman Podoliaka (rpodolyaka) wrote :
Changed in mos:
milestone: none → 10.0
importance: Undecided → Medium
status: New → Confirmed
tags: added: area-oslo
Alexey Lebedeff (alebedev-a) wrote :

The most probable cause is that the 'mnesia' and 'rabbit' applications were stopped inside the Erlang VM (or had not yet completed their startup) when this command failed.
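
To illustrate the timing issue, here is a minimal sketch of polling a readiness check before running a command that depends on the 'rabbit' application being up. The names `wait_until_ready` and `check` are hypothetical and are not part of Puppet, Fuel, or RabbitMQ; how readiness is actually probed is left to the caller and is not specified in this report.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, pausing `delay` seconds between tries.

    Returns True as soon as `check` returns True, or False if every try fails.
    `check` is a caller-supplied (hypothetical) probe, e.g. "is the broker
    application running yet?".
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts:
            sleep(delay)
    return False
```

A caller would run the dependent command only if `wait_until_ready(...)` returns True, and report a timeout otherwise. This only narrows the race window; it does not address why the restart was slow in the first place.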

Dmitry Mescheryakov (dmitrymex) wrote :

It seems that the root cause of the issue is that the RabbitMQ restart took too long on node-3: it went down at 00:17 and came back up only at 00:26, as can be seen in lrmd.log from node-3. The restart itself was triggered by an update of the host_ip OCF parameter.

The long restart appears to have been caused by a failure of the stop action:
2016-09-23T00:17:33.532795+00:00 err: ERROR: RMQ-runtime (beam) couldn't be stopped and will likely became unmanaged. Take care of it manually!
2016-09-23T00:17:33.538996+00:00 info: INFO: p_rabbitmq-server[10049]: stop: action end.

This led Pacemaker to consider the resource failed:
Sep 23 00:17:33 [9134] node-1.test.domain.local attrd: info: attrd_cib_callback: Update 151 for fail-count-p_rabbitmq-server[node-3.test.domain.local]=INFINITY: OK (0)
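
When triaging logs like the one above, the fail-count update can be pulled out with a small parser. This is only a sketch: the function name `parse_fail_count` and the regular expression are assumptions based on the single attrd line quoted in this comment, not on a documented Pacemaker log format.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "fail-count-p_rabbitmq-server[node-3.test.domain.local]=INFINITY:"
# The pattern is inferred from the one log line quoted above (an assumption).
FAIL_COUNT_RE = re.compile(
    r"fail-count-(?P<resource>[^\[\s]+)\[(?P<node>[^\]]+)\]=(?P<value>[^:\s]+)"
)


def parse_fail_count(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (resource, node, value) from an attrd fail-count line, or None."""
    match = FAIL_COUNT_RE.search(line)
    if match is None:
        return None
    return match.group("resource"), match.group("node"), match.group("value")


if __name__ == "__main__":
    sample = (
        "Sep 23 00:17:33 [9134] node-1.test.domain.local attrd: info: "
        "attrd_cib_callback: Update 151 for "
        "fail-count-p_rabbitmq-server[node-3.test.domain.local]=INFINITY: OK (0)"
    )
    print(parse_fail_count(sample))
```

A value of INFINITY for p_rabbitmq-server on node-3 is what marks the resource as failed there.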

To sum up: we need to fix the stop action of the OCF script so that it does not fail sporadically. The fix will benefit the Mitaka code as well.

Bogdan Dobrelya (bogdando) wrote :

This must be fixed upstream first. Could you please link the PR?

Bogdan Dobrelya (bogdando) wrote :

MOS 10 consumes the OCF RA from the RabbitMQ distribution, so this bug is invalid for the MOS scope.

Michael Polenchuk (mpolenchuk) wrote :
Alexey Lebedeff (alebedev-a) wrote :
Michael Polenchuk (mpolenchuk) wrote :

Fixed by the RabbitMQ 3.6.6 update for F10.
