Adding of new controller fails with network templates: RabbitMQ server doesn't start on new node

Bug #1540915 reported by Artem Panchenko
This bug affects 1 person
Affects             Status   Importance  Assigned to   Milestone
Fuel for OpenStack  Invalid  High        Fuel QA Team
8.0.x               Invalid  High        Fuel QA Team

Bug Description

Adding a new controller fails when network templates are used, because the RabbitMQ server doesn't start on the new node during deployment:

2016-02-02 00:54:09 DEBUG [787] Node 4(controller) status: error
2016-02-02 00:54:09 DEBUG [787] Node 4 has failed to deploy. There is no more retries for puppet run.
2016-02-02 00:54:09 DEBUG [787] {"nodes"=>[{"status"=>"error", "error_type"=>"deploy", "uid"=>"4", "role"=>"controller"}]}
2016-02-02 00:54:09 ERROR [787] Task '{"priority"=>3800, "type"=>"puppet", "id"=>"rabbitmq", "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/rabbitmq/rabbitmq.pp", "timeout"=>3600, "cwd"=>"/"}, "uids"=>["4"]}' failed on node 4

Puppet logs:

2016-02-02T00:36:28.207330+00:00 debug: Executing '/usr/sbin/rabbitmqctl -q list_users'
2016-02-02T00:36:28.717067+00:00 debug: Fail: Execution of '/usr/sbin/rabbitmqctl -q list_users' returned 139: Error: unable to connect to node 'rabbit@messaging-node-4': nodedown
2016-02-02T00:36:28.717067+00:00 debug:
2016-02-02T00:36:28.717383+00:00 debug: DIAGNOSTICS
2016-02-02T00:36:28.717383+00:00 debug: ===========
2016-02-02T00:36:28.717383+00:00 debug:
2016-02-02T00:36:28.717383+00:00 debug: attempted to contact: ['rabbit@messaging-node-4']
2016-02-02T00:36:28.717383+00:00 debug:
2016-02-02T00:36:28.717383+00:00 debug: rabbit@messaging-node-4:
2016-02-02T00:36:28.717383+00:00 debug: * connected to epmd (port 4369) on messaging-node-4
2016-02-02T00:36:28.717383+00:00 debug: * epmd reports: node 'rabbit' not running at all
2016-02-02T00:36:28.717383+00:00 debug: other nodes on messaging-node-4: ['rabbitmq-cli-32534']
2016-02-02T00:36:28.718276+00:00 debug: * suggestion: start the node
2016-02-02T00:36:28.718276+00:00 debug:
2016-02-02T00:36:28.718276+00:00 debug: current node details:
2016-02-02T00:36:28.718276+00:00 debug: - node name: 'rabbitmq-cli-32534@node-4'
2016-02-02T00:36:28.718276+00:00 debug: - home dir: /var/lib/rabbitmq
2016-02-02T00:36:28.718276+00:00 debug: - cookie hash: soeIWU2jk2YNseTyDSlsEA==
2016-02-02T00:36:28.718276+00:00 debug:
2016-02-02T00:36:28.718276+00:00 debug: Segmentation fault Retry: 30
2016-02-02T00:36:34.721127+00:00 err: (/Stage[main]/Rabbitmq::Management/Rabbitmq_user[guest]) Could not evaluate: Command is still failing after 180 seconds expired!
...
2016-02-02T00:54:07.335991+00:00 err: curl -k --noproxy localhost --retry 30 --retry-delay 6 -f -L -o /var/lib/rabbitmq/rabbitmqadmin http://nova:w5uDGpv4C0ZicVdIovLSCJhu@localhost:15672/cli/rabbitmqadmin returned 7 instead of one of [0]
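
Note on the timings in the log above: the "180 seconds" in the Puppet error and the curl flags (--retry 30 --retry-delay 6) are both consistent with 30 attempts spaced 6 seconds apart. A minimal Python sketch of such a retry loop is given here for illustration only. It is not the Puppet provider's code; the name run_with_retries and all of its parameters are assumptions.

```python
import subprocess
import time


def run_with_retries(cmd, count=30, step=6, sleep=time.sleep, run=subprocess.run):
    """Run cmd until it exits 0, at most `count` times, pausing `step` seconds.

    `sleep` and `run` are injectable only so the loop can be exercised
    without waiting or spawning real processes (an assumption of this sketch).
    """
    for _attempt in range(count):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Non-zero exit (e.g. 139 when rabbitmqctl segfaults): wait and retry.
        sleep(step)
    raise TimeoutError(
        "Command is still failing after %d seconds expired!" % (count * step)
    )
```

With the defaults, a command that never succeeds gives up after 30 * 6 = 180 seconds of waiting, e.g. run_with_retries(["/usr/sbin/rabbitmqctl", "-q", "list_users"]).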

Pacemaker logs:

<27>Feb 2 00:40:25 node-4 crmd[6003]: error: process_lrm_event: Operation p_rabbitmq-server_start_0: Timed Out (node=node-4.test.domain.local, call=179, timeout=360000ms)

The RabbitMQ startup log is attached. Also, when I try to execute the OCF script manually, I get this:

http://paste.openstack.org/show/485724/
http://paste.openstack.org/show/485725/

This issue can be reproduced with the 'add_nodes_net_tmpl' system test:

https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_net_templates.py#L203

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
tags: added: area-library
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This does not look like a dup: here rabbitmq-server was failing to start every time.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Mescheryakov (dmitrymex)
tags: added: area-mos
removed: area-library
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

The reason for the crash is ["inet_tcp",eaddrinuse].
So there are several possibilities:
1) Something unrelated is listening on port 25672.
2) Rabbit somehow became de-registered from epmd (or it was killed). In the worst case it will take rabbit 60 seconds to register in epmd again.
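
Possibility 1 can be checked directly on the node. The following is a minimal Python sketch, assuming it is run on the affected controller while rabbitmq-server is stopped; the helper name port_is_free is hypothetical. It simply tries to bind the port and reports whether the kernel answers with EADDRINUSE, the same error seen in the crash.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now.

    SO_REUSEADDR is deliberately not set, so sockets lingering in
    TIME_WAIT on that port may also be reported as "in use".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # Any other failure (e.g. permissions) is not our case.
        return True


if __name__ == "__main__":
    print("25672 free:", port_is_free(25672))
```

A False result with rabbit stopped would support possibility 1. Finding which process holds the port needs a separate tool.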

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I'm not familiar with the deployment process, but Pacemaker starts monitoring rabbit even before the package is installed. Is that normal?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Tentatively moving this to 8.0-updates. We'll raise the priority to Critical, if this absolutely needs to be fixed in 8.0. The investigation continues.

tags: added: move-to-mu
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

This is not a duplicate of bug 1541029: here rabbitmq simply refused to start every time. Atop logs confirm that the beam process appeared for some time, then disappeared, then appeared again with a new pid, and so on. BTW, I am not sure whether it is related, but epmd stayed alive through all this time.

QA team, we need a live reproduction to investigate the issue.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Dmitry Mescheryakov (dmitrymex) → Fuel QA Team (fuel-qa)
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

Looks like it has been reproduced at
https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.services_reconfiguration_thread_1/29/console
Building remotely on srv88-bud.infra.mirantis.net
source /home/jenkins/venv-nailgun-tests-2.9/bin/activate; dos.py revert-resume 8.0.system_test.ubuntu.services_reconfiguration_thread_1.29.29 error_reconfiguration_scalability && ssh root@10.109.38.2
on node-6

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

>I'm not familiar with the deployment process, but Pacemaker starts monitoring rabbit even before the package is installed. Is that normal?

Normally, the package must be installed first so that all of the binaries are in place. Next, the Pacemaker resource must be created, which starts the OCF RA. So this is perhaps a race bug.
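
The ordering described above (binaries in place first, resource creation second) can be expressed as a simple guard. The Python sketch below is an illustration only and is not the change merged for this bug; the function name wait_for_executable and its parameters are hypothetical.

```python
import shutil
import time


def wait_for_executable(name, timeout=60.0, interval=1.0,
                        which=shutil.which, sleep=time.sleep,
                        clock=time.monotonic):
    """Block until `name` is found on PATH, or raise TimeoutError.

    `which`, `sleep` and `clock` are injectable so the guard can be
    tested without touching the real filesystem or waiting.
    """
    deadline = clock() + timeout
    while True:
        path = which(name)
        if path:
            return path  # Package is installed; safe to proceed.
        if clock() >= deadline:
            raise TimeoutError("%s not found after %s seconds" % (name, timeout))
        sleep(interval)
```

A caller would run wait_for_executable("rabbitmq-server") before creating the Pacemaker resource, so that the OCF RA is never started against a node where the package is still being installed.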

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that the logs in comment #10 also indicate a race between the package install and the OCF RA resource creation in Pacemaker: http://pastebin.com/fmMsRyCZ

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/289860

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/289860
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=9be170dafa906b6934c3091f5994a65705e40a03
Submitter: Jenkins
Branch: master

commit 9be170dafa906b6934c3091f5994a65705e40a03
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Mar 7 15:39:58 2016 +0100

    Fix race of the rabbitmq OCF RA vs package install

    Closes-bug: #1553077
    Related-bug: #1540915

    Change-Id: I89e27e136062a0c3508a337abf81381bcdd56790
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Expired

Changed in fuel:
status: Incomplete → Invalid