Fuel for OpenStack

Adding of new controller fails with network templates: RabbitMQ server doesn't start on new node

Bug #1540915 reported by Artem Panchenko on 2016-02-02

Affects             Status   Importance  Assigned to   Milestone
Fuel for OpenStack  Invalid  High        Fuel QA Team  Fuel for OpenStack 9.0
8.0.x               Invalid  High        Fuel QA Team  Fuel for OpenStack 8.0-updates

Bug Description

Adding a new controller fails when network templates are used, because the RabbitMQ server doesn't start on the new node during deployment:

2016-02-02 00:54:09 DEBUG [787] Node 4(controller) status: error
2016-02-02 00:54:09 DEBUG [787] Node 4 has failed to deploy. There is no more retries for puppet run.
2016-02-02 00:54:09 DEBUG [787] {"nodes"=>[{"status"=>"error", "error_type"=>"deploy", "uid"=>"4", "role"=>"controller"}]}
2016-02-02 00:54:09 ERROR [787] Task '{"priority"=>3800, "type"=>"puppet", "id"=>"rabbitmq", "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/rabbitmq/rabbitmq.pp", "timeout"=>3600, "cwd"=>"/"}, "uids"=>["4"]}' failed on node 4

Puppet logs:

2016-02-02T00:36:28.207330+00:00 debug: Executing '/usr/sbin/rabbitmqctl -q list_users'
2016-02-02T00:36:28.717067+00:00 debug: Fail: Execution of '/usr/sbin/rabbitmqctl -q list_users' returned 139: Error: unable to connect to node 'rabbit@messaging-node-4': nodedown
2016-02-02T00:36:28.717067+00:00 debug:
2016-02-02T00:36:28.717383+00:00 debug: DIAGNOSTICS
2016-02-02T00:36:28.717383+00:00 debug: ===========
2016-02-02T00:36:28.717383+00:00 debug:
2016-02-02T00:36:28.717383+00:00 debug: attempted to contact: ['rabbit@messaging-node-4']
2016-02-02T00:36:28.717383+00:00 debug:
2016-02-02T00:36:28.717383+00:00 debug: rabbit@messaging-node-4:
2016-02-02T00:36:28.717383+00:00 debug: * connected to epmd (port 4369) on messaging-node-4
2016-02-02T00:36:28.717383+00:00 debug: * epmd reports: node 'rabbit' not running at all
2016-02-02T00:36:28.717383+00:00 debug: other nodes on messaging-node-4: ['rabbitmq-cli-32534']
2016-02-02T00:36:28.718276+00:00 debug: * suggestion: start the node
2016-02-02T00:36:28.718276+00:00 debug:
2016-02-02T00:36:28.718276+00:00 debug: current node details:
2016-02-02T00:36:28.718276+00:00 debug: - node name: 'rabbitmq-cli-32534@node-4'
2016-02-02T00:36:28.718276+00:00 debug: - home dir: /var/lib/rabbitmq
2016-02-02T00:36:28.718276+00:00 debug: - cookie hash: soeIWU2jk2YNseTyDSlsEA==
2016-02-02T00:36:28.718276+00:00 debug:
2016-02-02T00:36:28.718276+00:00 debug: Segmentation fault Retry: 30
2016-02-02T00:36:34.721127+00:00 err: (/Stage[main]/Rabbitmq::Management/Rabbitmq_user[guest]) Could not evaluate: Command is still failing after 180 seconds expired!
...
2016-02-02T00:54:07.335991+00:00 err: curl -k --noproxy localhost --retry 30 --retry-delay 6 -f -L -o /var/lib/rabbitmq/rabbitmqadmin http://nova:w5uDGpv4C0ZicVdIovLSCJhu@localhost:15672/cli/rabbitmqadmin returned 7 instead of one of [0]

Pacemaker logs:

<27>Feb 2 00:40:25 node-4 crmd[6003]: error: process_lrm_event: Operation p_rabbitmq-server_start_0: Timed Out (node=node-4.test.domain.local, call=179, timeout=360000ms)

The RabbitMQ startup log is attached. Also, when I try to run the OCF script manually, I get this:

http://paste.openstack.org/show/485724/
http://paste.openstack.org/show/485725/

This issue can be reproduced with the 'add_nodes_net_tmpl' system test:

https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_net_templates.py#L203

Artem Panchenko (apanchenko-8) wrote on 2016-02-02:

startup_log (2.4 KiB, text/plain)

Artem Panchenko (apanchenko-8) wrote on 2016-02-02:

fail_error_add_nodes_net_tmpl-fuel-snapshot-2016-02-02_00-54-33.tar.xz (60.3 MiB, application/octet-stream)

Bogdan Dobrelya (bogdando) on 2016-02-02

tags:	added: area-library

Dmitry Pyzhov (dpyzhov) on 2016-02-03

Changed in fuel:
status:	New → Confirmed

Michael Polenchuk (mpolenchuk) wrote on 2016-02-03:

Duplicate of https://bugs.launchpad.net/fuel/+bug/1541029

Bogdan Dobrelya (bogdando) wrote on 2016-02-03:

This doesn't look like a duplicate: here rabbitmq-server failed to start every time.

Bogdan Dobrelya (bogdando) on 2016-02-03

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Dmitry Mescheryakov (dmitrymex)
tags:	added: area-mos removed: area-library

Alexey Lebedeff (alebedev-a) wrote on 2016-02-03:

The reason for the crash is ["inet_tcp",eaddrinuse], so there are several possibilities:
1) Something unrelated is listening on port 25672
2) Rabbit somehow became de-registered from epmd (or it was killed); in the worst case it will take rabbit 60 seconds to register in epmd again
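
To check the first possibility on a live node, one could probe the port directly. Below is a minimal Python sketch, not part of Fuel or RabbitMQ: `port_in_use` is a hypothetical helper name, and the address and port in the usage part (0.0.0.0 and 25672, the port mentioned above) are assumptions to adjust for the environment. It attempts a plain bind, which is the operation that fails with eaddrinuse.

```python
import errno
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding a TCP socket to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # No SO_REUSEADDR on purpose: an existing listener must make this fail.
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            # Any other error (permissions, bad address) is not what we are testing.
            raise
    return False


if __name__ == "__main__":
    # Assumed values: 25672 is the port named in the comment above.
    busy = port_in_use("0.0.0.0", 25672)
    print("port 25672 is busy" if busy else "port 25672 is free")
```

If the port is reported busy while no beam process is running, that would point toward an unrelated listener rather than a stale epmd registration.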

Alexey Lebedeff (alebedev-a) wrote on 2016-02-03:

I'm not familiar with the deployment process, but Pacemaker starts monitoring rabbit even before the package is installed. Is that normal?

Roman Podoliaka (rpodolyaka) wrote on 2016-02-03:

Tentatively moving this to 8.0-updates. We'll raise the priority to Critical if this absolutely needs to be fixed in 8.0. The investigation continues.

tags:	added: move-to-mu

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-05:

This is not a duplicate of bug 1541029: here rabbitmq simply refused to start every time. Atop logs confirm that the beam process appeared for some time, then disappeared, then appeared again with a new pid, and so on. BTW, I am not sure if it is related, but epmd stayed alive through all of this.
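
During a live reproduction, it may help to record which Erlang nodes epmd has registered while beam is flapping. Below is a minimal Python sketch for that; `parse_epmd_names` is a hypothetical helper, and the line format it expects ("name <node> at port <port>", as printed by `epmd -names`) is an assumption that should be checked against the installed Erlang version.

```python
import re
import subprocess
from typing import Dict

# Assumed format of one registration line in `epmd -names` output.
NAME_LINE = re.compile(r"^name (\S+) at port (\d+)$")


def parse_epmd_names(output: str) -> Dict[str, int]:
    """Map each registered Erlang node name to its distribution port."""
    nodes: Dict[str, int] = {}
    for line in output.splitlines():
        match = NAME_LINE.match(line.strip())
        if match:
            nodes[match.group(1)] = int(match.group(2))
    return nodes


if __name__ == "__main__":
    # Requires the epmd binary on PATH; run it on the affected controller.
    result = subprocess.run(
        ["epmd", "-names"], capture_output=True, text=True, check=False
    )
    registered = parse_epmd_names(result.stdout)
    if "rabbit" in registered:
        print("rabbit is registered on port", registered["rabbit"])
    else:
        print("rabbit is not registered; known nodes:", sorted(registered))
```

Sampling this in a loop alongside atop would show whether the `rabbit` registration disappears each time the beam pid changes.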

QA team, we need a live reproduction to investigate the issue.

Changed in fuel:
status:	Confirmed → Incomplete
assignee:	Dmitry Mescheryakov (dmitrymex) → Fuel QA Team (fuel-qa)

Alexandr Kostrikov (akostrikov-mirantis) wrote on 2016-02-17:

#10

Looks like it has been reproduced at
https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.services_reconfiguration_thread_1/29/console
Building remotely on srv88-bud.infra.mirantis.net
source /home/jenkins/venv-nailgun-tests-2.9/bin/activate; dos.py revert-resume 8.0.system_test.ubuntu.services_reconfiguration_thread_1.29.29 error_reconfiguration_scalability && ssh root@10.109.38.2
on node-6

Bogdan Dobrelya (bogdando) wrote on 2016-03-07:

#11

>I'm not familiar with the deployment process, but Pacemaker starts monitoring rabbit even before the package is installed. Is that normal?

Normally, the package must be installed first so that all of the binaries are in place. Next, the Pacemaker resource must be created, which starts the OCF RA. So this is perhaps a race bug.
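
The ordering described above can be illustrated with a small guard that refuses to proceed until the binary is present. This is only a Python sketch of the idea, not the actual fix; `wait_for_executable` and the binary name `rabbitmq-server` are illustrative assumptions.

```python
import shutil
import time


def wait_for_executable(name: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until `name` resolves to an executable, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if shutil.which(name) is not None:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))


if __name__ == "__main__":
    # Assumed binary name; the resource should be created only after this passes.
    if wait_for_executable("rabbitmq-server", timeout=30.0):
        print("binary present, safe to create the Pacemaker resource")
    else:
        print("binary still missing, creating the resource now would race")
```

Polling is a workaround at best; expressing the dependency in the deployment ordering itself avoids the wait entirely.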

Bogdan Dobrelya (bogdando) wrote on 2016-03-07:

#12

Note that the logs in comment #10 also point to a race between the package install and the OCF RA resource creation in Pacemaker: http://pastebin.com/fmMsRyCZ

OpenStack Infra (hudson-openstack) wrote on 2016-03-08: Related fix proposed to fuel-library (master)

#13

Related fix proposed to branch: master
Review: https://review.openstack.org/289860

OpenStack Infra (hudson-openstack) wrote on 2016-03-09: Related fix merged to fuel-library (master)

#14

Reviewed: https://review.openstack.org/289860
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=9be170dafa906b6934c3091f5994a65705e40a03
Submitter: Jenkins
Branch: master

commit 9be170dafa906b6934c3091f5994a65705e40a03
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Mar 7 15:39:58 2016 +0100

Fix race of the rabbitmq OCF RA vs package install

Closes-bug: #1553077
Related-bug: #1540915

Change-Id: I89e27e136062a0c3508a337abf81381bcdd56790
Signed-off-by: Bogdan Dobrelya <email address hidden>

Bogdan Dobrelya (bogdando) wrote on 2016-03-10:

#15

Expired

Changed in fuel:
status:	Incomplete → Invalid
