os-refresh-config run gets stuck on rabbitmq restart

Bug #1334314 reported by Tom Howley on 2014-06-25

This bug affects 6 people

Affects	Status	Importance	Assigned to	Milestone
tripleo	Fix Released	High	Nicholas Randon

Bug Description

Sometimes, when deploying multiple overcloud control nodes, os-refresh-config gets stuck on all of the controller nodes. The os-collect-config log stops at this point:

dib-run-parts Wed Jun 25 13:37:05 UTC 2014 20-haproxy completed
dib-run-parts Wed Jun 25 13:37:05 UTC 2014 Running /opt/stack/os-config-refresh/post-configure.d/40-rabbitmq
+ '[' -d /var/run/rabbitmq ']'
+ '[' -d /mnt/state/var/log/rabbitmq ']'
++ lsb_release -si
+ DISTRO=Debian
++ lsb_release -sc
+ CODENAME=n/a
+ '[' Debian = Ubuntu -a n/a = saucy ']'
+ os-svc-enable -n rabbitmq-server
+ os-svc-restart -n rabbitmq-server
rabbitmq-server stop/waiting
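
One way to keep a hang like this from blocking the whole os-refresh-config run is to put an upper bound on the restart. The following is a minimal sketch, not taken from the element itself: it assumes GNU coreutils `timeout` is available, and `restart_with_timeout` and `RESTART_TIMEOUT` are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Sketch: bound a service restart so a hung stop surfaces as an error
# instead of blocking forever. Names here are illustrative only.

RESTART_TIMEOUT="${RESTART_TIMEOUT:-120}"

restart_with_timeout() {
    # Usage: restart_with_timeout <seconds> <command> [args...]
    local limit="$1"
    shift
    local rc=0
    # GNU timeout exits with status 124 when the time limit is reached.
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example call (not executed here):
#   restart_with_timeout "$RESTART_TIMEOUT" os-svc-restart -n rabbitmq-server
```

A timeout only makes the failure visible; it does not address why the stop never completes.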

Tom Howley (tom-howley) wrote on 2014-06-25:

rabbitmq process looks like this:

root 11981 0.0 0.0 4316 620 ? Ss 15:09 0:00 /bin/sh /usr/sbin/rabbitmqctl stop /var/run/rabbitmq/pid
root 11990 0.0 0.0 42496 1344 ? S 15:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "stop" "/var/run/rabbitmq/pid"
rabbitmq 11991 0.0 0.0 4316 376 ? Ss 15:09 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "stop" "/var/run/rabbitmq/pid"
rabbitmq 11992 0.1 0.3 420340 14104 ? Sl 15:09 0:00 \_ /usr/lib/erlang/erts-6.0/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /mnt/state/var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../ebin -noshell -noinput -hidden -s
rabbitmq 12022 0.0 0.0 11444 452 ? Ss 15:09 0:00 \_ inet_gethost 4
rabbitmq 12023 0.0 0.0 13544 624 ? S 15:09 0:00 \_ inet_gethost 4

Tom Howley (tom-howley) wrote on 2014-06-25:

Correction: the process listing above shows the stop phase of the rabbitmq restart, which is as far as the restart gets.

Ben Nemec (bnemec) on 2014-06-26

Changed in tripleo:
status:	New → Triaged
importance:	Undecided → High

Cian O'Driscoll (dricco) wrote on 2014-07-14:

This is easily reproduced by restarting the seed and then running os-refresh-config.

Some other notes:

root@hLinux:/opt/stack/os-config-refresh# service rabbitmq-server stop
stop: Job has already been stopped: rabbitmq-server

But the service is still running.

Only the following manages to stop rabbit:

root@hLinux:/opt/stack/os-config-refresh# /usr/sbin/rabbitmq-server stop
ERROR: node with name "rabbit" already running on "hLinux"

DIAGNOSTICS
===========

nodes in question: [rabbit@hLinux]

hosts, their running nodes and ports:
- hLinux: [{rabbitmqctl2187,57162},
{rabbit,5535},
{rabbitmqprelaunch10094,37627}]

current node details:
- node name: rabbitmqprelaunch10094@hLinux
- home dir: /mnt/state/var/lib/rabbitmq
- cookie hash: 4hgbUBSmcVnMIwyP4NbFtg==
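
The mismatch described in this comment (the job is reported as already stopped while the broker is still running) can be detected by checking the process itself rather than trusting the job state. A minimal sketch follows, assuming the file /var/run/rabbitmq/pid seen in the process listings holds the broker's PID; `pid_is_running` is a hypothetical helper name, not part of any element.

```shell
#!/bin/bash
# Sketch: report whether the process named in a pid file is still alive.
# Assumes the caller may signal that process (e.g. runs as root).

pid_is_running() {
    # Usage: pid_is_running <pidfile>
    local pidfile="$1" pid
    [ -r "$pidfile" ] || return 1
    # read returns non-zero when the file lacks a trailing newline,
    # but still fills in the variable, so ignore its status.
    read -r pid < "$pidfile" || true
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    # Signal 0 performs the existence/permission check without
    # delivering a signal.
    kill -0 "$pid" 2>/dev/null
}

# Example call (not executed here):
#   if pid_is_running /var/run/rabbitmq/pid; then
#       echo "rabbitmq is still running" >&2
#   fi
```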

Cian O'Driscoll (dricco) wrote on 2014-07-14:

strace shows os-svc-restart hanging at the same point:

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2bd668c9d0) = 6241
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x443640, [], SA_RESTORER, 0x7f2bd5cd4480}, {SIG_DFL, [], SA_RESTORER, 0x7f2bd5cd4480}, 8) = 0
wait4(-1, stop: Job has already been stopped: rabbitmq-server
^C <unfinished ...>
root@hLinux:/etc/rabbitmq# ps aux | grep rabbit
root      2158  0.0  0.0   4312   616 ?        Ss   15:04   0:00 /bin/sh /usr/sbin/rabbitmq-server
root      2160  0.0  0.0   4312   616 ?        Ss   15:04   0:00 /bin/sh /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
root      2175  0.0  0.0  42488  1372 ?        S    15:04   0:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server 
root      2181  0.0  0.0  42488  1372 ?        S    15:04   0:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "wait" "/var/run/rabbitmq/pid"
rabbitmq  2186  0.0  0.0   4312   380 ?        Ss   15:04   0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmq-server 
rabbitmq  2187  0.2  0.4 643764 39492 ?        Sl   15:04   0:03 /usr/lib/erlang/erts-6.0/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /mnt/state/var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@hLinux -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/mnt/state/var/log/rabbitmq/rabbit@hLinux.log"} -rabbit sasl_error_logger {file,"/mnt/state/var/log/rabbitmq/rabbit@hLinux-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../plugins" -rabbit plugins_expand_dir "/mnt/state/var/lib/rabbitmq/mnesia/rabbit@hLinux-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/mnt/state/var/lib/rabbitmq/mnesia/rabbit@hLinux"
rabbitmq  2188  0.0  0.0   4312   380 ?        Ss   15:04   0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmqctl  "wait" "/var/run/rabbitmq/pid"
rabbitmq  2190  0.1  0.1 306572 12540 ?        Sl   15:04   0:01 /usr/lib/erlang/erts-6.0/bin/beam -- -root /usr/lib/erlang -progname erl -- -home /mnt/state/var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl2190 -boot start_clean -s rabbit_control_main -nodename rabbit@hLinux -extra wait /var/run/rabbitmq/pid
rabbitmq  2249  0.0  0.0  11468   332 ?        S    15:04   0:00 /usr/lib/erlang/erts-6.0/bin/epmd -daemon
rabbitmq  2741  0.0  0.0  11432   452 ?        Ss   15:04   0:00 inet_gethost 4
rabbitmq  2742  0.0  0.0  17740   672 ?        S    15:04   0:00 inet_gethost 4
root      6257  0.0  0.0  10464   860 pts/1    S+   15:27   0:00 grep rabbit

OpenStack Infra (hudson-openstack) wrote on 2014-08-20: Fix proposed to tripleo-image-elements (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115524

Nicholas Randon (nicholas-randon) on 2014-08-20

Changed in tripleo:
assignee:	nobody → Nicholas Randon (nicholas-randon)

OpenStack Infra (hudson-openstack) on 2014-08-20

Changed in tripleo:
status:	Triaged → In Progress

Nicholas Randon (nicholas-randon) wrote on 2014-09-16:

Note that there was a bug in rabbitmq (present since 3.1.0) that prevented the stop from working correctly:

26027 ensure autoheal does not hang winner node if 'rabbitmqctl stop_app'
issued on other node during healing (since 3.1.0)

This is the reason for the lock-up on stop. Fixing this revealed the clustering restart scripting is not robust and needs some rework.

OpenStack Infra (hudson-openstack) on 2014-10-24

Changed in tripleo:
assignee:	Nicholas Randon (nicholas-randon) → Alexis Lee (alexisl)

Nicholas Randon (nicholas-randon) on 2014-10-24

Changed in tripleo:
assignee:	Alexis Lee (alexisl) → Nicholas Randon (nicholas-randon)

OpenStack Infra (hudson-openstack) on 2014-10-27

Changed in tripleo:
assignee:	Nicholas Randon (nicholas-randon) → Alexis Lee (alexisl)

Alexis Lee (alexisl) on 2014-10-27

Changed in tripleo:
assignee:	Alexis Lee (alexisl) → nobody

Nicholas Randon (nicholas-randon) on 2014-10-27

Changed in tripleo:
assignee:	nobody → Nicholas Randon (nicholas-randon)

OpenStack Infra (hudson-openstack) wrote on 2014-11-05: Fix merged to tripleo-image-elements (master)

Reviewed: https://review.openstack.org/115524
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=b4f59ef86bdaa09554c6741661a13a8f3fb12cba
Submitter: Jenkins
Branch: master

commit b4f59ef86bdaa09554c6741661a13a8f3fb12cba
Author: Nicholas Randon <email address hidden>
Date: Mon Aug 18 19:18:31 2014 +0100

Fix RabbitMQ element clustering start and stop

Prevent upstart respawning from SIGTERM due to post-stop pkill running.

Separate config files out of the install.d script into files to help
readability.

Renumber 40-rabbitmq to 51-rabbitmq and 50-rabbitmq-passwords to
52-rabbitmq-passwords so that ntp runs before these scripts

Graceful start and stop, to prevent split-brain issues. In the non-cluster
case: just restart.

In the cluster case: stop everything gracefully. Start everything. Join
bootstrap node if not bootstrap, otherwise join any node. This prevents getting
two disjoint clusters.

"graceful" means RAM nodes sync with disk nodes before they stop. If they are
stopped unceremoniously, they lose data.

Closes-Bug: #1334314
Change-Id: Ic758256481fdd31d10f4e4a341ae93cb372a0766
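
The join rule in the commit message ("Join bootstrap node if not bootstrap, otherwise join any node") can be sketched as a small selection function. This is an illustration under stated assumptions, not the code from the merged change: `choose_join_target` and the node names in the example are hypothetical.

```shell
#!/bin/bash
# Sketch: pick the cluster node that the local node should join.
# A non-bootstrap node joins the bootstrap node; the bootstrap node
# joins the first other node in the list.

choose_join_target() {
    # Usage: choose_join_target <local_node> <bootstrap_node> [nodes...]
    local local_node="$1" bootstrap_node="$2"
    shift 2
    if [ "$local_node" != "$bootstrap_node" ]; then
        echo "$bootstrap_node"
        return 0
    fi
    local node
    for node in "$@"; do
        if [ "$node" != "$local_node" ]; then
            echo "$node"
            return 0
        fi
    done
    return 1   # no peer available to join
}

# Example call (not executed here), with made-up node names:
#   target=$(choose_join_target rabbit@ctl2 rabbit@ctl1 rabbit@ctl1 rabbit@ctl3)
```

The chosen node would then typically be handed to `rabbitmqctl join_cluster` after `rabbitmqctl stop_app`. Having every non-bootstrap node target the same node is what avoids two disjoint clusters forming.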

Changed in tripleo:
status:	In Progress → Fix Committed

Derek Higgins (derekh) on 2014-12-17

Changed in tripleo:
status:	Fix Committed → Fix Released
