os-refresh-config run gets stuck on rabbitmq restart

Bug #1334314 reported by Tom Howley on 2014-06-25
This bug affects 6 people
Affects: tripleo
Importance: High
Assigned to: Nicholas Randon

Bug Description

Sometimes, when deploying multiple overcloud control nodes, os-refresh-config gets stuck on all of the controller nodes. os-collect-config.log stops at this point:

dib-run-parts Wed Jun 25 13:37:05 UTC 2014 20-haproxy completed
dib-run-parts Wed Jun 25 13:37:05 UTC 2014 Running /opt/stack/os-config-refresh/post-configure.d/40-rabbitmq
+ '[' -d /var/run/rabbitmq ']'
+ '[' -d /mnt/state/var/log/rabbitmq ']'
++ lsb_release -si
+ DISTRO=Debian
++ lsb_release -sc
+ CODENAME=n/a
+ '[' Debian = Ubuntu -a n/a = saucy ']'
+ os-svc-enable -n rabbitmq-server
+ os-svc-restart -n rabbitmq-server
rabbitmq-server stop/waiting
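
When reproducing a hang like the one above, it can help to bound the restart so the shell returns instead of blocking indefinitely. The following is a minimal sketch, assuming GNU coreutils timeout is available on the node. run_bounded is a hypothetical helper name and is not part of os-refresh-config:

```shell
# Hypothetical helper: run a command with an upper time limit.
# Usage: run_bounded SECONDS COMMAND [ARG...]
run_bounded() {
    local limit=$1
    shift
    local rc=0
    # timeout(1) exits with status 124 when the limit is reached.
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

For example, run_bounded 60 os-svc-restart -n rabbitmq-server would give up after 60 seconds and return 124, which makes the hang easy to tell apart from an ordinary failure.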

Tom Howley (tom-howley) wrote :

The rabbitmq process tree looks like this:

root 11981 0.0 0.0 4316 620 ? Ss 15:09 0:00 /bin/sh /usr/sbin/rabbitmqctl stop /var/run/rabbitmq/pid
root 11990 0.0 0.0 42496 1344 ? S 15:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "stop" "/var/run/rabbitmq/pid"
rabbitmq 11991 0.0 0.0 4316 376 ? Ss 15:09 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "stop" "/var/run/rabbitmq/pid"
rabbitmq 11992 0.1 0.3 420340 14104 ? Sl 15:09 0:00 \_ /usr/lib/erlang/erts-6.0/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /mnt/state/var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../ebin -noshell -noinput -hidden -s
rabbitmq 12022 0.0 0.0 11444 452 ? Ss 15:09 0:00 \_ inet_gethost 4
rabbitmq 12023 0.0 0.0 13544 624 ? S 15:09 0:00 \_ inet_gethost 4

Tom Howley (tom-howley) wrote :

Correction: the process listing above shows the point that the rabbitmq stop part of the restart has reached.

Ben Nemec (bnemec) on 2014-06-26
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Cian O'Driscoll (dricco) wrote :

This is easily reproduced by restarting the seed and running os-refresh-config.

Some other notes:

root@hLinux:/opt/stack/os-config-refresh# service rabbitmq-server stop
stop: Job has already been stopped: rabbitmq-server

but the service is still running.

Only the following manages to stop rabbit:

root@hLinux:/opt/stack/os-config-refresh# /usr/sbin/rabbitmq-server stop
ERROR: node with name "rabbit" already running on "hLinux"

DIAGNOSTICS
===========

nodes in question: [rabbit@hLinux]

hosts, their running nodes and ports:
- hLinux: [{rabbitmqctl2187,57162},
           {rabbit,5535},
           {rabbitmqprelaunch10094,37627}]

current node details:
- node name: rabbitmqprelaunch10094@hLinux
- home dir: /mnt/state/var/lib/rabbitmq
- cookie hash: 4hgbUBSmcVnMIwyP4NbFtg==

Cian O'Driscoll (dricco) wrote :

strace shows the same os-svc-restart hang:

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2bd668c9d0) = 6241
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x443640, [], SA_RESTORER, 0x7f2bd5cd4480}, {SIG_DFL, [], SA_RESTORER, 0x7f2bd5cd4480}, 8) = 0
wait4(-1, stop: Job has already been stopped: rabbitmq-server
^C <unfinished ...>
root@hLinux:/etc/rabbitmq# ps aux | grep rabbit
root 2158 0.0 0.0 4312 616 ? Ss 15:04 0:00 /bin/sh /usr/sbin/rabbitmq-server
root 2160 0.0 0.0 4312 616 ? Ss 15:04 0:00 /bin/sh /usr/sbin/rabbitmqctl wait /var/run/rabbitmq/pid
root 2175 0.0 0.0 42488 1372 ? S 15:04 0:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
root 2181 0.0 0.0 42488 1372 ? S 15:04 0:00 su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/var/run/rabbitmq/pid"
rabbitmq 2186 0.0 0.0 4312 380 ? Ss 15:04 0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 2187 0.2 0.4 643764 39492 ? Sl 15:04 0:03 /usr/lib/erlang/erts-6.0/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /mnt/state/var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@hLinux -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../plugins" -rabbit plugins_expand_dir "/mnt/state/var/lib/rabbitmq/mnesia/rabbit@hLinux-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/mnt/state/var/lib/rabbitmq/mnesia/rabbit@hLinux"
rabbitmq 2188 0.0 0.0 4312 380 ? Ss 15:04 0:00 sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/var/run/rabbitmq/pid"
rabbitmq 2190 0.1 0.1 306572 12540 ? Sl 15:04 0:01 /usr/lib/erlang/erts-6.0/bin/beam -- -root /usr/lib/erlang -progname erl -- -home /mnt/state/var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.1.5/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl2190 -boot start_clean -s rabbit_control_main -nodename rabbit@hLinux -extra wait /var/run/rabbitmq/pid
rabbitmq 2249 0.0 0.0 11468 332 ? S 15:04 0:00 /usr/lib/erlang/erts-6.0/bin/epmd -daemon
rabbitmq 2741 0.0 0.0 11432 452 ? Ss 15:04 0:00 inet_gethost 4
rabbitmq 2742 0.0 0.0 17740 672 ? S 15:04 0:00 inet_gethost 4
root 6257 0.0 0.0 10464 860 pts/1 S+ 15:27 0:00 grep rabbit

Changed in tripleo:
assignee: nobody → Nicholas Randon (nicholas-randon)
Changed in tripleo:
status: Triaged → In Progress

Note that there was a bug in rabbitmq (3.1.0) that prevented the stop from working correctly:

    26027 ensure autoheal does not hang winner node if 'rabbitmqctl stop_app'
    issued on other node during healing (since 3.1.0)

This is the reason for the lock-up on stop. Fixing it revealed that the clustering restart scripting is not robust and needs some rework.

Changed in tripleo:
assignee: Nicholas Randon (nicholas-randon) → Alexis Lee (alexisl)
Changed in tripleo:
assignee: Alexis Lee (alexisl) → Nicholas Randon (nicholas-randon)
Changed in tripleo:
assignee: Nicholas Randon (nicholas-randon) → Alexis Lee (alexisl)
Alexis Lee (alexisl) on 2014-10-27
Changed in tripleo:
assignee: Alexis Lee (alexisl) → nobody
Changed in tripleo:
assignee: nobody → Nicholas Randon (nicholas-randon)

Reviewed: https://review.openstack.org/115524
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=b4f59ef86bdaa09554c6741661a13a8f3fb12cba
Submitter: Jenkins
Branch: master

commit b4f59ef86bdaa09554c6741661a13a8f3fb12cba
Author: Nicholas Randon <email address hidden>
Date: Mon Aug 18 19:18:31 2014 +0100

    Fix RabbitMQ element clustering start and stop

    Prevent upstart respawning from SIGTERM due to post-stop pkill running.

    Separate config files out of the install.d script into files to help
    readability.

    Renumber 40-rabbitmq to 51-rabbitmq and 50-rabbitmq-passwords to
    52-rabbitmq-passwords so that ntp runs before these scripts

    Graceful start and stop, to prevent split-brain issues. In the non-cluster
    case: just restart.

    In the cluster case: stop everything gracefully. Start everything. Join
    bootstrap node if not bootstrap, otherwise join any node. This prevents getting
    two disjoint clusters.

    "graceful" means RAM nodes sync with disk nodes before they stop. If they are
    stopped unceremoniously, they lose data.

    Closes-Bug: #1334314
    Change-Id: Ic758256481fdd31d10f4e4a341ae93cb372a0766
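
The join rule described in the commit message above (a non-bootstrap node joins the bootstrap node, and the bootstrap node joins any other node) can be sketched as a small shell function. This is an illustration only and is not the code from the commit. choose_join_target and the node names in the usage note are hypothetical:

```shell
# Hypothetical helper, not taken from tripleo-image-elements.
# Usage: choose_join_target LOCAL_NODE BOOTSTRAP_NODE NODE...
# Prints the node the local node should join; returns 1 if there is no peer.
choose_join_target() {
    local local_node=$1
    local bootstrap_node=$2
    shift 2
    if [ "$local_node" != "$bootstrap_node" ]; then
        # Non-bootstrap nodes always join the bootstrap node.
        echo "$bootstrap_node"
        return 0
    fi
    # The bootstrap node joins the first listed node other than itself.
    local node
    for node in "$@"; do
        if [ "$node" != "$local_node" ]; then
            echo "$node"
            return 0
        fi
    done
    return 1
}
```

With nodes rabbit@c1 (bootstrap), rabbit@c2 and rabbit@c3, calling choose_join_target rabbit@c2 rabbit@c1 rabbit@c1 rabbit@c2 rabbit@c3 prints rabbit@c1. Because every node resolves to a member of the same group, the cluster cannot split into two disjoint halves. The printed name could then be passed to rabbitmqctl join_cluster, run between rabbitmqctl stop_app and rabbitmqctl start_app.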

Changed in tripleo:
status: In Progress → Fix Committed
Derek Higgins (derekh) on 2014-12-17
Changed in tripleo:
status: Fix Committed → Fix Released