manual upgrades don't work for rabbitmq

Bug #1586148 reported by Michał Jastrzębski
This bug affects 1 person
Affects: OpenStack-Ansible | Status: Fix Released | Importance: Undecided | Assigned to: Jean-Philippe Evrard

Bug Description

After following this guide: https://github.com/openstack/openstack-ansible/blob/master/doc/source/upgrade-guide/manual-upgrade.rst the setup-infrastructure playbook failed:

Task: Enable rabbitmq mirroring

msg: Error:****@infra01-rabbit-mq-container-9c7a01e2'
- home dir: /var/lib/rabbitmq
- cookie hash: H/vV6HZ+7i2GW1ok5mU3sg==
failed: [infra02_rabbit_mq_container-a2122727] => {"cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_policies -p /", "failed": true, "rc": 69}
stderr: Error: unable to connect to node 'rabbit@infra02-rabbit-mq-container-a2122727': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@infra02-rabbit-mq-container-a2122727']

rabbit@infra02-rabbit-mq-container-a2122727:
  * connected to epmd (port 4369) on infra02-rabbit-mq-container-a2122727
  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

current node details:
- node name: 'rabbitmq-cli-94@infra02-rabbit-mq-container-a2122727'
- home dir: /var/lib/rabbitmq
- cookie hash: H/vV6HZ+7i2GW1ok5mU3sg==
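
For cross-checking cookies between nodes: assuming the "cookie hash" printed above is the base64-encoded MD5 of the .erlang.cookie file contents, it can be recomputed offline. A minimal sketch (the cookie value below is hypothetical; the real one lives in /var/lib/rabbitmq/.erlang.cookie on each node):

```python
import base64
import hashlib

def cookie_hash(cookie: str) -> str:
    """Base64-encoded MD5 digest of the cookie text -- assumed here to be
    the "cookie hash" format that rabbitmqctl prints in its diagnostics."""
    digest = hashlib.md5(cookie.encode()).digest()
    return base64.b64encode(digest).decode()

# Hypothetical cookie value for illustration only.
print(cookie_hash("EXAMPLECOOKIE"))
```

If the recomputed hash differs between nodes, the "is the cookie set correctly?" suggestion applies; in this bug the hashes shown are identical, which points at the hostname mismatch instead.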

Changed in openstack-ansible:
assignee: nobody → Jean-Philippe Evrard (jean-philippe-evrard)
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

It would appear, from the container name change in the log, that this was an upgrade from Liberty to Mitaka. You should use the appropriate *published* upgrade guide for Mitaka rather than the one in the master branch. The Mitaka upgrade documentation is published here: http://docs.openstack.org/developer/openstack-ansible/mitaka/upgrade-guide/index.html

Secondly, there are patches in progress to deal with name changes. Please check https://review.openstack.org/#/q/branch:stable/mitaka+project:%255Eopenstack/openstack-ansible.*+status:open to see their status.

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

Related patches:

Master: https://review.openstack.org/323033 / https://review.openstack.org/323504
Mitaka: https://review.openstack.org/312274

The Master patches would likely be backported once merged.

Changed in openstack-ansible:
assignee: Jean-Philippe Evrard (jean-philippe-evrard) → Kevin Carter (kevin-carter)
status: New → In Progress
Changed in openstack-ansible:
assignee: Kevin Carter (kevin-carter) → Amy Marrich (amy-marrich)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/323033
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=1d290828b94e0d34ac3e3cdd3a76c0f3fa0408cf
Submitter: Jenkins
Branch: master

commit 1d290828b94e0d34ac3e3cdd3a76c0f3fa0408cf
Author: Kevin Carter <email address hidden>
Date: Mon May 30 17:12:13 2016 -0500

    RFC1034/5 hostname upgrade

    The changes created here allow upgrades to take place
    without impacting cluster availability in cases where a
    service may be dependent on non-compliant hostname(s).

    Upgrade playbook has been added for ensuring hostname aliases
    are correctly created. Specific entries for nova, heat, cinder,
    neutron, galera and rabbitmq have been added to ensure all
    nodes are able to contact all other nodes using a potentially
    non-compliant hostname entry.

    To make setting the domain name easy across the cluster a new
    global variable has been created ``openstack_domain``. This
    variable has a default value of "openstack.local".

    Because the initial release of Mitaka (13.0.0) did not contain
    the RFC1034/5 updates, these changes are needed to guarantee
    clusters deployed on our initial release are upgradable to
    Newton (14.0.0).

    Partial-Bug: #1577245
    Partial-Bug: #1586148
    Related-Change-Id: Ib1e3b6f02758906e3ec7ab35737c1a58fcbca216
    Change-Id: I6901409c1dc5ac8ff4f0af988132b5ac71f6379e
    Signed-off-by: Kevin Carter <email address hidden>
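
The ``openstack_domain`` variable described in the commit can be overridden before running the upgrade. A minimal sketch of a ``user_variables.yml`` entry, with an example value rather than anything taken from this bug:

```yaml
# /etc/openstack_deploy/user_variables.yml (excerpt)
# Override the default "openstack.local" domain used when creating
# the RFC1034/5-compliant hostname aliases. Example value only.
openstack_domain: example.internal
```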

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326657

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/mitaka)

Reviewed: https://review.openstack.org/326657
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=1cd08fa3905d6654d257b1c094a70991501ca997
Submitter: Jenkins
Branch: stable/mitaka

commit 1cd08fa3905d6654d257b1c094a70991501ca997
Author: Kevin Carter <email address hidden>
Date: Mon May 30 17:12:13 2016 -0500

    RFC1034/5 hostname upgrade

    The changes created here allow upgrades to take place
    without impacting cluster availability in cases where a
    service may be dependent on non-compliant hostname(s).

    Upgrade playbook has been added for ensuring hostname aliases
    are correctly created. Specific entries for nova, heat, cinder,
    neutron, galera and rabbitmq have been added to ensure all
    nodes are able to contact all other nodes using a potentially
    non-compliant hostname entry.

    To make setting the domain name easy across the cluster a new
    global variable has been created ``openstack_domain``. This
    variable has a default value of "openstack.local".

    Because the initial release of Mitaka (13.0.0) did not contain
    the RFC1034/5 updates, these changes are needed to guarantee
    clusters deployed on our initial release are upgradable to
    Newton (14.0.0).

    Partial-Bug: #1577245
    Partial-Bug: #1586148
    Related-Change-Id: Ib1e3b6f02758906e3ec7ab35737c1a58fcbca216
    Change-Id: I6901409c1dc5ac8ff4f0af988132b5ac71f6379e
    Signed-off-by: Kevin Carter <email address hidden>
    (cherry picked from commit 1d290828b94e0d34ac3e3cdd3a76c0f3fa0408cf)

tags: added: in-stable-mitaka
Changed in openstack-ansible:
assignee: Amy Marrich (amy-marrich) → Jean-Philippe Evrard (jean-philippe-evrard)
Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

I'd be happy to know which version you are upgrading from and to.

Recently this patch merged: https://review.openstack.org/#/c/329628/; it may be helpful for you.

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

When attempting to upgrade from Liberty to Mitaka, the following occurs during the "Upgrade infrastructure" portion - http://docs.openstack.org/developer/openstack-ansible/mitaka/upgrade-guide/manual-upgrade.html#upgrade-infrastructure

TASK: [rabbitmq_server | Enable queue mirroring] ******************************
failed: [infra01_rabbit_mq_container-c77d985b] => {"cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_policies -p /", "failed": true, "rc": 69}
stderr: Error: unable to connect to node 'rabbit@infra01-rabbit-mq-container-c77d985b': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@infra01-rabbit-mq-container-c77d985b']

rabbit@infra01-rabbit-mq-container-c77d985b:
  * connected to epmd (port 4369) on infra01-rabbit-mq-container-c77d985b
  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

current node details:
- node name: 'rabbitmq-cli-01@infra01-rabbit-mq-container-c77d985b'
- home dir: /var/lib/rabbitmq
- cookie hash: Wew1JBUuoh6lb/vRK+M1xg==

msg: Error:********@infra01-rabbit-mq-container-c77d985b'
- home dir: /var/lib/rabbitmq
- cookie hash: Wew1JBUuoh6lb/vRK+M1xg==
failed: [infra03_rabbit_mq_container-ba2a7c82] => {"cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_policies -p /", "failed": true, "rc": 69}
stderr: Error: unable to connect to node 'rabbit@infra03-rabbit-mq-container-ba2a7c82': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@infra03-rabbit-mq-container-ba2a7c82']

rabbit@infra03-rabbit-mq-container-ba2a7c82:
  * connected to epmd (port 4369) on infra03-rabbit-mq-container-ba2a7c82
  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

current node details:
- node name: 'rabbitmq-cli-38@infra03-rabbit-mq-container-ba2a7c82'
- home dir: /var/lib/rabbitmq
- cookie hash: Wew1JBUuoh6lb/vRK+M1xg==

msg: Error:********@infra03-rabbit-mq-container-ba2a7c82'
- home dir: /var/lib/rabbitmq
- cookie hash: Wew1JBUuoh6lb/vRK+M1xg==
failed: [infra02_rabbit_mq_container-0c14a895] => {"cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_policies -p /", "failed": true, "rc": 69}
stderr: Error: unable to connect to node 'rabbit@infra02-rabbit-mq-container-0c14a895': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@infra02-rabbit-mq-container-0c14a895']

rabbit@infra02-rabbit-mq-container-0c14a895:
  * connected to epmd (port 4369) on infra02-rabbit-mq-container-0c14a895
  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

current node details:
- node name: 'rabbitmq-cli-04@infra02-rabbit-mq-container-0c14a895'
- home dir: /var/lib/rabbitmq
- cookie hash: Wew1JBUuoh6lb/vRK+M1xg==

msg: Error:********@infra02-rabbit...


Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

Additionally, this is an OpenStack Liberty install without any modifications (no VMs, Neutron networks, SSH keypairs, etc.). Install Liberty, then upgrade to Mitaka only.

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

Are you sure all the playbooks have run?
You should have cleared the cached facts, applied old-hostname-compatibility, restarted rabbitmq with restart-rabbitmq-containers, and only then applied setup-infrastructure.

Could you confirm?

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

Yes, I missed the step to restart rabbitmq. Not sure why/how I skipped it, but looking at my shell history that appears to have been the case. Will try again.

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

root@deployment:/opt/openstack-ansible/playbooks# time openstack-ansible "${UPGRADE_PLAYBOOKS}/restart-rabbitmq-containers.yml" -vvvv|tee -a /root/upgrade3.txt
Variable files: "-e @/etc/openstack_deploy/user_secrets.yml -e @/etc/openstack_deploy/user_variables.yml "

PLAY [Restart Rabbitmq containers] ********************************************

TASK: [Restart node] **********************************************************
<172.22.102.79> ESTABLISH CONNECTION FOR USER: root
<172.22.102.79> REMOTE_MODULE command reboot #USE_SHELL
<172.22.102.79> EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=120 172.22.102.79 /bin/sh -c 'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python'
changed: [infra01_rabbit_mq_container-6e43f4d0] => {"changed": true, "cmd": "reboot", "delta": "0:00:00.004469", "end": "2016-06-23 16:19:42.360138", "rc": 0, "start": "2016-06-23 16:19:42.355669", "stderr": "", "stdout": "", "warnings": []}

TASK: [Wait for Rabbitmq aliveness] *******************************************
<localhost> REMOTE_MODULE wait_for host=infra01_rabbit_mq_container-6e43f4d0
<localhost> EXEC ['/bin/sh', '-c', 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1466716783.37-272638957794754 && echo $HOME/.ansible/tmp/ansible-tmp-1466716783.37-272638957794754']
<localhost> PUT /tmp/tmpYIGzDT TO /root/.ansible/tmp/ansible-tmp-1466716783.37-272638957794754/wait_for
<localhost> EXEC ['/bin/sh', '-c', u'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1466716783.37-272638957794754/wait_for; rm -rf /root/.ansible/tmp/ansible-tmp-1466716783.37-272638957794754/ >/dev/null 2>&1']
failed: [infra01_rabbit_mq_container-6e43f4d0 -> localhost] => {"elapsed": 300, "failed": true}
msg: Timeout when waiting for infra01_rabbit_mq_container-6e43f4d0:5672

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/restart-rabbitmq-containers.retry

infra01_rabbit_mq_container-6e43f4d0 : ok=1 changed=1 unreachable=0 failed=1

real 5m2.302s
user 0m0.608s
sys 0m0.140s

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

root@deployment:/opt/openstack-ansible/playbooks# ansible -i inventory rabbitmq_all -m wait_for -a 'port=5672 delay=5 host="{{ inventory_hostname }}"'
infra03_rabbit_mq_container-202e9962 | success >> {
    "changed": false,
    "elapsed": 5,
    "path": null,
    "port": 5672,
    "search_regex": null,
    "state": "started"
}

infra02_rabbit_mq_container-c0155b46 | success >> {
    "changed": false,
    "elapsed": 5,
    "path": null,
    "port": 5672,
    "search_regex": null,
    "state": "started"
}

infra01_rabbit_mq_container-6e43f4d0 | success >> {
    "changed": false,
    "elapsed": 5,
    "path": null,
    "port": 5672,
    "search_regex": null,
    "state": "started"
}

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

The deployment node is not infra01 and does not have any of the containers in its /etc/hosts file, just the nodes it is deploying OpenStack on. Adding the rabbit containers to the deployment node's /etc/hosts resolved this. I am confident - though not entirely sure - that the issue is with the line

delegate_to: localhost

in /opt/openstack-ansible/scripts/upgrade-utilities/playbooks/restart-rabbitmq-containers.yml
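
The failure mode described here is a plain name-resolution miss on the delegate host: wait_for implicitly depends on the delegate being able to resolve the target hostname. A minimal sketch of that check (the function name is mine, not Ansible's):

```python
import socket

def resolvable(name: str) -> bool:
    """True if the local resolver (/etc/hosts or DNS) can map name
    to an address -- the lookup that wait_for's host parameter needs."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

# Resolves on any host; a container name missing from the deployment
# node's /etc/hosts would return False and wait_for would time out.
print(resolvable("localhost"))
```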

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

The delegate_to is fine, the host isn't. I'll adapt this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/333813
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=fd272928c57c8b2dcb73148997e848a1bfe4e23e
Submitter: Jenkins
Branch: stable/mitaka

commit fd272928c57c8b2dcb73148997e848a1bfe4e23e
Author: Jean-Philippe Evrard <email address hidden>
Date: Fri Jun 24 08:32:38 2016 +0000

    Fix rabbitmq restart on non-rabbitmq accessible nodes

    During the upgrade playbook, rabbitmq is restarted. The deploy
    node will try to contact rabbitmq directly, based on its
    inventory hostname. If the inventory hostname is unknown to the
    deployment node, the playbook will fail.

    The rabbitmq playbook now delegates the rabbitmq check to the
    physical host having the rabbitmq node, on the IP used for
    SSH'ing, instead of the hostname.

    Closes-Bug: 1586148
    Change-Id: I6b267a1fca8e894142d2a0a5de69d5ee3d333875

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

Sorry to be so late on this; I was not able to get to testing it earlier, but am doing so now. Working on the lab environment build, and then I will run through the upgrade documentation once again.

Revision history for this message
Melvin Hillsman (mrhillsman) wrote :

Confirmed successful when the deployment node is not an infrastructure node:

root@melv7301-rpcops-lab:~/openstack-ansible/playbooks# cat ${UPGRADE_PLAYBOOKS}/restart-rabbitmq-containers.yml
---
# Copyright 2016, Rackspace US, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

- name: Restart Rabbitmq containers
  hosts: rabbitmq_all
  serial: 1
  tasks:
    - name: Restart node
      shell: reboot
    - name: Wait for Rabbitmq aliveness
      wait_for:
        port: 5672
        host: "{{ ansible_ssh_host }}"
        delay: 5
      delegate_to: "{{ physical_host }}"

root@melv7301-rpcops-lab:~/openstack-ansible/playbooks# openstack-ansible "${UPGRADE_PLAYBOOKS}/restart-rabbitmq-containers.yml"
Variable files: "-e @/etc/openstack_deploy/user_secrets.yml -e @/etc/openstack_deploy/user_variables.yml "

PLAY [Restart Rabbitmq containers] ********************************************

TASK: [Restart node] **********************************************************
changed: [infra01_rabbit_mq_container-58da3815]

TASK: [Wait for Rabbitmq aliveness] *******************************************
ok: [infra01_rabbit_mq_container-58da3815 -> infra01]

TASK: [Restart node] **********************************************************
changed: [infra03_rabbit_mq_container-08f31030]

TASK: [Wait for Rabbitmq aliveness] *******************************************
ok: [infra03_rabbit_mq_container-08f31030 -> infra03]

TASK: [Restart node] **********************************************************
changed: [infra02_rabbit_mq_container-2e969d55]

TASK: [Wait for Rabbitmq aliveness] *******************************************
ok: [infra02_rabbit_mq_container-2e969d55 -> infra02]

PLAY RECAP ********************************************************************
infra01_rabbit_mq_container-58da3815 : ok=2 changed=1 unreachable=0 failed=0
infra02_rabbit_mq_container-2e969d55 : ok=2 changed=1 unreachable=0 failed=0
infra03_rabbit_mq_container-08f31030 : ok=2 changed=1 unreachable=0 failed=0

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 13.1.4

This issue was fixed in the openstack/openstack-ansible 13.1.4 release.
