Kolla-ansible deploy fails at rabbitmq

Bug #1855935 reported by Alex Jackson
This bug affects 6 people

Affects: kolla-ansible
Status: Invalid
Importance: Low
Assigned to: Radosław Piliszek

Bug Description

OS: Ubuntu 16.04 LTS
kernel: 4.4.0-170-generic
docker version: 19.03.2
Kolla-ansible branch: stein and train
docker install image: source
type of install: all-in-one

Kolla-ansible deploy fails at the rabbitmq role because the rabbitmq container cannot start.

Expected: a clean install.

Reproduction steps: install using stein or train

possible other contributing factors: network interfaces named eno1 and eno2

Willing to answer other questions about machine configuration, etc.

The error given by ansible:

RUNNING HANDLER [rabbitmq : Waiting for rabbitmq to start on first node] ********************************************* fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker exec rabbitmq rabbitmqctl wait /var/lib/rabbitmq/mnesia/rabbitmq.pid", "delta": "0:00:00.380240", "end": "2019-09-16 10:40:17.794725", "msg": "non-zero return code", "rc": 126, "start": "2019-09-16 10:40:17.414485", "stderr": "", "stderr_lines": [], "stdout": "cannot exec in a stopped state: unknown", "stdout_lines": ["cannot exec in a stopped state: unknown"]}

The error in the docker container:

ERROR: epmd error for host openStack: address (cannot connect to host/port)

Work-around:

Comment out

export ERL_EPMD_ADDRESS={{ api_interface_address }}

and replace with

export ERL_EPMD_ADDRESS="[ip address of neutron network_interface]"
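For example, if the neutron network_interface held the (hypothetical) address 10.0.0.5, the replacement line in rabbitmq-env.conf.j2 would read:

```shell
# 10.0.0.5 is a placeholder; use the address actually bound to your interface
export ERL_EPMD_ADDRESS="10.0.0.5"
```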

see https://ask.openstack.org/en/question/124223/kolla-ansible-deploy-fail-for-rabbitmq/ as well, since someone else has had this problem

Alex Jackson (xelaot)
tags: added: kolla-ansible rabbitmq
tags: added: stein train
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

You are most likely hit by: https://bugs.launchpad.net/kolla-ansible/+bug/1853578

Could you verify that?

I don't understand what you mean by [ip address of neutron network_interface], neutron external interface has generally no address set.

As a side note, please upgrade the host to avoid other issues due to old kernel, the releases you mentioned are tested against bionic, not xenial (the release in images as well).

Changed in kolla-ansible:
status: New → Incomplete
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Also:

is this MAAS?

Did you run prechecks and did they pass?

Could you include the output of:
cat /etc/hosts

getent hosts $(hostname)
getent hosts $(hostname -s)
getent hosts $(getent hosts $(hostname))
getent hosts $(getent hosts $(hostname -s))

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

(output on deployed node, not deployer)

Revision history for this message
Taisto Qvist (theque42) wrote :

I think I might be the second "someone else also has this problem", where I got the question if I was using MAAS. (in the ask-question)
The answer is no. This is a private setup to a bunch of preconfigured servers with already working hosts files.

The hosts file starts off as:
-----
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.103.100 ctrl1.lab3.stack ctrl1
172.16.103.100 controller1.lab3.stack controller1
172.16.103.109 ctrl2.lab3.stack ctrl2
172.16.103.109 controller2.lab3.stack controller2
172.16.103.101 compute1.lab3.stack compute1
172.16.103.102 compute2.lab3.stack compute2
172.16.103.103 compute3.lab3.stack compute3
172.16.103.104 neutron1.lab3.stack neutron1
172.16.103.105 neutron2.lab3.stack neutron2
172.16.103.106 storage1.lab3.stack storage1
172.16.103.107 storage2.lab3.stack storage2
172.16.103.108 storage3.lab3.stack storage3
172.16.103.111 int.lab3.stack int haint lbi
10.10.103.111 ext.lab3.stack ext haext lbx
-----
but after deploy it becomes:
-----
127.0.0.1 localhost
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.103.100 ctrl1.lab3.stack ctrl1
172.16.103.100 controller1.lab3.stack controller1
172.16.103.109 ctrl2.lab3.stack ctrl2
172.16.103.109 controller2.lab3.stack controller2
172.16.103.101 compute1.lab3.stack compute1
172.16.103.102 compute2.lab3.stack compute2
172.16.103.103 compute3.lab3.stack compute3
172.16.103.104 neutron1.lab3.stack neutron1
172.16.103.105 neutron2.lab3.stack neutron2
172.16.103.106 storage1.lab3.stack storage1
172.16.103.107 storage2.lab3.stack storage2
172.16.103.108 storage3.lab3.stack storage3
# BEGIN ANSIBLE GENERATED HOSTS
172.16.103.100 ctrl1.lab3.stack ctrl1
172.16.103.109 ctrl2.lab3.stack ctrl2
172.16.103.104 neutron1.lab3.stack neutron1
172.16.103.105 neutron2.lab3.stack neutron2
172.16.103.101 compute1.lab3.stack compute1
172.16.103.102 compute2.lab3.stack compute2
172.16.103.106 storage1.lab3.stack storage1
172.16.103.107 storage2.lab3.stack storage2
# END ANSIBLE GENERATED HOSTS
-----

I've come to realize that it's stupid/wrong/etc. to have multiple lines with the same IP in hosts, but since kolla simply appends, I'd get the same error even if I cleaned up that issue.
(I will though, and see if that helps)

Revision history for this message
Alex Jackson (xelaot) wrote :

"Could you verify that?"

Just like above, I am in a private environment, and I am also running an all-in-one deployment, so I have not touched /etc/hosts.

Either way, here's the /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack

# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
# BEGIN ANSIBLE GENERATED HOSTS
10.111.203.5 openStack
# END ANSIBLE GENERATED HOSTS

"I don't understand what you mean by [ip address of neutron network_interface]"

In the globals.yml file for kolla-ansible, there's a section to set the name of the network interface used for API services. This interface must have an IP address.

Here's the globals.yml explanation followed by its configuration:

# This interface is what all your api services will be bound to by default.
# Additionally, all vxlan/tunnel and storage network traffic will go over this
# interface by default. This interface must contain an IP address.
# It is possible for hosts to have non-matching names of interfaces - these can
# be set in an inventory file per host or per group or stored separately, see
# http://docs.ansible.com/ansible/intro_inventory.html
# Yet another way to workaround the naming problem is to create a bond for the
# interface on all hosts and give the bond name here. Similar strategy can be
# followed for other types of interfaces.
network_interface: "eno1"

This interface has the IP address 10.111.203.5, which I substitute in for ERL_EPMD_ADDRESS in ./kolla-ansible/ansible/roles/rabbitmq/templates/rabbitmq-env.conf.j2 to hardcode the value, since Jinja or Ansible is not substituting the address in the rabbitmq Docker image.

"is this MAAS?"

Nope, just a single machine

output of those various commands:

getent hosts $(hostname)
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack.wsfdindl.metronetinc.net openStack openStack

getent hosts $(hostname -s)
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack.wsfdindl.metronetinc.net openStack openStack

getent hosts $(getent hosts $(hostname))
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack.wsfdindl.metronetinc.net openStack openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack.wsfdindl.metronetinc.net openStack openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack openStack
10.111.203.5 openStack.wsfdindl.metronetinc.net openStack openStack

getent hosts $(getent hosts $(hostname -s))
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
127.0.1.1...


Revision history for this message
Taisto Qvist (theque42) wrote :

I tried cleaning out the bad, duplicate line in /etc/hosts (with the spelled-out controller name), and that didn't help, maybe because I had duplicates anyway thanks to kolla-ansible simply appending to the file.

I didn't mention it, but I hope it was obvious from my hosts file that I am running a multinode install, and this hosts file has worked in mitaka and rocky... I don't remember if I had time to try stein.

I lost the output from my getent-calls, since I restarted the cloud deploy, but I did notice that similar to above, I also got multiple entries for the same address. In my case ctrl1.lab1.stack.

Changed in kolla-ansible:
status: Incomplete → New
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Well, it might be the case that you are running into two different issues...
The bad news is that I tried to reproduce both and could not. :-)

In the xelaot's case, the breaking line is:
  127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
and we remove that since https://review.opendev.org/685233
(also backported to stein).
I don't see how hardcoding would help since both ways it templates out to the very same file...
Also, if api_interface_address failed to template out, there would be total havoc, not only in rmq.

theque42 - exact duplicates change nothing because then all programs tend to behave the same irrespective of the resolver being used (unless some very nasty one, but doubt it would be erlang).
Since your issue is a different one, could you create a separate bug report?
Might be easier to coordinate this way.

xelaot, please try using latest stein commit and see whether the bad line stays in the file, and what is the result of "Ensure hostname does not point to loopback in /etc/hosts" task (it is run by the bootstrap).

Revision history for this message
Taisto Qvist (theque42) wrote :

I'll try to create a new case later today, but I just want to let you know that I reran today, with the following setup, and hit the same problem.

[root@ctrl1 ~(admin)]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

172.16.102.100 ctrl1.lab2.stack controller1 ctrl1
172.16.102.109 ctrl2.lab2.stack controller2 ctrl2
172.16.102.101 compute1.lab2.stack compute1
172.16.102.102 compute2.lab2.stack compute2
172.16.102.103 compute3.lab2.stack compute3
172.16.102.104 neutron1.lab2.stack neutron1
172.16.102.105 neutron2.lab2.stack neutron2
172.16.102.106 storage1.lab2.stack storage1
172.16.102.107 storage2.lab2.stack storage2
172.16.102.108 storage3.lab2.stack storage3
172.16.102.111 int.lab2.stack int haint lbi
10.10.102.111 ext.lab2.stack ext haext lbx

[lab2]:admin@admin
[root@ctrl1 ~(admin)]# getent hosts $(hostname)
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1

[lab2]:admin@admin
[root@ctrl1 ~(admin)]# getent hosts $(hostname -s)
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1

[lab2]:admin@admin
[root@ctrl1 ~(admin)]# getent hosts $(getent hosts $(hostname))
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1

[lab2]:admin@admin
[root@ctrl1 ~(admin)]# getent hosts $(getent hosts $(hostname -s))^C

[lab2]:admin@admin
[root@ctrl1 ~(admin)]# getent hosts $(hostname)
172.16.102.100 ctrl1.lab2.stack controller1 ctrl1

[root@ctrl1 ~(admin)]# docker logs rabbitmq 2>&1| tail -10
++ [[ -n '' ]]
++ [[ ! -d /var/log/kolla/rabbitmq ]]
+++ stat -c %a /var/log/kolla/rabbitmq
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/rabbitmq
Running command: '/usr/sbin/rabbitmq-server'
+ echo 'Running command: '\''/usr/sbin/rabbitmq-server'\'''
+ exec /usr/sbin/rabbitmq-server
econnrefused

Revision history for this message
Taisto Qvist (theque42) wrote :

Created: #1856281

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Reviving the original issue:

In the xelaot's case, the breaking line is:
  127.0.1.1 openStack.wsfdindl.metronetinc.net openStack
and we remove that since https://review.opendev.org/685233
(also backported to stein).
I don't see how hardcoding would help since both ways it templates out to the very same file...
Also, if api_interface_address failed to template out, there would be total havoc, not only in rmq.

xelaot, please try using latest stein commit and see whether the bad line stays in the file, and what is the result of "Ensure hostname does not point to loopback in /etc/hosts" task (it is run by the bootstrap).

Changed in kolla-ansible:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for kolla-ansible because there has been no activity for 60 days.]

Changed in kolla-ansible:
status: Incomplete → Expired
Revision history for this message
Magnus Lööf (magnus-loof) wrote :

This seems to be related to the fact that when deploying RabbitMQ, it needs the following sysctls:

```
  - net.ipv4.ip_nonlocal_bind: 1
  - net.ipv6.ip_nonlocal_bind: 1
```

Those are set by the HAProxy deployment, but if you have separated control and network nodes you run into this problem.
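A minimal sketch of applying these, assuming root on the affected node (the /etc/sysctl.d filename is arbitrary):

```shell
# Apply immediately (lost on reboot)
sysctl -w net.ipv4.ip_nonlocal_bind=1
sysctl -w net.ipv6.ip_nonlocal_bind=1

# Persist across reboots
cat > /etc/sysctl.d/90-nonlocal-bind.conf <<'EOF'
net.ipv4.ip_nonlocal_bind = 1
net.ipv6.ip_nonlocal_bind = 1
EOF
sysctl --system
```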

Revision history for this message
Lianhao Lu (lianhao-lu) wrote :

@Magnus Lööf,

Thanks, your workaround works. After applying those sysctls, it works.

I met the same issue with kolla-ansible 9.0.1 with kolla/ubuntu-source-rabbitmq:train keeps restarting in my multinode environment. I've checked that the /etc/hosts are all clean.

Changed in kolla-ansible:
status: Expired → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/716207

Changed in kolla-ansible:
assignee: nobody → Magnus Lööf (magnus-loof)
status: Confirmed → In Progress
Revision history for this message
Magnus Lööf (magnus-loof) wrote :

OK so here is some testing:

With `# export ERL_EPMD_ADDRESS=10.99.30.13` in `rabbitmq-env.conf`:

```
 sudo ss -tlnp | grep 4369
LISTEN 0 128 *:4369 *:* users:(("epmd",pid=12391,fd=3))
LISTEN 0 128 :::4369 :::* users:(("epmd",pid=12391,fd=4))
```

With `export ERL_EPMD_ADDRESS=10.99.30.13` in `rabbitmq-env.conf`:

```
sudo docker logs rabbitmq
...
+ echo 'Running command: '\''/usr/sbin/rabbitmq-server'\'''
+ exec /usr/sbin/rabbitmq-server
econnrefused
```

With `sysctl net.ipv6.ip_nonlocal_bind=1` and `sysctl net.ipv4.ip_nonlocal_bind=1`

```
sudo ss -tlnp | grep 4369
LISTEN 0 128 10.99.30.13:4369 *:* users:(("epmd",pid=15886,fd=5))
LISTEN 0 128 127.0.0.1:4369 *:* users:(("epmd",pid=15886,fd=3))
LISTEN 0 128 ::1:4369 :::* users:(("epmd",pid=15886,fd=4))
```

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Does it need both sysctls?

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

I did not try with only ipv6, but it did not work with only ipv4.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Please do try it with IPv6 only.

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

It was only `sysctl net.ipv6.ip_nonlocal_bind=1` that was required.

Checking some more, I found that

```
- net.ipv6.conf.all.disable_ipv6: 1
- net.ipv6.conf.default.disable_ipv6: 1
```

were also set. Setting those to `0` made it work without either `nonlocal_bind` setting.
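To check for and undo this on a live host, something like the following should work (sketch; requires root, and persist via /etc/sysctl.d/ if it helps):

```shell
# A value of 1 means IPv6 is disabled for that scope
sysctl net.ipv6.conf.all.disable_ipv6
sysctl net.ipv6.conf.default.disable_ipv6

# Re-enable IPv6 at runtime
sysctl -w net.ipv6.conf.all.disable_ipv6=0
sysctl -w net.ipv6.conf.default.disable_ipv6=0
```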

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Ah, splendid. Just as discussed in the bug now marked as a duplicate.

I was wondering there whether rmq tries to bind to the addresses that 'localhost' resolves to.
Could you make sure localhost points only to 127.0.0.1 in /etc/hosts and retry the failure scenario?

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Every 2,0s: cat /etc/hosts Thu Apr 2 17:51:17 2020

127.0.0.1 localhost
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

# BEGIN ANSIBLE GENERATED HOSTS
10.99.30.7 n-25484-bpc1 n-25484-bpc1.vms.basalt.se
10.99.30.6 n-25484-bpc2 n-25484-bpc2.vms.basalt.se
10.99.30.16 n-25484-bpc3 n-25484-bpc3.vms.basalt.se
10.99.30.5 n-25484-bpc4 n-25484-bpc4.vms.basalt.se
# END ANSIBLE GENERATED HOSTS

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Thanks, now try removing that ::1 line

Changed in kolla-ansible:
assignee: Magnus Lööf (magnus-loof) → Radosław Piliszek (yoctozepto)
importance: Undecided → Low
Revision history for this message
Mark Goddard (mgoddard) wrote :

EPMD docs say it will listen on localhost and addresses specified: https://erlang.org/doc/man/epmd.html

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

OK, I validated that modifying /etc/hosts is of no help here. It seems that one should generally never disable IPv6 on the loopback (lo), as many programs may depend on its presence whenever IPv6 is enabled. If you want to disable IPv6, please do it via ipv6.disable=1 on the kernel cmdline, so that the IPv6 address family is never registered and all IPv6-enabled software knows not to try working with IPv6 sockets (rmq included).
What kolla-ansible can do is to add a precheck that validates whether this is really the case.
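If you really do want IPv6 off, the kernel-cmdline route suggested above would look roughly like this on a GRUB-based host (sketch; the grub regeneration command varies by distro):

```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX="ipv6.disable=1"
```

Then regenerate the grub config (update-grub on Ubuntu, grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL/CentOS) and reboot.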

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Invalidating because the issue was a broken IPv6 stack. Please don't break your IPv6 or bad things may happen (TM).

Changed in kolla-ansible:
status: In Progress → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by "Magnus Lööf <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/716207
Reason: Bug invalid
