different services failed to load after reboot of undercloud
Bug #1612789 reported by
James Slagle
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Medium | Emilien Macchi |
Bug Description
Occasionally after an undercloud reboot, some services will fail to load:
[stack@instack ~]$ sudo systemctl|grep -i fail
● neutron-
● ovirt-guest-
● postfix.service loaded failed failed Postfix Mail Transport Agent
● rabbitmq-
Manually starting rabbitmq and neutron-server works.
Workaround:
systemctl start rabbitmq-
systemctl start neutron-
nova list
Changed in tripleo:
assignee: James Slagle (james-slagle) → Emilien Macchi (emilienm)
see also: https://bugzilla.redhat.com/show_bug.cgi?id=1348700
Comment 34 John Eckersberg 2016-08-09 11:12:31 EDT
So here's what I think is wrong.
We've got...
[root@undercloud ~]# grep After /usr/lib/systemd/system/rabbitmq-server.service
After=network.target epmd@0.0.0.0.socket
After=network.target is pretty standard. All sorts of stuff uses it, openstack services and core services alike:
[root@undercloud ~]# grep -l 'After=.*network.target' /usr/lib/systemd/system/*.service | wc -l
128
So what's the deal? Relevant excerpts from https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :
"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. It's primary purpose is for ordering things properly at shutdown: since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is order After=network.target can be sure that it is stopped before the network is shut down if the system is powered off."
OK, not what I would expect.
"network-online.target is a target that actively waits until the nework is "up", where the definition of "up" is defined by the network management software. Usually it indicates a configured, routable IP address of some kind. [...] It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), it's primary purpose is network client software that cannot operate without network."
OK, so in theory we should not need that for our servers. But in practice, I think we do. Here's why. We explicitly bind our services to non-loopback/non-wildcard addresses:
[root@undercloud ~]# grep '.*listen.*192' /etc/nova/nova.conf
ec2_listen=192.0.2.1
osapi_compute_listen=192.0.2.1
metadata_listen=192.0.2.1
osapi_volume_listen=192.0.2.1
[root@undercloud ~]# grep 192 /etc/rabbitmq/rabbitmq-env.conf
NODE_IP_ADDRESS=192.0.2.1
etc.
And most importantly, on the undercloud, we do *not* set ip_nonlocal_bind:
[root@undercloud ~]# cat /proc/sys/net/ipv4/ip_nonlocal_bind
0
Whereas we *do* set that option on the overcloud so haproxy can bind to the VIP addresses even when the host does not currently have the VIP.
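For anyone who wants to script that check rather than cat the file by hand, here is a minimal sketch. The helper name nonlocal_bind_enabled is made up for illustration; only the /proc path comes from the command above.

```python
from pathlib import Path

# Path taken from the `cat` command above; Linux only.
NONLOCAL_BIND = Path("/proc/sys/net/ipv4/ip_nonlocal_bind")


def nonlocal_bind_enabled(path: Path = NONLOCAL_BIND) -> bool:
    """Return True if the sysctl file at `path` holds a non-zero integer.

    `nonlocal_bind_enabled` is a hypothetical helper, not part of any tool
    mentioned in this bug.
    """
    return int(path.read_text().strip()) != 0


if __name__ == "__main__":
    # On the undercloud described here this would be expected to print False.
    print(nonlocal_bind_enabled())
```

The path is a parameter so the function can be pointed at a copy of the file when testing on a machine without that sysctl.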
So the race that happens on the undercloud is roughly:
(1) systemd reaches network.target, interface not yet configured with address
(2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit fails
(3) NetworkManager configures address on interface
Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.
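Step (2) can be reproduced outside of rabbitmq with a plain socket. This is only a sketch: try_bind is a hypothetical helper, and 192.0.2.1 is simply the address from the config excerpts above. On a host where that address is not configured and ip_nonlocal_bind is 0, I would expect the second call to report EADDRNOTAVAIL, but that depends on the host it runs on.

```python
import errno
import socket
from typing import Optional


def try_bind(address: str, port: int = 0) -> Optional[str]:
    """Try to bind a TCP socket to (address, port).

    Returns None on success, or the symbolic errno name (for example
    "EADDRNOTAVAIL") if the bind fails. Port 0 lets the kernel pick a
    free port, so only the address matters.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    # Loopback is always configured, so this bind should succeed.
    print("127.0.0.1:", try_bind("127.0.0.1"))
    # Address the undercloud services listen on, per the excerpts above.
    print("192.0.2.1:", try_bind("192.0.2.1"))
```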
So, how to fix?
Option 1 is to set ip_nonlocal_bind on the undercloud.
Option 2 from the systemd docs is to enable NetworkManager-wait-online.service, which causes network.target to behave the same as network-online.target. This will ensure that the interfaces are up and addresses configured before starting the rest of the services.
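To make concrete what "addresses configured before starting" buys a service, here is a rough sketch of the same idea done by hand: poll until the listen address can be bound, then carry on. This is neither of the two options above and is not how NetworkManager-wait-online.service works internally; wait_until_bindable and its timeout and interval defaults are assumptions for illustration only.

```python
import socket
import time


def wait_until_bindable(address: str, timeout: float = 30.0,
                        interval: float = 0.5) -> bool:
    """Poll until a TCP socket can be bound to `address`, or give up.

    Returns True as soon as a bind succeeds, False once `timeout`
    seconds have passed without success. Hypothetical helper.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                # Port 0: any free port, we only care about the address.
                sock.bind((address, 0))
            return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # 192.0.2.1 is the undercloud address from the config excerpts above.
    print(wait_until_bindable("192.0.2.1", timeout=5.0))
```

Doing the wait once at the systemd level, as Option 2 does, avoids every service needing its own loop like this.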
I don't really have an opinio...