Bug #1612789 “different services failed to load after reboot of undercloud” : Bugs : tripleo

James Slagle (james-slagle) wrote on 2016-08-12:

#1

see also:
https://bugzilla.redhat.com/show_bug.cgi?id=1348700

Comment 34 John Eckersberg 2016-08-09 11:12:31 EDT

So here's what I think is wrong.

We've got...

[root@undercloud ~]# grep After /usr/lib/systemd/system/rabbitmq-server.service 
After=network.target epmd@0.0.0.0.socket

After=network.target is pretty standard.  All sorts of stuff uses it, openstack services and core services alike:

[root@undercloud ~]# grep -l 'After=.*network.target' /usr/lib/systemd/system/*.service | wc -l
128

So what's the deal?  Relevant excerpts from https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. Its primary purpose is for ordering things properly at shutdown: since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is ordered After=network.target can be sure that it is stopped before the network is shut down if the system is powered off."

OK, not what I would expect.

"network-online.target is a target that actively waits until the network is "up", where the definition of "up" is defined by the network management software. Usually it indicates a configured, routable IP address of some kind. [...] It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), its primary purpose is network client software that cannot operate without network."

OK, so in theory we should not need that for our servers.  But in practice, I think we do.  Here's why.  We explicitly bind our services to non-loopback/non-wildcard addresses:

[root@undercloud ~]# grep '.*listen.*192' /etc/nova/nova.conf
ec2_listen=192.0.2.1
osapi_compute_listen=192.0.2.1
metadata_listen=192.0.2.1
osapi_volume_listen=192.0.2.1
[root@undercloud ~]# grep 192 /etc/rabbitmq/rabbitmq-env.conf 
NODE_IP_ADDRESS=192.0.2.1

etc.

And most importantly, on the undercloud, we do *not* set ip_nonlocal_bind:

[root@undercloud ~]# cat /proc/sys/net/ipv4/ip_nonlocal_bind 
0

Whereas we *do* set that option on the overcloud so haproxy can bind to the VIP addresses even when the host does not currently have the VIP.

So the race that happens on the undercloud is roughly:

(1) systemd reaches network.target, interface not yet configured with address
(2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit fails
(3) NetworkManager configures address on interface

Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.
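
Step (2) of the race can be reproduced outside systemd. The following is a minimal Python sketch, not part of the original comment or of the eventual fix: it reads the sysctl and attempts a plain TCP bind. The helper names `nonlocal_bind_enabled` and `try_bind` are made up for illustration; 192.0.2.1 is the undercloud address from the config excerpts above.

```python
import errno
import socket

NONLOCAL_BIND_PATH = "/proc/sys/net/ipv4/ip_nonlocal_bind"


def nonlocal_bind_enabled(path=NONLOCAL_BIND_PATH):
    """Return True if the sysctl file at `path` contains 1, else False."""
    try:
        with open(path) as handle:
            return handle.read().strip() == "1"
    except OSError:
        # File missing or unreadable (for example, not a Linux host).
        return False


def try_bind(address, port=0):
    """Try to bind a TCP socket; return None on success, else the errno name."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Port 0 lets the kernel pick a free port, so only the address matters.
            sock.bind((address, port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    print("ip_nonlocal_bind enabled:", nonlocal_bind_enabled())
    print("bind 127.0.0.1:", try_bind("127.0.0.1") or "ok")
    print("bind 192.0.2.1:", try_bind("192.0.2.1") or "ok")
```

On a host where 192.0.2.1 is not yet configured and ip_nonlocal_bind is 0, the last bind is expected to fail with EADDRNOTAVAIL ("Cannot assign requested address"), which is the failure a service hits in step (2). With the sysctl set to 1, the same call should succeed.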

So, how to fix?

Option 1 is to set ip_nonlocal_bind on the undercloud.
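
To illustrate what Option 1 changes from a service's point of view: ip(7) describes the IP_FREEBIND socket option as the per-socket equivalent of ip_nonlocal_bind. The sketch below is an illustration only and is Linux-specific; it is not what this bug's fix does (the fix sets the sysctl system-wide). The helper name `bind_freebind` is hypothetical, and the fallback value 15 is assumed from `<linux/in.h>` for Python builds that do not export `socket.IP_FREEBIND`.

```python
import errno
import socket

# Assumption: fall back to the Linux constant if this Python build lacks it.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def bind_freebind(address, port=0):
    """Bind a TCP socket with IP_FREEBIND set.

    Returns None on success, else the errno name of the failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Allow binding to an address that is not (yet) configured locally.
            sock.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
            sock.bind((address, port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    print("freebind 192.0.2.1:", bind_freebind("192.0.2.1") or "ok")
```

With ip_nonlocal_bind set to 1, every socket on the host gets this behaviour without any change to the services themselves, which is why the sysctl is the less invasive route here.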

Option 2 from the systemd docs is to enable NetworkManager-wait-online.service, which causes network.target to behave the same as network-online.target.  This will ensure that the interfaces are up and addresses configured before starting the rest of the services.

I don't really have an opinion either way.

Changed in tripleo:
status:	New → In Progress
importance:	Undecided → Medium
assignee:	nobody → James Slagle (james-slagle)
milestone:	none → newton-3

OpenStack Infra (hudson-openstack) wrote on 2016-08-12: Fix proposed to instack-undercloud (master)

#2

Fix proposed to branch: master
Review: https://review.openstack.org/355051

OpenStack Infra (hudson-openstack) wrote on 2016-08-15: Fix proposed to instack-undercloud (stable/mitaka)

#3

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/355612

OpenStack Infra (hudson-openstack) on 2016-08-15

Changed in tripleo:
assignee:	James Slagle (james-slagle) → Emilien Macchi (emilienm)

OpenStack Infra (hudson-openstack) wrote on 2016-08-17:

#4

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/356372

OpenStack Infra (hudson-openstack) wrote on 2016-08-17: Fix merged to instack-undercloud (master)

#5

Reviewed: https://review.openstack.org/355051
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=fd167c1a9d24650df69578dfa93b86e9874a79fd
Submitter: Jenkins
Branch: master

commit fd167c1a9d24650df69578dfa93b86e9874a79fd
Author: James Slagle <email address hidden>
Date: Fri Aug 12 15:55:01 2016 -0400

Enable sysctl nonlocal_bind

    Sometimes after rebooting an undercloud some services will fail to start
    because the IP address has not yet been configured on br-ctlplane.
    Setting nonlocal_bind in sysctl will allow the services to bind to the
    IP anyway.

    Depends-On: I24ab535b01e2724af457d39c03cd990c574ef0aa
    Change-Id: Iac7c4a86f796e9ad0b1d7a08d8807579ba8964bd
    Closes-Bug: #1612789

Changed in tripleo:
status:	In Progress → Fix Released
tags:	added: in-stable-mitaka

OpenStack Infra (hudson-openstack) wrote on 2016-08-17: Fix merged to instack-undercloud (stable/mitaka)

#6

Reviewed: https://review.openstack.org/355612
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=e716914033b55b0d0e2e8976d80c3ab1653fca72
Submitter: Jenkins
Branch: stable/mitaka

commit e716914033b55b0d0e2e8976d80c3ab1653fca72
Author: Emilien Macchi <email address hidden>
Date: Mon Aug 15 15:20:16 2016 -0400

(mitaka only) enable sysctl nonlocal_bind

    Sometimes after rebooting an undercloud some services will fail to start
    because the IP address has not yet been configured on br-ctlplane.
    Setting nonlocal_bind in sysctl will allow the services to bind to the
    IP anyway.

    The patch will be addressed to master but using the TripleO profiles
    instead.

    Co-Authorized-By: James Slagle <email address hidden>
    Change-Id: Ifb009b781d00729d6674fbcc43d844404998ed8e
    Closes-Bug: #1612789

OpenStack Infra (hudson-openstack) wrote on 2016-08-17: Change abandoned on instack-undercloud (stable/mitaka)

#7

Change abandoned by Jiri Stransky (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/356372
Reason: merged https://review.openstack.org/#/c/355612/

Doug Hellmann (doug-hellmann) wrote on 2016-08-30: Fix included in openstack/instack-undercloud 4.2.0

#8

This issue was fixed in the openstack/instack-undercloud 4.2.0 release.

OpenStack Infra (hudson-openstack) wrote on 2016-09-01: Fix included in openstack/instack-undercloud 5.0.0.0b3

#9

This issue was fixed in the openstack/instack-undercloud 5.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote on 2016-11-10: Fix included in openstack/instack-undercloud 4.2.0

#10

This issue was fixed in the openstack/instack-undercloud 4.2.0 release.

tripleo

different services failed to load after reboot of undercloud
