different services failed to load after reboot of undercloud

Bug #1612789 reported by James Slagle
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Emilien Macchi

Bug Description

Occasionally after an undercloud reboot, some services will fail to load:

[stack@instack ~]$ sudo systemctl|grep -i fail
● neutron-server.service loaded failed failed OpenStack Neutron Server
● ovirt-guest-agent.service loaded failed failed oVirt Guest Agent
● postfix.service loaded failed failed Postfix Mail Transport Agent
● rabbitmq-server.service loaded failed failed RabbitMQ broker

Manually starting rabbitmq and neutron-server works.

Workaround:
systemctl start rabbitmq-server.service
systemctl start neutron-server.service
nova list

James Slagle (james-slagle) wrote :

see also:
https://bugzilla.redhat.com/show_bug.cgi?id=1348700

Comment 34 John Eckersberg 2016-08-09 11:12:31 EDT

So here's what I think is wrong.

We've got...

[root@undercloud ~]# grep After /usr/lib/systemd/system/rabbitmq-server.service
After=network.target epmd@0.0.0.0.socket

After=network.target is pretty standard. All sorts of stuff uses it, openstack services and core services alike:

[root@undercloud ~]# grep -l 'After=.*network.target' /usr/lib/systemd/system/*.service | wc -l
128

So what's the deal? Relevant excerpts from https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. Its primary purpose is for ordering things properly at shutdown: since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is ordered After=network.target can be sure that it is stopped before the network is shut down if the system is powered off."

OK, not what I would expect.

"network-online.target is a target that actively waits until the network is "up", where the definition of "up" is defined by the network management software. Usually it indicates a configured, routable IP address of some kind. [...] It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), its primary purpose is network client software that cannot operate without network."

OK, so in theory we should not need that for our servers. But in practice, I think we do. Here's why. We explicitly bind our services to non-loopback/non-wildcard addresses:

[root@undercloud ~]# grep '.*listen.*192' /etc/nova/nova.conf
ec2_listen=192.0.2.1
osapi_compute_listen=192.0.2.1
metadata_listen=192.0.2.1
osapi_volume_listen=192.0.2.1
[root@undercloud ~]# grep 192 /etc/rabbitmq/rabbitmq-env.conf
NODE_IP_ADDRESS=192.0.2.1

etc.

And most importantly, on the undercloud, we do *not* set ip_nonlocal_bind:

[root@undercloud ~]# cat /proc/sys/net/ipv4/ip_nonlocal_bind
0

Whereas we *do* set that option on the overcloud so haproxy can bind to the VIP addresses even when the host does not currently have the VIP.

So the race that happens on the undercloud is roughly:

(1) systemd reaches network.target, interface not yet configured with address
(2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit fails
(3) NetworkManager configures address on interface

Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.
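The race above comes down to two facts at the moment a service calls bind(): whether the address is already configured on an interface, and whether net.ipv4.ip_nonlocal_bind is set. A minimal sketch of that decision as a shell function follows. The function name bind_outcome and its two arguments are hypothetical, made up for illustration only:

```shell
# bind_outcome CONFIGURED NONLOCAL
#   CONFIGURED: "yes" if the address is already on an interface, else "no"
#   NONLOCAL:   value of net.ipv4.ip_nonlocal_bind (0 or 1)
# Prints "ok" if a bind() to the address would succeed, "fail" otherwise.
# Hypothetical helper, not part of any OpenStack or systemd tooling.
bind_outcome() {
    if [ "$1" = "yes" ] || [ "$2" = "1" ]; then
        echo ok
    else
        echo fail
    fi
}

bind_outcome no 0    # undercloud at step (2): address missing, sysctl off
bind_outcome no 1    # overcloud: address missing, but sysctl on
bind_outcome yes 0   # undercloud after step (3): address configured
```

The three calls print fail, ok, ok. On a real host the two inputs could be read from the output of "ip -4 -o addr show" and from /proc/sys/net/ipv4/ip_nonlocal_bind.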

So, how to fix?

Option 1 is to set ip_nonlocal_bind on the undercloud.
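For reference, a manual equivalent of Option 1 might look like the sketch below. The helper name write_nonlocal_bind_conf and the file name 99-nonlocal-bind.conf are assumptions chosen for illustration, and this is not necessarily how instack-undercloud applies the setting:

```shell
# write_nonlocal_bind_conf DIR
# Writes a sysctl drop-in that enables net.ipv4.ip_nonlocal_bind into DIR.
# DIR would be /etc/sysctl.d on a real host; the file name is hypothetical.
write_nonlocal_bind_conf() {
    dir="$1"
    mkdir -p "$dir" || return 1
    printf '%s\n' 'net.ipv4.ip_nonlocal_bind = 1' > "$dir/99-nonlocal-bind.conf"
}

# Usage (as root), then reload all sysctl drop-ins:
#   write_nonlocal_bind_conf /etc/sysctl.d
#   sysctl --system
```

Writing a drop-in file rather than only running sysctl -w keeps the setting across reboots, which matters here because the failure only shows up at boot.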

Option 2 from the systemd docs is to enable NetworkManager-wait-online.service, which causes network.target to behave the same as network-online.target. This will ensure that the interfaces are up and addresses configured before starting the rest of the services.
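Option 2 could be applied roughly as follows. This is only a sketch: the wrapper function and the SYSTEMCTL override are hypothetical, and only the unit name comes from the comment above:

```shell
# enable_wait_online
# Enables the unit named in Option 2, then reports its enablement state.
# Needs root on a real host.
# SYSTEMCTL is a hypothetical override; set SYSTEMCTL=echo for a dry run
# that prints the systemctl arguments instead of executing them.
enable_wait_online() {
    "${SYSTEMCTL:-systemctl}" enable NetworkManager-wait-online.service &&
    "${SYSTEMCTL:-systemctl}" is-enabled NetworkManager-wait-online.service
}
```

A reboot would still be needed to confirm that the services now come up cleanly, since the race only occurs during boot.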

I don't really have an opinio...


Changed in tripleo:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → James Slagle (james-slagle)
milestone: none → newton-3
OpenStack Infra (hudson-openstack) wrote : Fix proposed to instack-undercloud (master)

Fix proposed to branch: master
Review: https://review.openstack.org/355051

OpenStack Infra (hudson-openstack) wrote : Fix proposed to instack-undercloud (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/355612

Changed in tripleo:
assignee: James Slagle (james-slagle) → Emilien Macchi (emilienm)
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/356372

OpenStack Infra (hudson-openstack) wrote : Fix merged to instack-undercloud (master)

Reviewed: https://review.openstack.org/355051
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=fd167c1a9d24650df69578dfa93b86e9874a79fd
Submitter: Jenkins
Branch: master

commit fd167c1a9d24650df69578dfa93b86e9874a79fd
Author: James Slagle <email address hidden>
Date: Fri Aug 12 15:55:01 2016 -0400

    Enable sysctl nonlocal_bind

    Sometimes after rebooting an undercloud some services will fail to start
    because the IP address has not yet been configured on br-ctlplane.
    Setting nonlocal_bind in sysctl will allow the services to bind to the
    IP anyway.

    Depends-On: I24ab535b01e2724af457d39c03cd990c574ef0aa
    Change-Id: Iac7c4a86f796e9ad0b1d7a08d8807579ba8964bd
    Closes-Bug: #1612789

Changed in tripleo:
status: In Progress → Fix Released
tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix merged to instack-undercloud (stable/mitaka)

Reviewed: https://review.openstack.org/355612
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=e716914033b55b0d0e2e8976d80c3ab1653fca72
Submitter: Jenkins
Branch: stable/mitaka

commit e716914033b55b0d0e2e8976d80c3ab1653fca72
Author: Emilien Macchi <email address hidden>
Date: Mon Aug 15 15:20:16 2016 -0400

    (mitaka only) enable sysctl nonlocal_bind

    Sometimes after rebooting an undercloud some services will fail to start
    because the IP address has not yet been configured on br-ctlplane.
    Setting nonlocal_bind in sysctl will allow the services to bind to the
    IP anyway.

    The patch will be addressed to master but using the TripleO profiles
    instead.

    Co-Authorized-By: James Slagle <email address hidden>
    Change-Id: Ifb009b781d00729d6674fbcc43d844404998ed8e
    Closes-Bug: #1612789

OpenStack Infra (hudson-openstack) wrote : Change abandoned on instack-undercloud (stable/mitaka)

Change abandoned by Jiri Stransky (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/356372
Reason: merged https://review.openstack.org/#/c/355612/

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/instack-undercloud 4.2.0

This issue was fixed in the openstack/instack-undercloud 4.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/instack-undercloud 5.0.0.0b3

This issue was fixed in the openstack/instack-undercloud 5.0.0.0b3 development milestone.

