Corosync doesn't start on boot if bonding is configured: 'No nodelist defined or our node is not in the nodelist'

Bug #1546947 reported by Artem Panchenko
This bug affects 1 person
Affects              Status        Importance  Assigned to      Milestone
Fuel for OpenStack   Fix Released  High        Dmitry Bilunov
  8.0.x              Fix Released  High        Dmitry Bilunov

Bug Description

After a cluster reboot, Corosync doesn't start on the controllers (except the one that was booted first) if a network bond is configured for the management network and it is VLAN tagged:

root@node-2:~# crm_mon -1
Last updated: Thu Feb 18 09:17:06 2016
Last change: Wed Feb 17 14:21:54 2016
Stack: corosync
Current DC: node-2.test.domain.local (2) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured

Online: [ node-2.test.domain.local ]
OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]

# corosync.log (I enabled debug):

Feb 18 09:08:38 [4986] node-3.test.domain.local corosync notice [TOTEM ] timer_function_netif_check_timeout The network interface is down.
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [TOTEM ] main_iface_change_fn Created or loaded sequence id 0.127.0.0.1 for this ring.
...
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [VOTEQ ] votequorum_read_nodelist_configuration No nodelist defined or our node is not in the nodelist
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync crit [QUORUM] quorum_exec_init_fn Quorum provider: corosync_votequorum failed to initialize.
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [SERV ] corosync_service_defaults_link_and_init Service engine 'corosync_quorum' failed to load for reason '
nfigured!'
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [MAIN ] _corosync_exit_error Corosync Cluster Engine exiting with status 20 at service.c:356.

But after boot I can SSH to controllers and start Corosync manually without any errors.
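
For reference, the manual recovery on an affected controller amounts to starting the services by hand; a minimal sketch, assuming the stock init scripts shipped with the corosync and pacemaker packages:

# Manual recovery on an affected controller (sketch):
service corosync start      # succeeds now because br-mgmt already has its IP
service pacemaker start     # bring Pacemaker back on top of corosync
crm_mon -1                  # the node should rejoin and the partition regain quorum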

Steps to reproduce:

1. Prepare virtual machines (http://paste.openstack.org/show/487375/)
2. Create new environment, choose Neutron + VXLAN
3. Add 3 controller and 2 compute+ceph nodes
4. Configure network bonds for all nodes (add enp0s5, enp0s6, enp0s7, enp0s8 to bond0 and assign all networks to it except "Admin (pxe)"). The management network should be VLAN tagged (see the interface sketch after this list).
5. Verify networks
6. Deploy the environment
7. Verify networks
8. Run OSTF tests
9. Save network settings from slave nodes. Reboot all environment slave nodes. Verify that network settings are the same after reboot.
10. Verify networks
11. Run OSTF tests
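
For reference, the bonded layout from step 4 ends up on the nodes roughly as the ifupdown stanzas below. This is a simplified sketch: the interface names, the VLAN ID (101), the address and the file path are assumptions, not the literal configuration Fuel generates.

# Simplified sketch of the node-side network config resulting from step 4
# (names, VLAN ID, address and file layout are assumptions):
cat > /etc/network/interfaces.d/ifcfg-mgmt-sketch <<'EOF'
auto bond0
iface bond0 inet manual
    bond-slaves enp0s5 enp0s6 enp0s7 enp0s8
    bond-mode balance-rr

auto bond0.101
iface bond0.101 inet manual
    vlan-raw-device bond0

auto br-mgmt
iface br-mgmt inet static
    bridge_ports bond0.101
    address 192.168.0.2
    netmask 255.255.255.0
EOF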

Expected result: after reboot cloud works fine

Actual result: after reboot cloud services are down

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkTU5kMDlMRnBsQ3M/view?usp=sharing

According to the logs, Corosync doesn't find the node's IP in the 'nodelist' from corosync.conf. I guess it checks all interfaces that are up while starting and compares their IP addresses with those from the config. I modified the Corosync init script and added 'strace'; here is a part of its output:

http://paste.openstack.org/show/487378/

As you can see, there is no 'br-mgmt' in the list. If I just add 'sleep 2' into the init script before starting the Corosync daemon, then everything works fine.
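
To illustrate that check, here is a rough diagnostic sketch that mirrors the hypothesis above: it compares the ring0_addr entries from the nodelist section of /etc/corosync/corosync.conf with the addresses currently configured on the node. The loop and messages are illustrative only, not part of Corosync itself.

# Rough diagnostic sketch: is any nodelist address configured locally yet?
CONF=/etc/corosync/corosync.conf
for addr in $(awk '/ring0_addr/ {print $NF}' "$CONF"); do
    if ip -o addr show | grep -Fqw "$addr"; then
        echo "nodelist address $addr is configured locally"
    else
        echo "nodelist address $addr NOT found on any local interface"
    fi
done
# The specific symptom in this bug: br-mgmt has no IPv4 address yet at start.
if ! ip -o -4 addr show dev br-mgmt 2>/dev/null | grep -q 'inet '; then
    echo "br-mgmt has no IPv4 address - the votequorum nodelist check would fail"
fi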

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Changed in fuel:
status: New → Confirmed
tags: added: feature-bonding
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I looked through the logs, but so far I cannot find the reason for such strange behaviour. There is more than a one-minute interval between the corosync start and the networking start (corosync starts later). Moreover, there is a strict dependency between the start of networking and the start of ANY of the rc-sysinit services, so there MUST be an IP on br-mgmt by the time Corosync starts. This might also be an upstart bug.
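
The dependency chain described above can be inspected with upstart's own tooling; a short sketch (the job names are the ones shipped in Ubuntu 14.04, the comments describe typical stock configuration):

# Inspect the upstart event chain on a trusty node (sketch):
initctl show-config networking   # the job that runs 'ifup -a'
initctl show-config rc-sysinit   # typically 'start on (filesystem and static-network-up) or failsafe-boot'
initctl show-config failsafe     # the fallback that can let rc-sysinit proceed without networking
grep -r static-network-up /etc/init /etc/network/if-up.d 2>/dev/null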

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Dmitry Bilunov (dbilunov)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If starting it manually later works, this is not a corosync configuration issue but a networking configuration race condition.

tags: added: l23network
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

I believe we've found the root cause of the problem - the 'static-network-up' event is emitted before the management network is actually configured (probably related to upstream bug #1379427). If I change 'inet manual' to 'inet static' in the config files for the bonds and their sub-interfaces (VLANs), then corosync starts fine on node boot.
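
A quick way to see which method each stanza uses on an affected node (and to verify the change described above) is a sketch like the following; the sub-interface name is an assumption:

# Check which method each ifupdown stanza uses (sketch; interface name assumed):
grep -rnE 'inet (manual|static)' /etc/network/interfaces /etc/network/interfaces.d 2>/dev/null
ifquery bond0.101    # prints the parsed method and options for the sub-interface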

Revision history for this message
Dmitry Bilunov (dbilunov) wrote :

So far the known facts are:

1. The bug sometimes does not reproduce on "bad" environments.
2. The bug cannot be reproduced on "good" environments. I have tried to deploy a "bad" environment three times and failed to reproduce the bug even after several reboots.
3. The modification times of the upstart logs show that there are two groups of network interfaces which are brought up during different time intervals. Some services are started between these intervals, so the ones depending on the latter group fail.
( http://paste.openstack.org/show/487418/ )
4. Adding "sleep 5" inside /etc/network/if-up.d/upstart significantly reduces the chance of hitting this bug (a sketch follows this list).
5. Corosync can be started manually in case of failure.
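
A crude sketch of the mitigation from item 4 (placement and duration are a diagnostic choice, not a recommended fix):

# Add a short delay to the hook that may emit 'static-network-up' (sketch):
sed -i '2i sleep 5' /etc/network/if-up.d/upstart
head -n 5 /etc/network/if-up.d/upstart   # verify the delay landed right after the shebang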

Revision history for this message
Dmitry Bilunov (dbilunov) wrote :

If we redirect ifquery's stderr to a file, then after the system boots it contains the following errors:

ifquery: recursion detected for parent interface bond1 in post-up phase
ifquery: recursion detected for interface bond1.101 in post-up phase
ifquery: recursion detected for parent interface bond1 in parent-lock phase
ifquery: recursion detected for parent interface bond1 in parent-lock phase
ifquery: recursion detected for parent interface bond1 in parent-lock phase
ifquery: recursion detected for interface p_ff798dba-0 in post-up phase

Revision history for this message
Dmitry Bilunov (dbilunov) wrote :

Found a workaround we can easily implement - adding a dummy interface to the br-mgmt bridge fixes the problem.
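
A minimal sketch of that workaround, assuming br-mgmt is a plain Linux bridge and using a hypothetical interface name:

# Attach an always-present dummy port to br-mgmt so the bridge (and its IP)
# no longer depends on the bond/VLAN bring-up timing. 'dummy-mgmt' is a
# hypothetical name; an OVS bridge would need 'ovs-vsctl add-port' instead.
modprobe dummy
ip link add dummy-mgmt type dummy
ip link set dummy-mgmt master br-mgmt
ip link set dummy-mgmt up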

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Dmitry, I think I've found why the 'static-network-up' event is emitted too early: there is a regression in the Ubuntu ifupdown package (see bug #1545302). Briefly, what happens:

1. Corosync is started by rc-sysinit because it has an old-style init script; rc-sysinit is started after umm, and umm after the 'static-network-up' event. Corosync has the highest priority in the rc sequence, so it is effectively executed right after networking.

2. The 'networking' service simply executes the `ifup -a` command when starting. 'ifup' goes over all network configuration files and brings up the 'auto' interfaces. At the end of each interface bring-up, 'ifup' executes the '/etc/network/if-up.d/upstart' [0] script, which emits 'static-network-up' only if *all* auto interfaces are UP (a simplified sketch of this hook follows the list below).

3. The 'static-network-up' event is emitted during the 'bond1.101' configuration due to a bug in the ifupdown code - the command on line #32 in [0] returns only 2 interfaces (it *must* always print *all* auto interfaces), and as mentioned above 'ifquery' prints warnings to stderr:

root@node-2:~# env IFUPDOWN_bond1.101=post-up > /dev/null
root@node-2:~# ifquery --list
lo
bond0
ifquery: recursion detected for interface bond1 in parent-lock phase
ifquery: recursion detected for parent interface bond1 in parent-lock phase

4. Since the network is "ready", corosync is started, but br-mgmt is still not configured, so the service fails.
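
To make items 2-3 concrete, below is a simplified sketch of the logic in '/etc/network/if-up.d/upstart'. It is not the verbatim Ubuntu script; the key point is only that the hook relies on 'ifquery --list' returning every 'auto' interface before it emits the event.

#!/bin/sh
# Simplified sketch of the /etc/network/if-up.d/upstart hook logic (NOT the
# verbatim Ubuntu script): after each interface is brought up, check whether
# every 'auto' interface is configured and, only then, emit static-network-up.
for iface in $(ifquery --list --allow auto 2>/dev/null); do
    # If the ifquery recursion-detection bug truncates this list (item 3),
    # the check passes too early and the event is emitted prematurely.
    ifquery --state "$iface" >/dev/null 2>&1 || exit 0
done
initctl emit --no-wait static-network-up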

Here is the patch where recursion detection was added to ifupdown (see bug #1532722): http://launchpadlibrarian.net/235926320/ifupdown_0.7.47.2ubuntu4.1_0.7.47.2ubuntu4.3.diff.gz In comment [1] on bug #1545302 @dgadomski suggests a patch which fixes the regression. I installed his ifupdown package (see the instructions here [2]) and confirm that it helps: corosync starts fine on boot.

I also checked the old ifupdown package [3]; with it the problem is gone too.

[0] http://paste.openstack.org/show/487460/
[1] https://bugs.launchpad.net/ubuntu/+source/wpa/+bug/1545302/comments/12
[2] https://bugs.launchpad.net/ubuntu/+source/wpa/+bug/1545302/comments/11
[3] http://launchpadlibrarian.net/175234326/ifupdown_0.7.47.2ubuntu4.1_amd64.deb
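
For completeness, rolling a node back to the old package from [3] is just a manual install plus a hold; a sketch (the pinning strategy is up to the operator):

# Install the older ifupdown build referenced as [3] above and keep apt from
# upgrading it back until a fixed package is released.
wget http://launchpadlibrarian.net/175234326/ifupdown_0.7.47.2ubuntu4.1_amd64.deb
dpkg -i ifupdown_0.7.47.2ubuntu4.1_amd64.deb
apt-mark hold ifupdown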

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

I checked the same case with LACP and active-backup bonds and can confirm that they are affected too. Using an older version of ifupdown resolves the issue. We didn't catch this in the automated tests because of the lack of checks after reboot:

https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_bonding.py#L239-L240
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_bonding.py#L323-L324

I'm going to prepare a patch for tests which adds OSTF execution after cloud restart.

summary: - Corosync doesn't start on boot if balance-rr bonding is configured: 'No
- nodelist defined or our node is not in the nodelist'
+ Corosync doesn't start on boot if bonding is configured: 'No nodelist
+ defined or our node is not in the nodelist'
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/282089

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/282089
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=f75c8e3174d394060588f8f3866c2f4cab830624
Submitter: Jenkins
Branch: master

commit f75c8e3174d394060588f8f3866c2f4cab830624
Author: Artem Panchenko <email address hidden>
Date: Fri Feb 19 00:43:30 2016 +0200

    Check cloud health after reboot in bonding tests

    Add execution of network verification and OSTF in
    order to check that all cloud services are started
    and work properly after cluster reboot

    Change-Id: I7e1868efc7d2a34e1cb1486f6fd678e20c8850b4
    Related-bug: #1546947

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/ifupdown (master)

Related fix proposed to branch: master
Change author: Aleksandr Mogylchenko <email address hidden>
Review: https://review.fuel-infra.org/17305

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/ifupdown (9.0)

Related fix proposed to branch: 9.0
Change author: Aleksandr Mogylchenko <email address hidden>
Review: https://review.fuel-infra.org/17307

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/ifupdown (9.0)

Change abandoned by Aleksandr Mogylchenko <email address hidden> on branch: 9.0
Review: https://review.fuel-infra.org/17307

tags: added: team-bugfix
tags: added: release-notes
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/7.0)

Related fix proposed to branch: stable/7.0
Review: https://review.openstack.org/285115

tags: added: 8.0 release-notes-done
removed: release-notes
Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Dmitry, can you link to the proper patch that fixed this?

Revision history for this message
Dmitry Bilunov (dbilunov) wrote :

This issue is fixed by building a patched package: https://bugs.launchpad.net/fuel/+bug/1547545

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Moving to 'fix-committed' for 8.0, because the fix for 'ifupdown' was provided in Ubuntu trusty: https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1545302

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Verified on baremetal lab with LACP bonds (Fuel 8.0 GA):

root@node-3:~# dpkg -l | grep ifupdown
ii ifupdown 0.7.47.2ubuntu4.4 amd64 high level tools to configure network interface

http://paste.openstack.org/show/490111/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "570"
  build_id: "570"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/ifupdown (master)

Change abandoned by Dmitry Teselkin <email address hidden> on branch: master
Review: https://review.fuel-infra.org/17305
Reason: updated ifupdown was released to trusty-updates, we don't need this CR anymore
