Corosync doesn't start on boot if bonding is configured: 'No nodelist defined or our node is not in the nodelist'
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Fuel for OpenStack | Fix Released | High | Dmitry Bilunov | | |
8.0.x | Fix Released | High | Dmitry Bilunov | | |
Bug Description
After a cluster reboot, Corosync doesn't start on the controllers (except the one that was booted first) if a network bond is configured for the management network and that network is VLAN-tagged:
root@node-2:~# crm_mon -1
Last updated: Thu Feb 18 09:17:06 2016
Last change: Wed Feb 17 14:21:54 2016
Stack: corosync
Current DC: node-2.
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured
Online: [ node-2.
OFFLINE: [ node-1.
# corosync.log (I enabled debug):
Feb 18 09:08:38 [4986] node-3.
Feb 18 09:08:38 [4986] node-3.
...
Feb 18 09:08:38 [4986] node-3.
Feb 18 09:08:38 [4986] node-3.
Feb 18 09:08:38 [4986] node-3.
nfigured!'
Feb 18 09:08:38 [4986] node-3.
After boot, however, I can SSH to the controllers and start Corosync manually without any errors.
Steps to reproduce:
1. Prepare virtual machines (http://
2. Create new environment, choose Neutron + VXLAN
3. Add 3 controller and 2 compute+ceph nodes
4. Configure network bonds for all nodes (add enp0s5, enp0s6, enp0s7, enp0s8 to bond0 and assign all networks to it except "Admin (pxe)"). Management network should be VLAN tagged.
5. Verify networks
6. Deploy the environment
7. Verify networks
8. Run OSTF tests
9. Save network settings from slave nodes. Reboot all environment slave nodes. Verify that network settings are the same after reboot.
10. Verify networks
11. Run OSTF tests
Expected result: after reboot cloud works fine
Actual result: after reboot cloud services are down
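The comparison in step 9 can be sketched as follows. This is only an illustration: it assumes the network settings were saved as the text output of `ip -o -4 addr show` before and after the reboot, which the report does not state, and the helper names `parse_ip_addr` and `diff_settings` are hypothetical.

```python
def parse_ip_addr(output):
    """Map interface name -> set of 'addr/prefix' strings.

    Assumes `output` is text from `ip -o -4 addr show`, where each line
    looks like: "<idx>: <ifname> inet <addr>/<prefix> ...".
    """
    result = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            result.setdefault(fields[1], set()).add(fields[3])
    return result


def diff_settings(before, after):
    """Return {ifname: (addrs_before, addrs_after)} for interfaces that differ."""
    a, b = parse_ip_addr(before), parse_ip_addr(after)
    return {
        name: (a.get(name, set()), b.get(name, set()))
        for name in sorted(set(a) | set(b))
        if a.get(name, set()) != b.get(name, set())
    }
```

An empty result from `diff_settings` means the IPv4 addressing is identical in both snapshots. Note that this compares the final state only, so it would not reveal an address that appears late during boot.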
Diagnostic snapshot: https:/
According to the logs, Corosync doesn't find the node's IP address in the 'nodelist' from corosync.conf. I guess it checks all interfaces that are ready at start time and compares their IP addresses with those from the config. I modified the Corosync init script to run the daemon under 'strace'; here is part of its output:
http://
As you can see, there is no 'br-mgmt' in the list. If I just add 'sleep 2' to the init script before starting the Corosync daemon, everything works fine.
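A fixed 'sleep 2' only hides the race. A less timing-dependent workaround would be to poll until one of the addresses from the nodelist is actually assigned locally, and only then start the daemon. The sketch below is not the fix that was released for this bug. It assumes the nodelist entries use `ring0_addr` with literal IPv4 addresses and that the `ip` utility is available. The function names are hypothetical, and how Corosync itself matches addresses is only the reporter's guess.

```python
import re
import subprocess
import time


def nodelist_addrs(conf_text):
    """Return the ring0_addr values found in corosync.conf text."""
    return set(re.findall(r"^[ \t]*ring0_addr:[ \t]*(\S+)", conf_text, re.MULTILINE))


def local_ipv4_addrs():
    """Return IPv4 addresses currently assigned, per `ip -o -4 addr show`."""
    out = subprocess.check_output(["ip", "-o", "-4", "addr", "show"], text=True)
    addrs = set()
    for line in out.splitlines():
        fields = line.split()
        # Expected shape: "<idx>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            addrs.add(fields[3].split("/")[0])
    return addrs


def wait_for_nodelist_addr(conf_text, get_addrs=local_ipv4_addrs,
                           timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll until a nodelist address is local; return it, or None on timeout."""
    wanted = nodelist_addrs(conf_text)
    waited = 0.0
    while True:
        found = wanted & get_addrs()
        if found:
            return sorted(found)[0]
        if waited >= timeout:
            return None
        sleep(interval)
        waited += interval
```

An init script could call `wait_for_nodelist_addr(open("/etc/corosync/corosync.conf").read())` and start Corosync only when it returns an address. The `get_addrs` and `sleep` parameters exist so the loop can be exercised without touching real interfaces.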
Changed in fuel: | |
status: | New → Confirmed |
tags: | added: feature-bonding |
Changed in fuel: | |
assignee: | Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin) |
tags: | added: team-bugfix |
tags: | added: release-notes |
tags: | added: 8.0 release-notes-done removed: release-notes |
Changed in fuel: | |
status: | Confirmed → Fix Committed |
I looked through the logs, but so far I cannot find the reason for this strange behaviour. There is an interval of more than 1 minute between networking start and Corosync start (Corosync starts later). Moreover, there is a strict dependency between the start of networking and the start of ANY of the rc-sysinit services, so there MUST be an IP on br-mgmt by the time Corosync starts. This might also be an upstart bug.