Activity log for bug #1546947

Date Who What changed Old value New value Message
2016-02-18 09:40:58 Artem Panchenko bug added bug
2016-02-18 09:40:58 Artem Panchenko attachment added corosync.log https://bugs.launchpad.net/bugs/1546947/+attachment/4574261/+files/corosync.log
2016-02-18 09:41:15 Artem Panchenko nominated for series fuel/8.0.x
2016-02-18 09:41:15 Artem Panchenko bug task added fuel/8.0.x
2016-02-18 09:41:22 Artem Panchenko fuel/8.0.x: milestone 8.0
2016-02-18 09:41:32 Artem Panchenko fuel/8.0.x: assignee Fuel Library Team (fuel-library)
2016-02-18 09:41:35 Artem Panchenko fuel/8.0.x: importance Undecided High
2016-02-18 09:41:39 Artem Panchenko fuel/8.0.x: milestone 8.0 8.0-updates
2016-02-18 09:53:04 Nastya Urlapova fuel: status New Confirmed
2016-02-18 09:53:06 Nastya Urlapova fuel/8.0.x: status New Confirmed
2016-02-18 10:07:27 Matthew Mosesohn tags area-library area-library feature-bonding
2016-02-18 10:10:54 Vladimir Kuklin fuel: assignee Fuel Library Team (fuel-library) Vladimir Kuklin (vkuklin)
2016-02-18 10:10:58 Vladimir Kuklin fuel/8.0.x: assignee Fuel Library Team (fuel-library) Vladimir Kuklin (vkuklin)
2016-02-18 10:51:23 Vladimir Kuklin fuel: assignee Vladimir Kuklin (vkuklin) Dmitry Bilunov (dbilunov)
2016-02-18 10:51:29 Vladimir Kuklin fuel/8.0.x: assignee Vladimir Kuklin (vkuklin) Dmitry Bilunov (dbilunov)
2016-02-18 10:59:15 Bogdan Dobrelya tags area-library feature-bonding area-library feature-bonding l23network
2016-02-18 22:28:14 Artem Panchenko summary
    Old value: Corosync doesn't start on boot if balance-rr bonding is configured: 'No nodelist defined or our node is not in the nodelist'
    New value: Corosync doesn't start on boot if bonding is configured: 'No nodelist defined or our node is not in the nodelist'
2016-02-18 22:30:12 Artem Panchenko description
    Old value:

    After cluster reboot Corosync doesn't start on controllers (except the one which was booted first) if 'balance-rr' bonds are configured for management network:

    root@node-2:~# crm_mon -1
    Last updated: Thu Feb 18 09:17:06 2016
    Last change: Wed Feb 17 14:21:54 2016
    Stack: corosync
    Current DC: node-2.test.domain.local (2) - partition WITHOUT quorum
    Version: 1.1.12-561c4cf
    3 Nodes configured
    46 Resources configured
    Online: [ node-2.test.domain.local ]
    OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]

    # corosync.log (I enabled debug):
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync notice [TOTEM ] timer_function_netif_check_timeout The network interface is down.
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [TOTEM ] main_iface_change_fn Created or loaded sequence id 0.127.0.0.1 for this ring.
    ...
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [VOTEQ ] votequorum_read_nodelist_configuration No nodelist defined or our node is not in the nodelist
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync crit [QUORUM] quorum_exec_init_fn Quorum provider: corosync_votequorum failed to initialize.
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [SERV ] corosync_service_defaults_link_and_init Service engine 'corosync_quorum' failed to load for reason ' nfigured!'
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [MAIN ] _corosync_exit_error Corosync Cluster Engine exiting with status 20 at service.c:356.

    But after boot I can SSH to controllers and start Corosync manually without any errors.

    Steps to reproduce:
    1. Prepare virtual machines (http://paste.openstack.org/show/487375/)
    2. Create new environment, choose Neutron + VXLAN
    3. Add 3 controller and 2 compute+ceph nodes
    4. Configure network bonds for all nodes using 'balance-rr' mode (add enp0s5, enp0s6, enp0s7, enp0s8 to bond0 and assign all networks to it except "Admin (pxe)")
    5. Verify networks
    6. Deploy the environment
    7. Verify networks
    8. Run OSTF tests
    9. Save network settings from slave nodes. Reboot all environment slave nodes. Verify that network settings are the same after reboot.
    10. Verify networks
    11. Run OSTF tests

    Expected result: after reboot cloud works fine
    Actual result: after reboot cloud services are down

    Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkTU5kMDlMRnBsQ3M/view?usp=sharing

    According to the logs Corosync doesn't find node IP in 'nodelist' from corosync.conf. I guess it checks all ready interfaces while starting and compares IP addresses with those from config. I modified Corosync init script and added 'strace', here is a part of its output: http://paste.openstack.org/show/487378/

    As you can see there is no 'br-mgmt' in the list. If I just add 'sleep 2' into the init script before starting Corosync daemon, then everything works fine. Also this issue doesn't affect environments with 'active-backup' or 'LACP' bonds from management net.

    New value:

    After cluster reboot Corosync doesn't start on controllers (except the one which was booted first) if network bond is configured for management network and it's tagged (VLAN):

    root@node-2:~# crm_mon -1
    Last updated: Thu Feb 18 09:17:06 2016
    Last change: Wed Feb 17 14:21:54 2016
    Stack: corosync
    Current DC: node-2.test.domain.local (2) - partition WITHOUT quorum
    Version: 1.1.12-561c4cf
    3 Nodes configured
    46 Resources configured
    Online: [ node-2.test.domain.local ]
    OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]

    # corosync.log (I enabled debug):
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync notice [TOTEM ] timer_function_netif_check_timeout The network interface is down.
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [TOTEM ] main_iface_change_fn Created or loaded sequence id 0.127.0.0.1 for this ring.
    ...
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [VOTEQ ] votequorum_read_nodelist_configuration No nodelist defined or our node is not in the nodelist
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync crit [QUORUM] quorum_exec_init_fn Quorum provider: corosync_votequorum failed to initialize.
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [SERV ] corosync_service_defaults_link_and_init Service engine 'corosync_quorum' failed to load for reason ' nfigured!'
    Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [MAIN ] _corosync_exit_error Corosync Cluster Engine exiting with status 20 at service.c:356.

    But after boot I can SSH to controllers and start Corosync manually without any errors.

    Steps to reproduce:
    1. Prepare virtual machines (http://paste.openstack.org/show/487375/)
    2. Create new environment, choose Neutron + VXLAN
    3. Add 3 controller and 2 compute+ceph nodes
    4. Configure network bonds for all nodes (add enp0s5, enp0s6, enp0s7, enp0s8 to bond0 and assign all networks to it except "Admin (pxe)"). Management network should be VLAN tagged.
    5. Verify networks
    6. Deploy the environment
    7. Verify networks
    8. Run OSTF tests
    9. Save network settings from slave nodes. Reboot all environment slave nodes. Verify that network settings are the same after reboot.
    10. Verify networks
    11. Run OSTF tests

    Expected result: after reboot cloud works fine
    Actual result: after reboot cloud services are down

    Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkTU5kMDlMRnBsQ3M/view?usp=sharing

    According to the logs Corosync doesn't find node IP in 'nodelist' from corosync.conf. I guess it checks all ready interfaces while starting and compares IP addresses with those from config. I modified Corosync init script and added 'strace', here is a part of its output: http://paste.openstack.org/show/487378/

    As you can see there is no 'br-mgmt' in the list. If I just add 'sleep 2' into the init script before starting Corosync daemon, then everything works fine.
2016-02-24 10:10:29 Matthew Mosesohn tags area-library feature-bonding l23network area-library feature-bonding l23network team-bugfix
2016-02-25 11:34:32 Olga Gusarenko tags area-library feature-bonding l23network team-bugfix area-library feature-bonding l23network release-notes team-bugfix
2016-02-26 17:53:29 Olga Gusarenko tags area-library feature-bonding l23network release-notes team-bugfix 8.0 area-library feature-bonding l23network release-notes-done team-bugfix
2016-03-02 10:12:37 Matthew Mosesohn fuel: status Confirmed Fix Committed
2016-03-10 17:43:34 Artem Panchenko fuel/8.0.x: status Confirmed Fix Committed
2016-03-11 09:41:25 Artem Panchenko fuel: status Fix Committed Fix Released
2016-03-11 09:41:29 Artem Panchenko fuel/8.0.x: status Fix Committed Fix Released
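
Illustrative note on the workaround in the description: the reporter found that adding 'sleep 2' to the init script before starting Corosync avoids the failure, which suggests a race between interface setup and daemon start. A fixed sleep can be replaced by polling until the node's nodelist address is actually configured. The sketch below is only an illustration of that idea, not the fix that was committed for this bug. The function names, the timeout values, and the example address 192.0.2.4 are hypothetical; the only external command assumed is iproute2's `ip -o -4 addr show`.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def parse_ipv4_addresses(ip_output: str) -> List[str]:
    """Extract IPv4 addresses from `ip -o -4 addr show` style output."""
    addresses = []
    for line in ip_output.splitlines():
        fields = line.split()
        if "inet" in fields:
            idx = fields.index("inet")
            if idx + 1 < len(fields):
                # The field after "inet" looks like "192.0.2.4/24".
                addresses.append(fields[idx + 1].split("/")[0])
    return addresses


def local_ipv4_addresses() -> List[str]:
    """List addresses currently configured on this host (assumes iproute2)."""
    output = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_ipv4_addresses(output)


def wait_for_address(
    expected_ip: str,
    list_addresses: Callable[[], Iterable[str]] = local_ipv4_addresses,
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until expected_ip is configured locally; False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if expected_ip in set(list_addresses()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

An init script could call such a helper with the node's own nodelist address (for example `wait_for_address("192.0.2.4")`) and start Corosync only once it returns True. Unlike a fixed 'sleep 2', this does not depend on guessing how long the bond and 'br-mgmt' take to come up.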