2016-02-18 09:40:58 |
Artem Panchenko |
bug |
|
|
added bug |
2016-02-18 09:40:58 |
Artem Panchenko |
attachment added |
|
corosync.log https://bugs.launchpad.net/bugs/1546947/+attachment/4574261/+files/corosync.log |
|
2016-02-18 09:41:15 |
Artem Panchenko |
nominated for series |
|
fuel/8.0.x |
|
2016-02-18 09:41:15 |
Artem Panchenko |
bug task added |
|
fuel/8.0.x |
|
2016-02-18 09:41:22 |
Artem Panchenko |
fuel/8.0.x: milestone |
|
8.0 |
|
2016-02-18 09:41:32 |
Artem Panchenko |
fuel/8.0.x: assignee |
|
Fuel Library Team (fuel-library) |
|
2016-02-18 09:41:35 |
Artem Panchenko |
fuel/8.0.x: importance |
Undecided |
High |
|
2016-02-18 09:41:39 |
Artem Panchenko |
fuel/8.0.x: milestone |
8.0 |
8.0-updates |
|
2016-02-18 09:53:04 |
Nastya Urlapova |
fuel: status |
New |
Confirmed |
|
2016-02-18 09:53:06 |
Nastya Urlapova |
fuel/8.0.x: status |
New |
Confirmed |
|
2016-02-18 10:07:27 |
Matthew Mosesohn |
tags |
area-library |
area-library feature-bonding |
|
2016-02-18 10:10:54 |
Vladimir Kuklin |
fuel: assignee |
Fuel Library Team (fuel-library) |
Vladimir Kuklin (vkuklin) |
|
2016-02-18 10:10:58 |
Vladimir Kuklin |
fuel/8.0.x: assignee |
Fuel Library Team (fuel-library) |
Vladimir Kuklin (vkuklin) |
|
2016-02-18 10:51:23 |
Vladimir Kuklin |
fuel: assignee |
Vladimir Kuklin (vkuklin) |
Dmitry Bilunov (dbilunov) |
|
2016-02-18 10:51:29 |
Vladimir Kuklin |
fuel/8.0.x: assignee |
Vladimir Kuklin (vkuklin) |
Dmitry Bilunov (dbilunov) |
|
2016-02-18 10:59:15 |
Bogdan Dobrelya |
tags |
area-library feature-bonding |
area-library feature-bonding l23network |
|
2016-02-18 22:28:14 |
Artem Panchenko |
summary |
Corosync doesn't start on boot if balance-rr bonding is configured: 'No nodelist defined or our node is not in the nodelist' |
Corosync doesn't start on boot if bonding is configured: 'No nodelist defined or our node is not in the nodelist' |
|
2016-02-18 22:30:12 |
Artem Panchenko |
description |
After a cluster reboot, Corosync doesn't start on the controllers (except the one that booted first) if 'balance-rr' bonds are configured for the management network:
root@node-2:~# crm_mon -1
Last updated: Thu Feb 18 09:17:06 2016
Last change: Wed Feb 17 14:21:54 2016
Stack: corosync
Current DC: node-2.test.domain.local (2) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured
Online: [ node-2.test.domain.local ]
OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]
# corosync.log (I enabled debug):
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync notice [TOTEM ] timer_function_netif_check_timeout The network interface is down.
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [TOTEM ] main_iface_change_fn Created or loaded sequence id 0.127.0.0.1 for this ring.
...
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [VOTEQ ] votequorum_read_nodelist_configuration No nodelist defined or our node is not in the nodelist
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync crit [QUORUM] quorum_exec_init_fn Quorum provider: corosync_votequorum failed to initialize.
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [SERV ] corosync_service_defaults_link_and_init Service engine 'corosync_quorum' failed to load for reason '
nfigured!'
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [MAIN ] _corosync_exit_error Corosync Cluster Engine exiting with status 20 at service.c:356.
But after boot I can SSH to controllers and start Corosync manually without any errors.
Steps to reproduce:
1. Prepare virtual machines (http://paste.openstack.org/show/487375/)
2. Create new environment, choose Neutron + VXLAN
3. Add 3 controller and 2 compute+ceph nodes
4. Configure network bonds for all nodes using 'balance-rr' mode (add enp0s5, enp0s6, enp0s7, enp0s8 to bond0 and assign all networks to it except "Admin (pxe)")
5. Verify networks
6. Deploy the environment
7. Verify networks
8. Run OSTF tests
9. Save network settings from slave nodes. Reboot all environment slave nodes. Verify that network settings are the same after reboot.
10. Verify networks
11. Run OSTF tests
Expected result: after reboot cloud works fine
Actual result: after reboot cloud services are down
Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkTU5kMDlMRnBsQ3M/view?usp=sharing
According to the logs, Corosync doesn't find the node's IP address in the 'nodelist' section of corosync.conf. I guess that at startup it checks all interfaces that are ready and compares their IP addresses with those from the config. I modified the Corosync init script to run the daemon under 'strace'; here is part of its output:
http://paste.openstack.org/show/487378/
As you can see, 'br-mgmt' is not in the list. If I just add 'sleep 2' to the init script before the Corosync daemon is started, everything works fine.
Also, this issue doesn't affect environments with 'active-backup' or 'LACP' bonds on the management network. |
After a cluster reboot, Corosync doesn't start on the controllers (except the one that booted first) if a network bond is configured for the management network and that network is VLAN-tagged:
root@node-2:~# crm_mon -1
Last updated: Thu Feb 18 09:17:06 2016
Last change: Wed Feb 17 14:21:54 2016
Stack: corosync
Current DC: node-2.test.domain.local (2) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
3 Nodes configured
46 Resources configured
Online: [ node-2.test.domain.local ]
OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]
# corosync.log (I enabled debug):
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync notice [TOTEM ] timer_function_netif_check_timeout The network interface is down.
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [TOTEM ] main_iface_change_fn Created or loaded sequence id 0.127.0.0.1 for this ring.
...
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync debug [VOTEQ ] votequorum_read_nodelist_configuration No nodelist defined or our node is not in the nodelist
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync crit [QUORUM] quorum_exec_init_fn Quorum provider: corosync_votequorum failed to initialize.
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [SERV ] corosync_service_defaults_link_and_init Service engine 'corosync_quorum' failed to load for reason '
nfigured!'
Feb 18 09:08:38 [4986] node-3.test.domain.local corosync error [MAIN ] _corosync_exit_error Corosync Cluster Engine exiting with status 20 at service.c:356.
But after boot I can SSH to controllers and start Corosync manually without any errors.
Steps to reproduce:
1. Prepare virtual machines (http://paste.openstack.org/show/487375/)
2. Create new environment, choose Neutron + VXLAN
3. Add 3 controller and 2 compute+ceph nodes
4. Configure network bonds for all nodes (add enp0s5, enp0s6, enp0s7, enp0s8 to bond0 and assign all networks to it except "Admin (pxe)"). The management network should be VLAN-tagged.
5. Verify networks
6. Deploy the environment
7. Verify networks
8. Run OSTF tests
9. Save network settings from slave nodes. Reboot all environment slave nodes. Verify that network settings are the same after reboot.
10. Verify networks
11. Run OSTF tests
Expected result: after reboot cloud works fine
Actual result: after reboot cloud services are down
Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkTU5kMDlMRnBsQ3M/view?usp=sharing
According to the logs, Corosync doesn't find the node's IP address in the 'nodelist' section of corosync.conf. I guess that at startup it checks all interfaces that are ready and compares their IP addresses with those from the config. I modified the Corosync init script to run the daemon under 'strace'; here is part of its output:
http://paste.openstack.org/show/487378/
As you can see, 'br-mgmt' is not in the list. If I just add 'sleep 2' to the init script before the Corosync daemon is started, everything works fine. |
|
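Note on the 'sleep 2' workaround from the description: a fixed delay only works if the management bridge happens to come up within that window. A possibly more robust variant is to poll until one of the addresses listed in the 'nodelist' is actually assigned locally, and only then start the daemon. The sketch below is an illustration under assumptions, not the fix that was committed for this bug: the helper names (nodelist_addresses, local_ipv4_addresses, wait_for_local_node_address) are hypothetical, it assumes the nodelist uses 'ring0_addr' entries with IPv4 addresses, and it assumes the 'ip' utility is available on the node.

```python
import re
import subprocess
import time


def nodelist_addresses(conf_text):
    """Return every ring0_addr value found in corosync.conf text."""
    return re.findall(r"^\s*ring0_addr:\s*(\S+)", conf_text, flags=re.MULTILINE)


def local_ipv4_addresses():
    """Return IPv4 addresses currently assigned to local interfaces.

    Assumption: the 'ip' utility from iproute2 is installed.
    """
    out = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return re.findall(r"\binet\s+(\d+\.\d+\.\d+\.\d+)/", out)


def wait_for_local_node_address(conf_text, get_local_addresses=local_ipv4_addresses,
                                timeout=10.0, interval=0.5, sleep=time.sleep):
    """Poll until a nodelist address is assigned locally.

    Returns the matching address, or None if the timeout expires first.
    """
    wanted = set(nodelist_addresses(conf_text))
    deadline = time.monotonic() + timeout
    while True:
        found = wanted & set(get_local_addresses())
        if found:
            return sorted(found)[0]
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

An init script could call wait_for_local_node_address() with the contents of corosync.conf and start Corosync only when it returns an address; a None result means the management address never appeared within the timeout, which is worth logging rather than starting the daemon anyway.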
2016-02-24 10:10:29 |
Matthew Mosesohn |
tags |
area-library feature-bonding l23network |
area-library feature-bonding l23network team-bugfix |
|
2016-02-25 11:34:32 |
Olga Gusarenko |
tags |
area-library feature-bonding l23network team-bugfix |
area-library feature-bonding l23network release-notes team-bugfix |
|
2016-02-26 17:53:29 |
Olga Gusarenko |
tags |
area-library feature-bonding l23network release-notes team-bugfix |
8.0 area-library feature-bonding l23network release-notes-done team-bugfix |
|
2016-03-02 10:12:37 |
Matthew Mosesohn |
fuel: status |
Confirmed |
Fix Committed |
|
2016-03-10 17:43:34 |
Artem Panchenko |
fuel/8.0.x: status |
Confirmed |
Fix Committed |
|
2016-03-11 09:41:25 |
Artem Panchenko |
fuel: status |
Fix Committed |
Fix Released |
|
2016-03-11 09:41:29 |
Artem Panchenko |
fuel/8.0.x: status |
Fix Committed |
Fix Released |
|