Dual nic environment: non-I/O nic disconnection causes problems
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
sheepdog | New | Undecided | Unassigned |
Bug Description
Scenario:
a cluster with 4 nodes.
dog node list
Id Host:Port V-Nodes Zone
0 192.168.10.4:7000 53 67807424
1 192.168.10.5:7000 104 84584640
2 192.168.10.6:7000 49 101361856
3 192.168.10.7:7000 50 118139072
Zookeeper is running on node id 0,1,2.
Each node has dual nic: eth0 (used by zookeeper), eth1 (I/O nic).
Note: eth0 is bridged (br0).
How to reproduce:
unplug the cable of an eth0 (or ifdown br0).
E.g. unplug cable of node id 3.
Symptom:
On node id 0,1,2
dog node list
Id Host:Port V-Nodes Zone
0 192.168.10.4:7000 50 67807424
1 192.168.10.5:7000 97 84584640
2 192.168.10.6:7000 45 101361856
On node id 3 (sheep is still alive)
dog node list
Id Host:Port V-Nodes Zone
0 192.168.10.4:7000 53 67807424
1 192.168.10.5:7000 104 84584640
2 192.168.10.6:7000 49 101361856
3 192.168.10.7:7000 50 118139072
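The mismatch between the two views can be checked mechanically. The following is a minimal sketch, not part of sheepdog: it assumes the `dog node list` output has been captured as text on each side, and the helper name `parse_node_list` is hypothetical. It reports the entries that node id 3 still lists but the other nodes have dropped.

```python
def parse_node_list(text):
    """Return the set of Host:Port entries found in `dog node list` output."""
    hosts = set()
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric Id; the header row starts with "Id".
        if len(fields) >= 4 and fields[0].isdigit():
            hosts.add(fields[1])
    return hosts


# Output as seen on node id 0,1,2 (copied from this report).
MAJORITY_VIEW = """\
Id Host:Port V-Nodes Zone
0 192.168.10.4:7000 50 67807424
1 192.168.10.5:7000 97 84584640
2 192.168.10.6:7000 45 101361856
"""

# Output as seen on node id 3 (copied from this report).
NODE3_VIEW = """\
Id Host:Port V-Nodes Zone
0 192.168.10.4:7000 53 67807424
1 192.168.10.5:7000 104 84584640
2 192.168.10.6:7000 49 101361856
3 192.168.10.7:7000 50 118139072
"""

# Entries present only in the disconnected node's view.
stale = parse_node_list(NODE3_VIEW) - parse_node_list(MAJORITY_VIEW)
print(sorted(stale))  # ['192.168.10.7:7000']
```

Node id 3 still considers itself a member, while the rest of the cluster has already removed it.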
If a guest is running on node id 3 and it makes a write request,
the disconnected node keeps printing this message
tail -5 /var/sheep/
Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_
Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_
Dec 19 15:59:48 ERROR [gway 3096] sheep_exec_
Dec 19 15:59:48 ERROR [gway 3110] sheep_exec_
Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_
The other nodes keep printing this message
tail -5 /var/sheep/
Dec 19 15:59:41 ERROR [main] check_request_
Dec 19 15:59:41 ERROR [main] check_request_
Dec 19 15:59:41 ERROR [main] check_request_
Dec 19 15:59:41 ERROR [main] check_request_
Dec 19 15:59:41 ERROR [main] check_request_
- sheep process cpu usage rises to 50-80% on all nodes.
- In less than 2 minutes sheep.log grows by 10-48M, and it never stops growing.
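To put the log growth in perspective, here is a rough estimate. It assumes that "M" means megabytes and that the growth took the full 2 minutes; since the report says "less than 2 minutes", the real rate is at least this high. The function name is illustrative only.

```python
def mb_per_hour(size_mb, seconds):
    """Extrapolate an observed log growth to an hourly rate in MB."""
    return size_mb * 3600 / seconds


WINDOW_S = 120  # assumed upper bound of the observation window, in seconds

low = mb_per_hour(10, WINDOW_S)   # smallest growth reported
high = mb_per_hour(48, WINDOW_S)  # largest growth reported
print(f"{low:.0f}-{high:.0f} MB/hour")  # 300-1440 MB/hour
```

At that rate each node writes roughly 0.3 to 1.4 GB of log per hour until the guest stops issuing writes.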
Expected behavior:
Once the node exits the cluster, any write request on the disconnected node shouldn't affect the other nodes.
Summary
all 4 nodes on line;
remove cable from eth0 from a node;
after 30 seconds, the node leaves the cluster;
recovery begins and ends;
up to this point nothing strange happens;
when the guest on the disconnected node tries to write something, it can still reach the other nodes over the I/O nic (eth1), and sheep starts writing to sheep.log etc...