Dual nic environment: non I/O nic disconnection causes problems

Bug #1263073 reported by sirio81
This bug affects 1 person
Affects: sheepdog
Importance: Undecided
Assigned to: Unassigned

Bug Description

Scenario:
a cluster with 4 nodes.

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 53 67807424
   1 192.168.10.5:7000 104 84584640
   2 192.168.10.6:7000 49 101361856
   3 192.168.10.7:7000 50 118139072

ZooKeeper is running on nodes id 0, 1 and 2.
Each node has two NICs: eth0 (used by ZooKeeper) and eth1 (the I/O NIC).
Note: eth0 is bridged (br0).

How to reproduce:

Unplug the cable from eth0 on one node (or run ifdown br0).
E.g. unplug the eth0 cable of node id 3.

Symptom:

On nodes id 0, 1 and 2:

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 50 67807424
   1 192.168.10.5:7000 97 84584640
   2 192.168.10.6:7000 45 101361856

On node id 3 (sheep is still alive)

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 53 67807424
   1 192.168.10.5:7000 104 84584640
   2 192.168.10.6:7000 49 101361856
   3 192.168.10.7:7000 50 118139072
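
To make the split in membership views easy to spot, the two outputs above can be compared programmatically. This is only a minimal sketch: the helper names (parse_node_list, diverging_members) are made up for illustration, and it assumes the four-column layout of `dog node list` shown in this report.

```python
def parse_node_list(text):
    """Return the set of Host:Port entries found in `dog node list` output.

    Assumes the four-column layout shown in this report:
    Id, Host:Port, V-Nodes, Zone.
    """
    members = set()
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric id; this skips the header and blank lines.
        if len(fields) == 4 and fields[0].isdigit():
            members.add(fields[1])
    return members


def diverging_members(view_a, view_b):
    """Return the nodes that appear in exactly one of the two views."""
    return parse_node_list(view_a) ^ parse_node_list(view_b)


# Output copied from this report: view of nodes 0-2 after the cable is pulled.
VIEW_CONNECTED = """\
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 50 67807424
   1 192.168.10.5:7000 97 84584640
   2 192.168.10.6:7000 45 101361856
"""

# Output copied from this report: view of node 3, which still lists itself.
VIEW_DISCONNECTED = """\
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 53 67807424
   1 192.168.10.5:7000 104 84584640
   2 192.168.10.6:7000 49 101361856
   3 192.168.10.7:7000 50 118139072
"""

if __name__ == "__main__":
    print(sorted(diverging_members(VIEW_CONNECTED, VIEW_DISCONNECTED)))
```

With the outputs from this report, the only diverging member is 192.168.10.7:7000, i.e. node id 3 still believes it is part of the cluster.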

If a guest is running on node id 3 and it makes a write request:

On the disconnected node, sheep keeps printing this message:

tail -5 /var/sheep/sheep.log
Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3096] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3110] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch

On the other nodes, sheep keeps printing this message:

tail -5 /var/sheep/sheep.log
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)

- CPU usage of the sheep process rises to 50-80% on all nodes.

- In less than 2 minutes sheep.log grows 10-48 MB, and it never stops growing.
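
My guess at what is going on, as a toy model only: the log lines suggest that the connected nodes are at epoch 2 while node 3 still sends requests with epoch 1, that each such request is rejected, and that the gateway retries without ever learning the new epoch (it cannot reach ZooKeeper over eth0). The sketch below is not sheepdog code. The class, the function names and the retry behaviour are assumptions made for illustration, and the retry count is capped so the example terminates.

```python
class Node:
    """Toy stand-in for a sheep daemon; only tracks an epoch and a log."""

    def __init__(self, name, epoch):
        self.name = name
        self.epoch = epoch
        self.log = []

    def handle_peer_request(self, req_epoch):
        # Assumed behaviour: a request carrying an older epoch is rejected
        # and logged, mirroring the "old node version 2, 1" lines above.
        if req_epoch < self.epoch:
            self.log.append(f"old node version {self.epoch}, {req_epoch}")
            return "OLD_EPOCH"
        return "OK"


def gateway_write(sender, peer, max_retries, refresh_epoch=None):
    """Send a write from sender to peer, retrying on an epoch mismatch.

    refresh_epoch is a hypothetical callback that fetches the current
    cluster epoch; None models a node cut off from the cluster driver.
    Returns (attempts_made, succeeded).
    """
    for attempt in range(1, max_retries + 1):
        if peer.handle_peer_request(sender.epoch) == "OK":
            return attempt, True
        sender.log.append("failed Request has an old epoch")
        if refresh_epoch is not None:
            sender.epoch = refresh_epoch()
    return max_retries, False


if __name__ == "__main__":
    peer = Node("node0", epoch=2)

    # Node 3 is cut off from ZooKeeper: every retry fails and both sides log.
    stale = Node("node3", epoch=1)
    print(gateway_write(stale, peer, max_retries=5), len(peer.log), len(stale.log))

    # If the sender could refresh its epoch, the second attempt would succeed.
    fresh = Node("node3", epoch=1)
    print(gateway_write(fresh, peer, max_retries=5, refresh_epoch=lambda: peer.epoch))
```

In this model each failed attempt produces one log line on each side, which would be consistent with the log growth and CPU usage seen on all nodes if the real retry loop has no cap or back-off.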

Expected behavior:

Once the node exits the cluster, any write request on the disconnected node shouldn't affect the other nodes.

Summary

all 4 nodes are online;
the eth0 cable is removed from one node;
after 30 seconds, that node leaves the cluster;
recovery begins and ends;
up to this point nothing strange happens;
when the guest on the disconnected node tries to write something, sheep on that node is still able to communicate with the other nodes through the I/O NIC (eth1), and sheep starts writing the errors above to sheep.log, etc.
