sheepdog

Dual nic environment: non I/O nic disconnection causes problems

Bug #1263073 reported by sirio81 on 2013-12-20

This bug affects 1 person

Affects   Status  Importance  Assigned to  Milestone
sheepdog  New     Undecided   Unassigned

Bug Description

Scenario:
A cluster with 4 nodes.

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 53 67807424
   1 192.168.10.5:7000 104 84584640
   2 192.168.10.6:7000 49 101361856
   3 192.168.10.7:7000 50 118139072

ZooKeeper is running on node ids 0, 1, and 2.
Each node has two NICs: eth0 (used by ZooKeeper) and eth1 (the I/O NIC).
Note: eth0 is bridged (br0).

How to reproduce:

Unplug the eth0 cable on one node (or run ifdown br0).
E.g. unplug the eth0 cable of node id 3.

Symptom:

On node id 0,1,2

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 50 67807424
   1 192.168.10.5:7000 97 84584640
   2 192.168.10.6:7000 45 101361856

On node id 3 (sheep is still alive)

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 53 67807424
   1 192.168.10.5:7000 104 84584640
   2 192.168.10.6:7000 49 101361856
   3 192.168.10.7:7000 50 118139072
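
The disagreement between the two listings can be checked mechanically. The following is a minimal sketch, assuming the `dog node list` output has been captured as text on each node; the function names and the `views` dictionary are hypothetical helpers, not part of sheepdog.

```python
import re

# One data row of `dog node list`: Id, Host:Port, V-Nodes, Zone.
# The header row does not match because its first field is not numeric.
ROW = re.compile(r"^\s*(\d+)\s+(\S+:\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_node_list(text):
    """Return the set of Host:Port entries found in `dog node list` output."""
    members = set()
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            members.add(match.group(2))
    return members


def missing_members(views):
    """For each observer, list the members seen by any observer but not by it."""
    everyone = set().union(*views.values())
    return {name: sorted(everyone - seen) for name, seen in views.items()}


# Output copied from the report: the view on node 0 and the view on node 3.
VIEW_NODE_0 = """\
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 50 67807424
   1 192.168.10.5:7000 97 84584640
   2 192.168.10.6:7000 45 101361856
"""
VIEW_NODE_3 = """\
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 53 67807424
   1 192.168.10.5:7000 104 84584640
   2 192.168.10.6:7000 49 101361856
   3 192.168.10.7:7000 50 118139072
"""

views = {
    "node 0": parse_node_list(VIEW_NODE_0),
    "node 3": parse_node_list(VIEW_NODE_3),
}

if __name__ == "__main__":
    for name, missing in missing_members(views).items():
        print(name, "does not list:", missing)
```

A node whose listing contains members that the rest of the cluster has dropped (here node 3, which still lists itself) is the one holding a stale view.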

If a guest is running on node id 3 and it makes a write request, the disconnected node keeps printing this message:

tail -5 /var/sheep/sheep.log
Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3096] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3110] sheep_exec_req(1008) failed Request has an old epoch
Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch

The other nodes keep printing this message:

tail -5 /var/sheep/sheep.log
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)
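
To quantify how repetitive the flood is, the log lines can be grouped by message. This is a rough sketch that assumes every line follows the layout seen in the two excerpts (timestamp, level, thread in brackets, function with line number, message); the regular expression and the `summarize` helper are illustrative and are not taken from sheepdog.

```python
import re
from collections import Counter

# Layout assumed from the excerpts:
#   "Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_req(1008) failed Request ..."
LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s+(?P<level>\w+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<func>\w+)\((?P<lineno>\d+)\)\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count log lines per (level, function, message), ignoring time and thread."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match:
            key = (match.group("level"), match.group("func"), match.group("msg"))
            counts[key] += 1
    return counts


# Three lines copied from the report.
SAMPLE = [
    "Dec 19 15:59:48 ERROR [gway 3111] sheep_exec_req(1008) failed Request has an old epoch",
    "Dec 19 15:59:48 ERROR [gway 3113] sheep_exec_req(1008) failed Request has an old epoch",
    "Dec 19 15:59:41 ERROR [main] check_request_epoch(151) old node version 2, 1 (READ_PEER)",
]

if __name__ == "__main__":
    for key, count in summarize(SAMPLE).most_common():
        print(count, key)
```

On a live node the same function could be fed an open file, for example `summarize(open("/var/sheep/sheep.log"))`, using the log path shown above.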

- sheep process CPU usage rises to 50-80% on all nodes.

- In less than 2 minutes sheep.log grows by 10-48 MB, and the growth never stops.

Expected behavior:

Once the node exits the cluster, any write request on the disconnected node shouldn't affect the other nodes.

Summary

all 4 nodes are online;
remove the cable from eth0 on one node;
after 30 seconds, the node leaves the cluster;
recovery begins and ends;
up to this point nothing strange happens;
when the guest on the disconnected node tries to write something, the node can still reach the other nodes through the I/O NIC (eth1), and sheep starts flooding sheep.log, with the symptoms described above.
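
The sequence above can be illustrated with a toy model. This is not sheepdog code: the `Node` class, the `gateway_write` function, and the retry behaviour are all assumptions inferred from the two log excerpts. In particular, the `2, 1` in `old node version 2, 1` is read here as the peers being at epoch 2 while the incoming request carries epoch 1.

```python
class Node:
    """Hypothetical stand-in for a sheep daemon that tracks a cluster epoch."""

    def __init__(self, name, epoch):
        self.name = name
        self.epoch = epoch
        self.log = []

    def handle(self, request_epoch):
        """Peer side: reject a request stamped with an older epoch."""
        if request_epoch < self.epoch:
            self.log.append(f"old node version {self.epoch}, {request_epoch}")
            return "OLD_EPOCH"
        return "OK"


def gateway_write(sender, peer, can_learn_new_epoch, max_attempts=5):
    """Sender side: forward a write to a peer and retry while it is rejected.

    Returns the attempt number that succeeded, or None if every attempt failed.
    The bound on attempts exists only to keep the toy finite; the report
    describes logging that never stops.
    """
    for attempt in range(1, max_attempts + 1):
        if peer.handle(sender.epoch) == "OK":
            return attempt
        sender.log.append("failed Request has an old epoch")
        if can_learn_new_epoch:
            # Assumed: a membership update reaches the sender (eth0 still up).
            sender.epoch = peer.epoch
    return None


if __name__ == "__main__":
    # eth0 unplugged: node 3 stays at epoch 1, but eth1 still carries requests.
    node3, node0 = Node("node 3", 1), Node("node 0", 2)
    print(gateway_write(node3, node0, can_learn_new_epoch=False))
    print(len(node3.log), len(node0.log))
```

In this model every rejected attempt produces one log line on each side, which matches the paired messages in the excerpts. Whether the real gateway retries in this way is not confirmed by the report.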
