Comment 22 for bug 1289200

Bogdan Dobrelya (bogdando) wrote :

Dmitry, please note that I was able to reproduce this (#21) with the default TCP keepalive (TCPKA) sysctls in the host OS (7200, 75, 9), but it was no longer reproducible with https://review.openstack.org/#/c/94147/.
This is how I tested it:

0) Given
- node-1 192.168.0.3 primary controller hosting the VIP 192.168.0.2
- node-5 192.168.0.7 compute node under test (it should be connected via AMQP to the given node-1 as well)
- the rabbitmq accept rule for the controller is rule #16
- some instances have been spawned on the compute node under test
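
For reference, the default keepalive sysctls mentioned above (7200, 75, 9) can be turned into a rough dead-peer detection time. This is only a sketch: it assumes the three values are tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes in that order, and that the socket has SO_KEEPALIVE enabled. The function name keepalive_detect_seconds is made up for illustration.

```shell
# Rough worst-case seconds before the kernel gives up on an idle, dead peer:
#   tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
# (hypothetical helper, not part of any tool mentioned in this report)
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Defaults quoted in this comment: 7200, 75, 9
keepalive_detect_seconds 7200 75 9   # prints 7875 (about 131 minutes)
```

On a live node the same helper could be fed from "sysctl -n net.ipv4.tcp_keepalive_time" and the two sibling keys instead of the hard-coded defaults.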

*CASE A.*
1) Add iptables rules blocking traffic from node-5 to node-1:5673 (take care of conntrack as well!)
[root@node-1 ~]# iptables -I INPUT 16 -s 192.168.0.7/32 -p tcp --dport 5673 -j DROP
[root@node-1 ~]# iptables -I FORWARD 1 -s 192.168.0.7/32 -p tcp ! --syn -m state --state NEW -j DROP

2.a) Send tasks to spawn some instances and delete some others on the compute node under test
2.b) Watch for messages accumulating in the queue for the compute node under test
[root@node-1 ~]# rabbitmqctl list_queues name messages | grep compute| egrep -v "0|^$"

3.a) Wait for AMQP reconnections from the compute node under test (i.e. just grep its logs for "Reconnecting to AMQP")
3.b) Watch for established connections continuously:
lsof: [root@node-5 ~]# lsof -i -n -P -itcp -iudp | egrep 'nova-comp'
ss: [root@node-5 ~]# ss -untap | grep -E '5673\s+u.*compute'

4) Remove the iptables drop rules (teardown)
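
The teardown in step 4 could look roughly like the sketch below. It assumes the two rules from step 1 are still in place on node-1 and deletes them by rule spec, so the position (16 / 1) is not needed. The run wrapper, the DRY_RUN variable and the function name teardown_block_rules are made up for illustration.

```shell
# Print the command when DRY_RUN=1, otherwise execute it (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the two DROP rules added in step 1 (hypothetical helper).
teardown_block_rules() {
    src="${1:-192.168.0.7/32}"   # compute node under test
    port="${2:-5673}"            # AMQP port used in this report
    # -D with the full rule spec deletes the matching rule wherever it sits.
    run iptables -D INPUT -s "$src" -p tcp --dport "$port" -j DROP
    run iptables -D FORWARD -s "$src" -p tcp ! --syn -m state --state NEW -j DROP
}
```

Run teardown_block_rules as root on node-1, or with DRY_RUN=1 first to see what would be deleted.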

The expected result was:
a) Reconnection attempts from the compute node under test within ~2 min after the traffic was blocked on the controller side with iptables
b) Hung requests for instances (unconsumed messages in the queues) should come back to life once the compute node under test has successfully reconnected to a new AMQP node.

*CASE B.*
The same as CASE A, but instead of adding iptables rules, issue a kill -9 on the beam pid on node-1 that holds an open connection with the compute node under test, e.g. (for the data given in section 0):
[root@node-1 ~]# ss -untap | grep '\.2:5673.*\.6'
tcp ESTAB 0 0 ::ffff:192.168.0.2:5673 ::ffff:192.168.0.6:32790 users:(("beam",28198,35))
...
[root@node-1 ~]# kill -9 28198

So, Dmitry, could you please elaborate:
1) whether your test cases or your results were different;
2) what your expected results were: a) reconnection attempts from the compute node, and/or b) absence of rmq-related operations on the dead sockets in strace output, or c) some other rmq-traffic-related criteria as well?