IP segments lost when restarting ovs-agent with openvswitch firewall
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Invalid | High | Unassigned |
Bug Description
environment:
linux version: Linux controller.
OpenStack version: Rocky
network type: vxlan or vlan
firewall driver: openvswitch
1. Create two VMs (vm1, vm2) on different compute nodes (node-1, node-2) in the same network, with a security group that allows all TCP traffic.
2. Log in to vm2 and create a large file, for example:
vm2# dd if=/dev/zero of=/mnt/test.img bs=1G count=5
3. Log in to vm1 and scp the large file from vm2 to vm1. As soon as the scp transfer starts, go to step 4.
vm1# scp vm2-ip:
4. Log in to node-2 and restart neutron-
node-2# systemctl restart neutron-
5. Log in to vm1. After several seconds, the scp transfer is reported as stalled.
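To make the "stalled" observation in step 5 less dependent on watching the scp progress line, the size of the destination file on vm1 can be sampled at intervals. The following is only a minimal sketch: the function names, the sampling interval, and the window size are assumptions of mine, not part of the original report. A file that does not exist yet, or a transfer that has already finished, also looks "stalled" to this check.

```python
import os
import time


def is_stalled(sizes, window=3):
    """Return True if the last `window` size samples show no growth."""
    if len(sizes) < window:
        return False
    recent = sizes[-window:]
    return max(recent) == min(recent)


def watch_file(path, interval=2.0, window=3, max_samples=30):
    """Sample the size of `path` and report whether it stopped growing.

    `path` is a hypothetical destination path for the scp copy on vm1.
    A missing file is treated as size 0.
    """
    sizes = []
    for _ in range(max_samples):
        sizes.append(os.path.getsize(path) if os.path.exists(path) else 0)
        if is_stalled(sizes, window):
            return True
        time.sleep(interval)
    return False
```

Calling `watch_file` on the partially copied file right after the agent restart in step 4 would return `True` once the size stays unchanged for `window` consecutive samples.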
After some investigation, I found that the OpenFlow refresh causes IP segments to be lost. When this happened, I captured packets with "tcpdump -i tap-xxx -w tmp.pcap", and in Wireshark I saw these errors:
192.168.100.19 192.168.100.5 SSH 16478 Server: [TCP ACKed unseen segment] [TCP Previous segment not captured] , Encrypted packet (len=16412)
192.168.100.19 192.168.100.5 SSH 8302 Server: [TCP ACKed unseen segment] , Encrypted packet (len=8236)
192.168.100.5 192.168.100.19 TCP 66 [TCP ACKed unseen segment] [TCP Previous segment not captured] 54354 → 22 [ACK] Seq=2509 Ack=600733 Win=16522 Len=0 TSval=2847412 TSecr=2851031
192.168.100.19 192.168.100.5 SSH 1464 Server: [TCP Spurious Retransmission] , Encrypted packet (len=1398)
192.168.100.5 192.168.100.19 TCP 78 [TCP Dup ACK 25182#1] 54354 → 22 [ACK] Seq=326305 Ack=67089901 Win=18494 Len=0 TSval=2849742 TSecr=2853310 SLE=67073429 SRE=67074827
192.168.100.5 192.168.100.19 TCP 110 [TCP Retransmission] 54354 → 22 [PSH, ACK] Seq=326173 Ack=67089901 Win=18494 Len=44 TSval=2849742 TSecr=2853310
192.168.100.19 192.168.100.5 TCP 1464 [TCP Retransmission] 22 → 54354 [ACK] Seq=70971905 Ack=346105 Win=2016 Len=1398 TSval=2853361 TSecr=2849691
192.168.100.19 192.168.100.5 TCP 1464 [TCP Retransmission] 22 → 54354 [ACK] Seq=70971905 Ack=346105 Win=2016 Len=1398 TSval=2853463 TSecr=2849691
192.168.100.19 192.168.100.5 TCP 1464 [TCP Retransmission] 22 → 54354 [ACK] Seq=70971905 Ack=346105 Win=2016 Len=1398 TSval=2854076 TSecr=2849691
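To compare captures from several restart attempts, the Wireshark analysis flags in summary lines like the ones above can be tallied. This is a rough sketch that assumes the packet list was exported as plain text in the same layout as shown here; the regular expression and the function name are my own, not something Wireshark provides.

```python
import re
from collections import Counter

# Matches Wireshark TCP analysis flags such as "[TCP Retransmission]" or
# "[TCP Dup ACK 25182#1]". The "frame#count" suffix on Dup ACK is dropped
# so that all duplicate ACKs land in one bucket. Plain TCP flag lists such
# as "[ACK]" or "[PSH, ACK]" do not start with "TCP " and are ignored.
FLAG_RE = re.compile(r"\[(TCP [A-Za-z ]+?)(?: \d+#\d+)?\]")


def count_tcp_flags(lines):
    """Count each Wireshark TCP analysis flag across the given lines."""
    counts = Counter()
    for line in lines:
        for flag in FLAG_RE.findall(line):
            counts[flag] += 1
    return counts


# Four of the lines from the capture above, copied verbatim.
SAMPLE = [
    "192.168.100.19 192.168.100.5 SSH 16478 Server: [TCP ACKed unseen segment] "
    "[TCP Previous segment not captured] , Encrypted packet (len=16412)",
    "192.168.100.19 192.168.100.5 SSH 1464 Server: [TCP Spurious Retransmission] , "
    "Encrypted packet (len=1398)",
    "192.168.100.5 192.168.100.19 TCP 78 [TCP Dup ACK 25182#1] 54354 → 22 [ACK] "
    "Seq=326305 Ack=67089901 Win=18494 Len=0 TSval=2849742 TSecr=2853310 "
    "SLE=67073429 SRE=67074827",
    "192.168.100.19 192.168.100.5 TCP 1464 [TCP Retransmission] 22 → 54354 [ACK] "
    "Seq=70971905 Ack=346105 Win=2016 Len=1398 TSval=2853361 TSecr=2849691",
]

if __name__ == "__main__":
    for flag, n in sorted(count_tcp_flags(SAMPLE).items()):
        print(f"{n:4d}  {flag}")
```

A capture in which the retransmission count keeps growing while the sequence number stays the same, as in the last three lines above, is consistent with the sender never getting an ACK for that segment.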
I also checked the state of this TCP connection on both compute nodes; it is still ESTABLISHED.
# conntrack -L | grep 192.168.100.5
tcp 6 299 ESTABLISHED src=192.168.100.5 dst=192.168.100.19 sport=54356 dport=22 src=192.168.100.19 dst=192.168.100.5 sport=22 dport=54354 [ASSURED] mark=0 zone=4 use=1
# conntrack -L | grep 192.168.100.5
tcp 6 287 ESTABLISHED src=192.168.100.5 dst=192.168.100.19 sport=54356 dport=22 src=192.168.100.19 dst=192.168.100.5 sport=22 dport=54354 [ASSURED] mark=0 zone=1 use=1
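For a side-by-side comparison of the two nodes, the conntrack lines can be split into fields. The parser below is a sketch that assumes TCP entries in exactly the layout pasted above (protocol name, protocol number, timeout, state, then key=value pairs); entries without a state column, such as UDP, are not handled. The function and key names are my own.

```python
def parse_conntrack_line(line):
    """Parse one TCP line of `conntrack -L` output into a dict.

    The first src/dst/sport/dport group is stored under "orig" (original
    direction) and the second under "reply" (reply direction).
    """
    tokens = line.split()
    entry = {
        "proto": tokens[0],
        "timeout": int(tokens[2]),
        "state": tokens[3],
        "orig": {},
        "reply": {},
        "flags": [],
    }
    for tok in tokens[4:]:
        if tok.startswith("[") and tok.endswith("]"):
            entry["flags"].append(tok[1:-1])  # e.g. ASSURED
        elif "=" in tok:
            key, value = tok.split("=", 1)
            if key in ("src", "dst", "sport", "dport"):
                side = "orig" if key not in entry["orig"] else "reply"
                entry[side][key] = value
            else:
                entry[key] = value  # mark, zone, use
    return entry


# The two entries from the report, copied verbatim.
LINES = [
    "tcp 6 299 ESTABLISHED src=192.168.100.5 dst=192.168.100.19 sport=54356 dport=22 "
    "src=192.168.100.19 dst=192.168.100.5 sport=22 dport=54354 [ASSURED] mark=0 zone=4 use=1",
    "tcp 6 287 ESTABLISHED src=192.168.100.5 dst=192.168.100.19 sport=54356 dport=22 "
    "src=192.168.100.19 dst=192.168.100.5 sport=22 dport=54354 [ASSURED] mark=0 zone=1 use=1",
]

if __name__ == "__main__":
    for entry in map(parse_conntrack_line, LINES):
        print(entry["state"], "zone", entry["zone"], "timeout", entry["timeout"])
```

Parsed this way, the two entries differ only in the timeout and in the conntrack zone (4 and 1).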
I have no idea why refreshing the OpenFlow rules causes IP segments to be lost; I hope someone has a way to solve this problem.
This problem can also be reproduced on a single compute node; the VMs do not need to be on different compute nodes.
If it does not reproduce the first time, repeat step 4; after several attempts it will reproduce.