Comment 25 for bug 1253896

Salvatore Orlando (salvatore-orlando) wrote :

I am not able to identify any specific patch which might have made things so much worse.
As an interesting data point, the large majority of the failures occurs with the non-isolated job.

I have not yet even been able to root-cause the failure. I looked at 3 distinct traces, and in every case:
- the VM VIF is wired before the timeout
- the DHCP is active and distributes an address before the timeout expires (the DHCP ACK for the VIF's MAC can be seen in syslog)
- the internal router port is always configured on the l3 agent and wired on the ovs agent before the timeout
- the floating IP is configured on the l3 agent before the timeout
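The checks above can be scripted against the gathered logs. The sketch below is illustrative only, not the exact procedure used on these runs: the dnsmasq DHCPACK line format matches what dnsmasq normally writes to syslog, but the file path, MAC, and helper name are assumptions.

```python
import re

def dhcp_ack_seen(syslog_text, vif_mac):
    """Return True if syslog shows a dnsmasq DHCPACK for the given VIF MAC.

    dnsmasq logs lines of the form:
      dnsmasq-dhcp[123]: DHCPACK(tap1234) 10.0.0.3 fa:16:3e:00:00:01
    (hypothetical sample; real timestamps/hostnames precede it)
    """
    pattern = re.compile(r'DHCPACK\(\S+\)\s+\S+\s+' + re.escape(vif_mac),
                         re.IGNORECASE)
    return any(pattern.search(line) for line in syslog_text.splitlines())

# Hypothetical usage against a collected syslog:
# with open('/var/log/syslog') as f:
#     print(dhcp_ack_seen(f.read(), 'fa:16:3e:00:00:01'))
```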

Security group issues would cause 100% of jobs to fail (and besides, since we're not using the hybrid driver, it seems they're not even enforced).

The most likely cause of failure, though I cannot confirm it, lies in this case in the L3 agent, and in particular in the floating IP handling. If the machine gets an IP from its DHCP server, L2 connectivity is there and the correct IP is assigned. The router port is also correctly configured.
Even though I analysed only 3 failures, I noticed that test_network_basic_ops never failed, while in all 3 cases the three scenario tests which create a floating IP via nova failed. I do not know whether this is a good trail to follow, but the failure pattern is definitely peculiar.

In the logs, the address which ends up being unreachable is assigned to other qg-xxx interfaces before the failure. The same address appearing on multiple qg interfaces on br-ex could be a problem, even though those interfaces are in different namespaces, since they are all on the same network.
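One quick way to check the duplicate-address theory is to collect `ip addr` output from each qrouter namespace and look for the same IPv4 address on more than one qg- interface. This is a minimal sketch, assuming iproute2's usual output layout; the namespace names and addresses in the test data are fabricated.

```python
import re
from collections import defaultdict

def find_duplicate_qg_addrs(ns_to_ipaddr_output):
    """Map each IPv4 address carried by a qg-* interface to the
    (namespace, interface) pairs it appears on; return only the
    addresses seen in more than one place.

    ns_to_ipaddr_output: dict of namespace name -> text of
    `ip netns exec <ns> ip addr` for that namespace.
    """
    seen = defaultdict(list)
    for ns, text in ns_to_ipaddr_output.items():
        iface = None
        for line in text.splitlines():
            # Interface header, e.g. "2: qg-aaaa1111: <BROADCAST,UP> ..."
            header = re.match(r'\s*\d+:\s+([^:@]+)', line)
            if header:
                iface = header.group(1).strip()
                continue
            addr = re.search(r'inet (\d+\.\d+\.\d+\.\d+)', line)
            if addr and iface and iface.startswith('qg-'):
                seen[addr.group(1)].append((ns, iface))
    return {a: where for a, where in seen.items() if len(where) > 1}
```

Running it over the logs from a failed job would show whether the unreachable floating IP really coexisted on several qg interfaces at the moment of the failure, or whether the earlier assignments had already been cleaned up.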

I also checked devstack, devstack-gate, and tempest for relevant changes, but no luck there either.