Comment 2 for bug 1714660

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Several observations:

0. it's a legacy l3 setup with a single controller carrying all floating ips. For l2 connectivity, openvswitch is used.

1. the test case boots a server, assigns a floating ip, waits until it becomes ACTIVE, and then connects to the instance over ssh using the fip. So the data plane involves both openvswitch agents (compute and controller) as well as the l3 agent on the controller node.

2. service logs don't reveal any relevant errors around the time when the floating ip is configured and transitions to ACTIVE (I checked the ovs agents + the l3 agent). Sadly, the logs don't have debug enabled, so a lot of vital information is missing, like whether the l3 agent executed arping at the right moment, or whether it configured the floating IP address on the qg (gateway) device.

3. when the test case fails, it usually logs the console output of the failed instance via the _log_console_output method. Sadly, tempest in the job is not configured to use this feature, so all we get is "Console output not supported, cannot log". We should probably enable the feature by setting the CONF.compute_feature_enabled.console_output option in tempest.conf. This would help reveal any issues with the instance boot.

4. the intent of the test case is to validate that carefully crafted frames whose size corresponds to the MTU of the network can pass through the l3 layer without being fragmented. I think we could enhance the test to first try a simple connectivity check that does not depend on the MTU before going on to the more advanced MTU-limited check. If the former check passes, we would at least know that the instance is not dead (e.g. because of a kernel crash) and that the issue is in the MTU-specific check. This wouldn't solve the issue but would give us some clue about what happens there.
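
To make the idea in item 4 concrete, here is a rough sketch of such a two-stage check. It is not the actual test code: the function names are made up for illustration, it assumes IPv4 without IP options, and it assumes a Linux iputils ping where "-M do" sets the DF bit. The real test may run its checks from a different vantage point.

```python
import subprocess

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_cmd(ip, payload=None, dont_fragment=False):
    """Build a Linux iputils ping command line (hypothetical helper)."""
    cmd = ["ping", "-c", "1", "-W", "5"]
    if payload is not None:
        cmd += ["-s", str(payload)]
    if dont_fragment:
        cmd += ["-M", "do"]  # set the DF bit so the packet cannot be fragmented
    return cmd + [ip]


def run(cmd):
    """Return True if the command exits with status 0."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def diagnose(ip, mtu, runner=run):
    """Check basic reachability first, then the MTU-sized DF ping."""
    if not runner(ping_cmd(ip)):
        return "instance unreachable"
    if not runner(ping_cmd(ip, icmp_payload_for_mtu(mtu), dont_fragment=True)):
        return "mtu check failed"
    return "ok"
```

With this split, an "instance unreachable" result points at the instance or the basic data plane, while "mtu check failed" narrows the problem down to the MTU-specific path.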

To recap, I don't think we have enough logs to meaningfully reason about the failure. To get us to a happier place, we would need to 1) enable debug logs for neutron services; 2) enable console_output in tempest; 3) extend the test to first sanity check simple connectivity. Note that none of this will fix anything by itself, but it will at least give us a better chance to spot the root cause.
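
For steps 1) and 2), the settings involved are "debug" in the [DEFAULT] section of the neutron service configs and "console_output" in the [compute-feature-enabled] section of tempest.conf. A minimal sketch of flipping them with Python's configparser follows. The set_option helper is hypothetical, and the job most likely sets these through its own deployment tooling instead. Note that configparser drops comments when it rewrites a file and does not handle options repeated within a section.

```python
import configparser


def set_option(path, section, option, value):
    """Set one option in an INI-style file (hypothetical helper).

    Creates the section if it is missing. Comments in the file are lost
    on rewrite, so this is only suitable for throwaway test environments.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read(path)         # a missing file is silently treated as empty
    if section != parser.default_section and not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, option, value)
    with open(path, "w") as conf_file:
        parser.write(conf_file)
```

Assuming the usual file locations (which may differ in the job), usage would look like set_option("/etc/neutron/neutron.conf", "DEFAULT", "debug", "True") and set_option("/etc/tempest/tempest.conf", "compute-feature-enabled", "console_output", "True").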