Comment 10 for bug 1323658

Salvatore Orlando (salvatore-orlando) wrote : Re: SSH EOFError - Public network connectivity check failed

The tempest revert https://review.openstack.org/#/c/97245/ has pretty much nullified the impact on the gate.

This is good for the gate, but it does not help us nail down the root cause.
"Some sort of timing issue" is the closest thing we have at the moment.

Here's a summary of the analysis so far:
- the elastic recheck query has a fingerprint that can be matched only by neutron jobs. SSH failures have also been observed in jobs running nova-network, but it's not clear whether the failure mode is the same.
- Failures occur only on start/stop and resize tests. The other tests in the network advanced server ops scenario seem to always pass.
- The failure has been observed in upstream CI, ODL CI, and VMware CI - with exactly the same failure mode. This probably rules out any issue in neutron's agents. (VMware CI does not even run the L3 agent)
- syslog shows that the VM gets an IP even after the reboot, yet it is not reachable through ssh.
  L2 and L3 logs for the same interval do not report any changes to security groups, nat rules, or router interfaces.
- the ssh timeout occurs because of "connection refused" (111) rather than "no route to host" (113). This could be because:
  - instance booted but ssh service is disabled (waiting for console log output on tests)
  - ssh traffic being rejected at the host (iptables drop counters suggest this is not the case)
  - floating ip acting as a responder (nat rules for the floating IP are in place, so this should not be the case)
- no errors are seen in kernel logs.
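
To make the "connection refused" vs "no route to host" distinction above concrete, here is a minimal sketch of a probe that separates the two cases. It is not what tempest does; the function names are made up for illustration, and the numeric values 111 and 113 are the Linux values of ECONNREFUSED and EHOSTUNREACH.

```python
import errno
import socket


def classify_errno(err):
    """Map a connect() errno to a coarse failure class (hypothetical helper)."""
    if err == errno.ECONNREFUSED:
        # 111 on Linux: something answered and actively rejected the
        # connection, e.g. nothing is listening on the port yet.
        return "refused"
    if err == errno.EHOSTUNREACH:
        # 113 on Linux: no route to the host.
        return "unreachable"
    return "other"


def probe_ssh(host, port=22, timeout=5.0):
    """Attempt a plain TCP connection to the ssh port and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all within the timeout; this is what silently
        # dropped packets would typically look like.
        return "timeout"
    except OSError as exc:
        return classify_errno(exc.errno)
```

Calling `probe_ssh(<floating ip>)` in a loop right after the reboot and logging the result together with a timestamp would show whether the failure is "refused" throughout or changes class over time, which might help narrow down the timing issue.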