Comment 30 for bug 1253896

Salvatore Orlando (salvatore-orlando) wrote :

We have identified several causes for the SSH timeout, but one particular failure mode has always puzzled me.
It can be observed here for instance: http://logs.openstack.org/58/63558/2/check/check-tempest-dsvm-neutron-pg/1bd26df

What happens is that everything appears to be perfectly connected.
The VM port is up, the dnsmasq instance is running, and the entries in its hosts file are properly populated.
Also, the router connections and the floating IP setup are done properly.

However, no DHCPDISCOVER is seen from the VM (although a DHCPRELEASE is sent when the VM is destroyed, confirming that the hosts file was correctly configured).

It seems that I now have a hint about the root cause. It just seems too easy to be true.
Basically, the problem happens when the network is created immediately after the service and the agent start up. What happens is that the DHCP port is wired to the DEAD VLAN instead of being tagged with the appropriate VLAN tag.

The agent fails to retrieve the port details from the plugin, because the ML2 plugin failed the binding.
Looking at the neutron server logs, it seems that the binding fails simply because there are no agents. The other failure modes are (i) no active agent and (ii) an error while adding the segment, but according to the logs neither of these two conditions is occurring.

Indeed, the neutron server log (excerpt: http://paste.openstack.org/show/55918/) reveals that the first state report from the OVS agent is received 128 milliseconds after the create_dhcp_port message from the DHCP agent.

How to solve this problem?
One way would be to discuss what to do with unbound ports. There might be a sort of automatic rescheduling which gets triggered once agents become available; this is doable, but perhaps it is something that deserves more discussion.

If a port fails to bind, it will stay in DOWN status, which is a consistent representation of the fact that the port does not work. Since this is a DHCP port, the whole network is affected. Tempest should not even attempt to boot the VM in this case; this would make identifying the issue much easier.
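As a sketch of that precondition, tempest could poll the DHCP port's status and fail fast if it never becomes ACTIVE. The helper below is a hypothetical illustration, not existing tempest code: the function name, timeout, and poll interval are my assumptions, and `get_port_status` would typically be a closure around the Neutron client's show_port call.

```python
import time

def wait_for_port_active(get_port_status, timeout=60, interval=2):
    """Poll a Neutron port's status until it becomes ACTIVE.

    get_port_status: callable returning the port's current status string
    (e.g. a closure around the Neutron client's show_port call).
    Raises RuntimeError if the port enters ERROR or the timeout expires.
    """
    status = None
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_port_status()
        if status == 'ACTIVE':
            return status
        if status == 'ERROR':
            raise RuntimeError('port entered ERROR state')
        time.sleep(interval)
    raise RuntimeError('port did not become ACTIVE (last status: %s)'
                       % status)
```

Called on the network's DHCP port before booting the VM, this would turn the late SSH timeout into an early, clearly attributable failure.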

This is a problem which might occur at startup, but also when agents fail to report their state under load.
We already have at least two developers actively working on this. Gate-wise, I would say we should adopt the following mitigation measures:
1 - Delay the tempest test start until the relevant agents are reported as active. If they do not become active, that is a valid reason for failing tempest (or devstack), in my opinion.
2 - Increase the timeout interval for reporting agents dead. This will decrease the chance that agents are declared dead when higher load on the gate causes delays in processing state reports.
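For measure 1, a minimal sketch of the waiting logic (hypothetical: the helper name and the idea of keying on each agent's 'binary' field are my assumptions about how one would consume the Neutron agents API):

```python
import time

def wait_for_agents_alive(list_agents, required_binaries,
                          timeout=120, interval=5):
    """Block until at least one alive agent exists for each required binary,
    e.g. required_binaries = {'neutron-openvswitch-agent',
    'neutron-dhcp-agent'}.

    list_agents: callable returning agent dicts shaped like the Neutron
    agents API response, each with at least 'binary' and 'alive' keys.
    Raises RuntimeError if some agents never report in before the timeout.
    """
    missing = set(required_binaries)
    deadline = time.time() + timeout
    while time.time() < deadline:
        alive = {a['binary'] for a in list_agents() if a.get('alive')}
        missing = set(required_binaries) - alive
        if not missing:
            return
        time.sleep(interval)
    raise RuntimeError('agents never became alive: %s' % sorted(missing))
```

For measure 2, the relevant knobs should be the server-side agent_down_time in neutron.conf and the agents' report_interval (the former must comfortably exceed the latter), though the exact option names are worth double-checking against the deployed release.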