Ok, we made some progress. I'm updating this with the latest findings.

All of the failed jobs have one thing in common: the ovsdbapp error while trying to add the gateway port to the OVS bridge [1]. However, this behavior only shows up on master, so here is what happened before we bumped ovsdbapp to 0.8.0: with the previous ovsdbapp version (which is also the one in pike), this error was 'hidden' by ovsdbapp and the failed transaction was simply queued. The l3-agent went ahead and processed the router normally after installing all the iptables rules (most importantly the rule that redirects traffic to 169.254.169.254 to haproxy). Eventually ovsdbapp reconnects to ovsdb-server and the gateway port gets added to br-int. By that point the VM may already have booted and fetched its metadata properly (the gateway is not involved in serving metadata).

With ovsdbapp 0.8.0, when the router gateway is added and this error occurs, the transaction is not queued; an exception is thrown instead. This exception is handled by the neutron-l3-agent, which schedules the router for a resync [2], and the gateway port is eventually added after that resync. At this point the router namespace should exist and the iptables rules should be in place as well, since they are created on the AFTER_UPDATE notification [3].

In this last scenario, it looks like when the VM boots and requests metadata from 169.254.169.254, the requests arrive at the router namespace but somehow they are not redirected to haproxy (possibly iptables mangle/nat rules missing due to the resyncs? see the sketch below the references) and are instead routed through the default gateway. When this job runs on RDO Cloud, metadata requests going out through the default gateway eventually hit the metadata server of the underlying cloud (RDO Cloud), which answers them. This is why the console output reports that the VM successfully fetched the instance-id [4], but that instance-id is i-000ca05e when it should be i-00000001, since it is the first instance booted on the hypervisor. The reason we get such a high instance id is that it is the actual id of the undercloud within RDO Cloud, because it is their metadata server that is answering the requests.

We have tagged ovsdbapp 0.9.0 to include this fix [5] and bumped upper-constraints in requirements. We hope CI will be green soon with this, but we still need to work on a proper fix to prevent metadata requests from leaving the router namespace.

[1] https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/08ed7fb/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-29_02_44_09_758
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L557
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/driver.py#L289
[4] 2017-11-29 02:50:10 | checking http://169.254.169.254/2009-04-04/instance-id
    2017-11-29 02:50:10 | successful after 1/20 tries: up 5.34. iid=i-000ca05e
    https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/08ed7fb/undercloud/home/jenkins/tempest_output.log.txt.gz
[5] https://review.openstack.org/#/c/524181/
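
For reference, the redirection we expect to find inside the qrouter namespace is, roughly, a NAT PREROUTING rule like the one below. This is a sketch of the default rule as I remember it, not copied from this environment; treat the chain name, the qr-+ interface match and the 9697 proxy port as assumptions (they may differ depending on configuration and release):

    -A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697

If this rule (or its mangle/filter companions) is missing after the resync, requests to 169.254.169.254 would simply follow the default route, which matches what we see in the console log.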
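
A quick way to check this on a failed run would be something along these lines (a minimal sketch, not part of any tooling; the qrouter-<router-id> namespace name is a placeholder that has to be replaced with the real one, it must run as root on the node hosting the router, and the 169.254.169.254 / 9697 match strings are based on the defaults assumed above):

    import subprocess

    ROUTER_NS = "qrouter-<router-id>"  # placeholder: replace with the actual router namespace

    def dump_metadata_rules(table):
        # Dump the given iptables table inside the router namespace and print
        # every rule that mentions the metadata IP or the proxy port.
        cmd = ["ip", "netns", "exec", ROUTER_NS, "iptables-save", "-t", table]
        output = subprocess.check_output(cmd).decode()
        for line in output.splitlines():
            if "169.254.169.254" in line or "9697" in line:
                print("[%s] %s" % (table, line))

    for table in ("mangle", "nat", "filter"):
        dump_metadata_rules(table)

Running this right after the resync on a failed job (and comparing against a passing one) should tell us whether the redirect rules are really missing or whether the traffic is leaving the namespace for some other reason.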