Comment 9 for bug 1818682

I reproduced it again, this time with more information about the timing and the running processes.

The refused request to the metadata agent:

11:10:33.174494 IP (tos 0x0, ttl 64, id 9015, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.5.3.38706 > 169.254.169.254.http: Flags [S], cksum 0x167a (correct), seq 2032783790, win 29200, options [mss 1460,sackOK,TS val 166508934 ecr 0,nop,wscale 7], length 0
11:10:33.174549 IP (tos 0x0, ttl 64, id 25829, offset 0, flags [DF], proto TCP (6), length 40)
    169.254.169.254.http > 192.168.5.3.38706: Flags [R.], cksum 0xb3cc (correct), seq 0, ack 2032783791, win 0, length 0
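
The RST that immediately answers the SYN shows that nothing is listening on 169.254.169.254:80 inside the router namespace at that moment. A capture along these lines should show the same exchange (the exact invocation is an assumption on my part; the namespace name is taken from the monitoring below):

root@network2:~# ip netns exec qrouter-48d5a991-1327-49c9-88d9-e28b6f48c550 tcpdump -v -i any 'host 169.254.169.254 and tcp port 80'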

And my monitoring of the running haproxy instances in the corresponding L3 router namespace:

root@network2:~# while (true); do date +%H:%M:%S.%N; ps afx | grep haproxy | grep 48d5a991-1327-49c9-88d9-e28b6f48c550; sleep 0.5; done
11:10:13.250165525
21719 ? Ss 0:00 haproxy -f /var/lib/neutron/ns-metadata-proxy/48d5a991-1327-49c9-88d9-e28b6f48c550.conf
11:10:13.806491785
21719 ? Ss 0:00 haproxy -f /var/lib/neutron/ns-metadata-proxy/48d5a991-1327-49c9-88d9-e28b6f48c550.conf
11:10:14.364991544
11:10:14.921073081
11:10:15.477189579
11:10:16.033293688
11:10:16.589044946
11:10:17.144677609
11:10:17.700802400
11:10:18.257267466
11:10:18.815026800
11:10:19.371615067
11:10:19.928728339
11:10:20.484549005
11:10:21.043642432
11:10:21.600246727
11:10:22.157465640
11:10:22.712933864
11:10:23.269805384
11:10:23.826738954
11:10:24.383053951
11:10:24.940974289
11:10:25.499760746
11:10:26.057371940
11:10:26.615289561
11:10:27.171769464
11:10:27.728147598
11:10:28.284904266
11:10:28.841412946
11:10:29.398557841
11:10:29.955469233
11:10:30.511146389
11:10:31.067153906
11:10:31.623029038
11:10:32.178155674
11:10:32.733818364
11:10:33.288982881
11:10:33.844176215
11:10:34.399436000
11:10:34.955462896
11:10:35.513774777
11:10:36.069739028
11:10:36.624940323
11:10:37.179853955
11:10:37.735817089
11:10:38.292526172
11:10:38.847372016
11:10:39.402413814
11:10:39.957459099
11:10:40.513074480
11:10:41.068534980
11:10:41.624116351
11:10:42.179488577
11:10:42.734747693
11:10:43.289827243
11:10:43.845511714
11:10:44.400984233
11:10:44.955548219
11:10:45.511576855
11:10:46.068453602
23804 ? S 0:00 \_ sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qrouter-48d5a991-1327-49c9-88d9-e28b6f48c550 haproxy -f /var/lib/neutron/ns-metadata-proxy/48d5a991-1327-49c9-88d9-e28b6f48c550.conf
23805 ? R 0:00 \_ /usr/bin/python /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qrouter-48d5a991-1327-49c9-88d9-e28b6f48c550 haproxy -f /var/lib/neutron/ns-metadata-proxy/48d5a991-1327-49c9-88d9-e28b6f48c550.conf
11:10:46.622401965
23813 ? Ss 0:00 haproxy -f /var/lib/neutron/ns-metadata-proxy/48d5a991-1327-49c9-88d9-e28b6f48c550.conf
11:10:47.179850306
23813 ? Ss 0:00 haproxy -f /var/lib/neutron/ns-metadata-proxy/48d5a991-1327-49c9-88d9-e28b6f48c550.conf

This also clarifies why the count in my previous comment can be higher than 1: these are the maintenance commands for stopping and starting haproxy.

And excerpts from the Nova log:
2019-03-08 11:09:58.134 11477 INFO nova.virt.libvirt.driver [req-001abacf-3ad3-45c1-b18a-62ce2a88bd8f eddde24c14bd2a63eb89db670f79ee6d6e120f7ae29c82e919fafa2f5e7f4641 8d04de0a40cc4419aed0d33a8d5e8eee - 27a851d6a3de43b8a4a4900dfa0c3141 27a851d6a3de43b8a4a4900dfa0c3141] [instance: b331987c-d2b6-46c8-ba7a-fc46cfa7e8d0] Creating image
2019-03-08 11:10:02.992 11477 INFO nova.virt.libvirt.driver [-] [instance: b331987c-d2b6-46c8-ba7a-fc46cfa7e8d0] Instance spawned successfully.
2019-03-08 11:10:02.992 11477 INFO nova.compute.manager [req-001abacf-3ad3-45c1-b18a-62ce2a88bd8f eddde24c14bd2a63eb89db670f79ee6d6e120f7ae29c82e919fafa2f5e7f4641 8d04de0a40cc4419aed0d33a8d5e8eee - 27a851d6a3de43b8a4a4900dfa0c3141 27a851d6a3de43b8a4a4900dfa0c3141] [instance: b331987c-d2b6-46c8-ba7a-fc46cfa7e8d0] Took 4.86 seconds to spawn the instance on the hypervisor.
2019-03-08 11:10:03.113 11477 INFO nova.compute.manager [req-001abacf-3ad3-45c1-b18a-62ce2a88bd8f eddde24c14bd2a63eb89db670f79ee6d6e120f7ae29c82e919fafa2f5e7f4641 8d04de0a40cc4419aed0d33a8d5e8eee - 27a851d6a3de43b8a4a4900dfa0c3141 27a851d6a3de43b8a4a4900dfa0c3141] [instance: b331987c-d2b6-46c8-ba7a-fc46cfa7e8d0] Took 5.66 seconds to build instance.
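
All three log lines belong to instance b331987c-d2b6-46c8-ba7a-fc46cfa7e8d0; on the compute host a grep like the following collects its whole build timeline (the log path is an assumption and may differ per distribution):

grep b331987c-d2b6-46c8-ba7a-fc46cfa7e8d0 /var/log/nova/nova-compute.log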

I checked the clocks again and they are all perfectly synchronized with the controller host.

To summarize our problem:
There is a time frame of more than 30 seconds during which no haproxy for the metadata agent is running in the L3 router network namespace: the old haproxy (PID 21719) is last seen at 11:10:13.8, its replacement (PID 23813) only appears at 11:10:46.6, and the refused request at 11:10:33 falls right into that gap. As a result, roughly 1 out of 10 VMs boots fast enough to reach its cloud-init metadata requests before haproxy has been started again.
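
To pin down the exact length of the gap instead of sampling it every 0.5 seconds, a watcher along these lines could log only the state transitions (the router UUID and the conf path are taken from the monitoring above, the rest is a rough sketch):

ROUTER=48d5a991-1327-49c9-88d9-e28b6f48c550
PREV=unknown
while true; do
    # pgrep -f matches the full haproxy command line used by the metadata proxy
    if pgrep -f "haproxy -f /var/lib/neutron/ns-metadata-proxy/${ROUTER}.conf" >/dev/null; then
        STATE=running
    else
        STATE=gone
    fi
    # only print when the state changes, using the same timestamp format as above
    if [ "$STATE" != "$PREV" ]; then
        echo "$(date +%H:%M:%S.%N) haproxy $STATE"
        PREV=$STATE
    fi
    sleep 0.1
done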