All OVB jobs fail to get http://169.254.169.254/openstack/2015-10-15/meta_data.json

Bug #1790127 reported by Sorin Sbarnea
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Kieran Forde

Bug Description

Found this error as part of the gate checks on https://review.openstack.org/#/c/598095/

++(/opt/stack/new/tripleo-ci/toci_gate_test.sh:128): curl http://169.254.169.254/openstack/2015-10-15/meta_data.json
2018-08-31 07:54:47.882406 | tripleo-ovb-centos-7 | ++(/opt/stack/new/tripleo-ci/toci_gate_test.sh:128): python -c 'import json, sys; print json.load(sys.stdin)["uuid"]'
2018-08-31 07:54:47.886784 | tripleo-ovb-centos-7 | % Total % Received % Xferd Average Speed Time Time Time Current
2018-08-31 07:54:47.887342 | tripleo-ovb-centos-7 | Dload Upload Total Spent Left Speed
2018-08-31 07:56:55.075474 | tripleo-ovb-centos-7 |
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:02:07 --:--:--     0
curl: (7) Failed connect to 169.254.169.254:80; Connection timed out
2018-08-31 07:56:55.077123 | tripleo-ovb-centos-7 | Traceback (most recent call last):
2018-08-31 07:56:55.077285 | tripleo-ovb-centos-7 | File "<string>", line 1, in <module>
2018-08-31 07:56:55.077373 | tripleo-ovb-centos-7 | File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
2018-08-31 07:56:55.078468 | tripleo-ovb-centos-7 | **kw)
2018-08-31 07:56:55.078637 | tripleo-ovb-centos-7 | File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
2018-08-31 07:56:55.078700 | tripleo-ovb-centos-7 | return _default_decoder.decode(s)
2018-08-31 07:56:55.078790 | tripleo-ovb-centos-7 | File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
2018-08-31 07:56:55.079544 | tripleo-ovb-centos-7 | obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2018-08-31 07:56:55.079666 | tripleo-ovb-centos-7 | File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
2018-08-31 07:56:55.079776 | tripleo-ovb-centos-7 | raise ValueError("No JSON object could be decoded")
2018-08-31 07:56:55.079838 | tripleo-ovb-centos-7 | ValueError: No JSON object could be decoded
2018-08-31 07:56:55.083051 | tripleo-ovb-centos-7 | +(/opt/stack/new/tripleo-ci/toci_gate_test.sh:128): UCINSTANCEID=
2018-08-31 07:56:55.083897 | tripleo-ovb-centos-7 | ERROR: the main setup script run by this job failed - exit code: 1
2018-08-31 07:56:55.084021 | tripleo-ovb-centos-7 | please look at the relevant log files to determine the root cause
2018-08-31 07:56:55.084075 | tripleo-ovb-centos-7 | Running devstack worlddump.py
2018-08-31 07:56:55.169476 | tripleo-ovb-centos-7 | /bin/sh: brctl: command not found
2018-08-31 07:56:56.253905 | tripleo-ovb-centos-7 | Cleaning up host
2018-08-31 07:56:56.254157 | tripleo-ovb-centos-7 | ... this takes 3 - 4 minutes (logs at logs/devstack-gate-cleanup-host.txt.gz)
2018-08-31 07:57:15.135075 | tripleo-ovb-centos-7 | [WARNING]: Could not match supplied host pattern, ignoring: subnodes
2018-08-31 07:57:15.135343 | tripleo-ovb-centos-7 | [WARNING]: No hosts matched, nothing to do
2018-08-31 07:57:17.814971 | tripleo-ovb-centos-7 | Done.
2018-08-31 07:57:19.218804 | tripleo-ovb-centos-7 | *** FAILED with status: 1

See http://logs.rdoproject.org/47/597547/2/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/71355b6/job-output.txt.gz
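
Judging from the log, the failing step in toci_gate_test.sh boils down to fetching the undercloud instance UUID from the metadata service. A rough reconstruction (not the exact script contents) of what line 128 does:

  # Fetch the instance UUID from the Nova metadata service. When curl
  # times out it prints nothing, json.load() then sees empty input, and
  # the ValueError traceback above is the result.
  UCINSTANCEID=$(curl http://169.254.169.254/openstack/2015-10-15/meta_data.json \
      | python -c 'import json, sys; print json.load(sys.stdin)["uuid"]')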

Sorin Sbarnea (ssbarnea)
Changed in tripleo:
importance: Undecided → Critical
status: New → Triaged
tags: added: promotion-blocker
summary: - curl fails to get
+ All OVB jobs fail to get
  http://169.254.169.254/openstack/2015-10-15/meta_data.json
tags: added: ci
Revision history for this message
Gabriele Cerami (gcerami) wrote :

It seems that something is wrong with the neutron configuration in the tenant. Instances, for example, get the 169.254.169.254 range in their routing table, which is wrong. Kieran is currently working on the issue.
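
A quick way to check from inside an affected instance (a sketch; the exact route entries depend on the neutron setup):

  # Inside the instance: look for a 169.254.169.254 entry in the
  # routing table and try the metadata service directly.
  ip route | grep 169.254
  curl -v --max-time 10 http://169.254.169.254/openstack/2015-10-15/meta_data.json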

Changed in tripleo:
assignee: nobody → Kieran Forde (kieran-forde)
status: Triaged → In Progress
Revision history for this message
Kieran Forde (kieran-forde) wrote :

I've managed to find the root cause of this issue.
Basically, since the reboot yesterday *some* users have been experiencing failures to reach the metadata service from their instances. I tracked this down to a missing PREROUTING rule in the 'qrouter' namespace for the user's router. Without this rule an instance will never reach the metadata service.
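
For reference, this can be checked on the network node along these lines (a sketch; the router UUID is a placeholder):

  # List the NAT PREROUTING rules inside the router's namespace.
  sudo ip netns exec qrouter-<router-uuid> iptables -t nat -S PREROUTING
  # On a healthy router there is a redirect of metadata traffic to the
  # metadata proxy, something like:
  #   -A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ \
  #       -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697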

There are several fixes/patches for this [1] which help to populate the iptables rules earlier.
That patch only landed in Queens, and I've requested that it be backported to at least Pike.

The delay in fixing this was due to the random nature of the failure, which meant our monitoring didn't pick it up. Add to that a keepalived misconfiguration that had to be fixed first, and the fact that only two lines were missing from the iptables rules, and it was hard to spot.

Initial tests show the metadata service is reachable again, and gcerami is now testing more thoroughly.

[1] https://review.openstack.org/#/c/524406/

Revision history for this message
Gabriele Cerami (gcerami) wrote :

I've been monitoring some OVB jobs and they have made it past the critical point; the curl command can now fetch the metadata.
We can close this now.

Changed in tripleo:
status: In Progress → Fix Released