CI: compute node networking unresponsive after os-net-config run

Bug #1555749 reported by James Slagle on 2016-03-10
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: Unassigned

Bug Description

Was investigating one of the jobs that was going to time out:
http://logs.openstack.org/36/291136/2/check-tripleo/gate-tripleo-ci-f22-ceph/617cc22//console.html

The compute node had deployed, but was not picking up or applying any SoftwareDeployments.

I also could not ssh into the node's IP (192.0.2.9), nor ping it.

Eventually the job timed out.

I saved the VM's disk for investigation afterwards.

It looks like nic5 was not mapped appropriately to eth4 by os-net-config, and then os-net-config hit a traceback trying to add the nonexistent interface "nic5" to the bridge:

Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: Traceback (most recent call last):
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/bin/os-net-config", line 10, in <module>
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: sys.exit(main())
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 185, in main
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: provider.add_object(obj)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 52, in add_object
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: self.add_bridge(obj)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 260, in add_bridge
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: data = self._add_common(bridge)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 112, in _add_common
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: mac = utils.interface_mac(base_opt.primary_interface_name)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/utils.py", line 46, in interface_mac
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: with open('/sys/class/net/%s/address' % name, 'r') as f:
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: IOError: [Errno 2] No such file or directory: '/sys/class/net/nic5/address'
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: + RETVAL=1
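
To illustrate the failure mode in the traceback above, here is a minimal Python sketch of alias mapping followed by a MAC lookup. It is not os-net-config's actual implementation: the function names map_nic_aliases and resolve_interface, the natural-sort ordering, and the sysfs_root parameter are assumptions made for illustration. It only shows how an alias such as "nic5" can fall through unmapped when fewer than five interfaces are seen, and how the sysfs lookup could fail with a clearer message than a bare IOError. Whether that is what happened on this node is not confirmed by the report.

```python
import os
import re


def _natural_key(name):
    # Sort "eth2" before "eth10" by comparing digit runs as integers.
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", name)]


def map_nic_aliases(active_interfaces):
    """Map 'nic1', 'nic2', ... to real interface names (hypothetical ordering)."""
    ordered = sorted(active_interfaces, key=_natural_key)
    return {"nic%d" % (i + 1): name for i, name in enumerate(ordered)}


def resolve_interface(name, mapping):
    """Return the mapped name, or the input unchanged if no mapping exists."""
    return mapping.get(name, name)


def interface_mac(name, sysfs_root="/sys/class/net"):
    """Read an interface's MAC address, raising a descriptive error if absent."""
    path = os.path.join(sysfs_root, name, "address")
    if not os.path.exists(path):
        raise ValueError(
            "interface %r does not exist (no %s); "
            "was a nicN alias left unmapped?" % (name, path)
        )
    with open(path, "r") as f:
        return f.read().strip()
```

With only four interfaces visible, resolve_interface("nic5", mapping) returns "nic5" unchanged, and the later lookup under /sys/class/net/nic5 fails, which matches the shape of the error in the log.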

os-net-config then reverted to the fallback mode triggered via os-refresh-config/configure.d/20-os-net-config, and that actually resulted in the node no longer being able to reach its gateway, 192.0.2.1, at all.

You can see this in the os-collect-config output, where it fails to collect any metadata from this point on.
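
When going through a saved journal for this kind of failure, a small script can pull out which interface names were missing. The following is a rough sketch only: the function name missing_interfaces is hypothetical, and the regular expression assumes the error line has exactly the form shown in the traceback in this report.

```python
import re

# Matches the IOError line logged when a sysfs address file is missing.
MISSING_IFACE = re.compile(
    r"IOError: \[Errno 2\] No such file or directory: "
    r"'/sys/class/net/(?P<iface>[^/']+)/address'"
)


def missing_interfaces(journal_lines):
    """Return missing interface names in first-seen order, without duplicates."""
    seen = []
    for line in journal_lines:
        match = MISSING_IFACE.search(line)
        if match and match.group("iface") not in seen:
            seen.append(match.group("iface"))
    return seen
```

Feeding it the lines of the journal excerpt quoted in this report would yield a list containing only "nic5".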

James Slagle (james-slagle) wrote :

the full system journal from the failed compute node

it shows the os-net-config output from os-collect-config

James Slagle (james-slagle) wrote :

the journal also shows something else that needs fixing: eth1 tried DHCP on initial bootup, triggered by the start of the network service. This took 5 minutes to time out, and delayed the boot by 5 minutes.

If this is happening on all nodes, it's costing us that time in all the CI Jobs.

Dan Prince (dan-prince) wrote :

One possible culprit, which we could revert, is this patch that landed on March 2nd (last week):

https://review.openstack.org/#/c/291322/

Changed in tripleo:
importance: Undecided → High
status: New → Triaged
Ben Nemec (bnemec) wrote :

Marking fixed because I don't think this is happening anymore. The five minute delay has also been fixed.

Changed in tripleo:
status: Triaged → Fix Released