CI: compute node networking unresponsive after os-net-config run

Bug #1555749 reported by James Slagle
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   High         Unassigned

Bug Description

Was investigating one of the jobs that was going to time out:
http://logs.openstack.org/36/291136/2/check-tripleo/gate-tripleo-ci-f22-ceph/617cc22//console.html

The compute node had deployed, but was not picking up or applying any SoftwareDeployments.

I could also not ssh into the node's IP, nor ping it: 192.0.2.9

Eventually the job timed out.

I saved the VM's disk for investigation afterwards.

It looks like nic5 was not mapped appropriately to eth4 by os-net-config, and os-net-config then failed with a traceback while trying to add the nonexistent interface "nic5" to the bridge:

Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: Traceback (most recent call last):
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/bin/os-net-config", line 10, in <module>
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: sys.exit(main())
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 185, in main
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: provider.add_object(obj)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 52, in add_object
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: self.add_bridge(obj)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 260, in add_bridge
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: data = self._add_common(bridge)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 112, in _add_common
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: mac = utils.interface_mac(base_opt.primary_interface_name)
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: File "/usr/lib/python2.7/site-packages/os_net_config/utils.py", line 46, in interface_mac
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: with open('/sys/class/net/%s/address' % name, 'r') as f:
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: IOError: [Errno 2] No such file or directory: '/sys/class/net/nic5/address'
Mar 10 15:03:50 overcloud-novacompute-0.localdomain os-collect-config[2280]: + RETVAL=1
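
To illustrate the failure mode in the traceback above, here is a minimal sketch of how an alias such as "nic5" might be resolved to a real interface before its MAC address is read. This is not os-net-config's actual code: the helpers map_nic_aliases and resolve_interface, and the rule of numbering aliases over the naturally sorted interface names, are assumptions made for illustration. Only the /sys/class/net/<name>/address path comes from the traceback.

```python
import os
import re


def natural_key(name):
    """Sort key so that eth2 orders before eth10."""
    match = re.match(r"(\D*)(\d*)", name)
    prefix, digits = match.group(1), match.group(2)
    return (prefix, int(digits) if digits else -1)


def map_nic_aliases(real_interfaces):
    """Hypothetical mapping: nic1, nic2, ... over the sorted real names."""
    ordered = sorted(real_interfaces, key=natural_key)
    return {"nic%d" % (i + 1): name for i, name in enumerate(ordered)}


def resolve_interface(name, mapping, sys_net="/sys/class/net"):
    """Return the real interface name, or fail with a clear message."""
    real = mapping.get(name, name)  # an unmapped alias falls through unchanged
    if not os.path.exists(os.path.join(sys_net, real, "address")):
        raise ValueError(
            "interface %r (resolved to %r) does not exist under %s"
            % (name, real, sys_net)
        )
    return real


def interface_mac(name, mapping, sys_net="/sys/class/net"):
    """Read the MAC address of an interface, resolving nicN aliases first."""
    real = resolve_interface(name, mapping, sys_net)
    with open(os.path.join(sys_net, real, "address"), "r") as f:
        return f.read().strip()
```

Under these assumptions, a node that exposes only four interfaces at mapping time has no entry for "nic5", so the lookup fails on the literal name "nic5", which is the same path that appears in the IOError above.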

os-net-config then reverted to the fallback mode triggered via os-refresh-config/configure.d/20-os-net-config, which left the node unable to reach its gateway, 192.0.2.1, at all.

You can see this in the os-collect-config output, where it fails to collect any metadata from this point on.

James Slagle (james-slagle) wrote :

the full system journal from the failed compute node

it shows the os-net-config output from os-collect-config

James Slagle (james-slagle) wrote :

The journal also shows something else that needs fixing: eth1 tried DHCP on initial bootup, triggered by the start of the network service. This took 5 minutes to time out, and delayed the boot by 5 minutes.

If this is happening on all nodes, it's costing us that time in all the CI Jobs.

Dan Prince (dan-prince) wrote :

One possible culprit is this patch, which landed on March 2nd (last week); reverting it may be worth trying:

https://review.openstack.org/#/c/291322/

Changed in tripleo:
importance: Undecided → High
status: New → Triaged
Ben Nemec (bnemec) wrote :

Marking fixed because I don't think this is happening anymore. The five minute delay has also been fixed.

Changed in tripleo:
status: Triaged → Fix Released